Microsoft's Copilot Researcher now runs two competing AI models in sequence: OpenAI's GPT generates draft responses, and Anthropic's Claude reviews them for accuracy, completeness, and citation integrity before delivery to users. The company announced the feature March 30.
KEY TAKEAWAYS
The new Critique feature scored 57.4% on the DRACO benchmark, a standardized test developed by Perplexity AI that covers 100 complex research tasks across medicine, law, and technology. Claude Opus 4.6 running alone scored 42.7% on the same test, making the multi-model approach roughly 34% more effective.
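The "34% more effective" figure is the relative improvement implied by the two reported scores, which a quick calculation confirms:

```python
# Relative improvement of the combined system over Claude alone,
# using the DRACO scores reported by Microsoft.
combined = 57.4
claude_alone = 42.7
improvement = (combined - claude_alone) / claude_alone * 100
print(f"{improvement:.1f}%")  # → 34.4%
```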
In a blog post, Microsoft said the feature "separates generation from evaluation," with GPT optimizing for breadth of coverage while Claude checks for factual errors and citation problems that single models typically miss.
GPT searches the web and internal documents to generate an initial draft; Claude then evaluates that draft across three dimensions, factual accuracy, citation quality, and completeness of coverage, before anything is delivered to the user.
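In pseudocode, the pipeline is a simple generate-then-critique loop. This is a hypothetical sketch, not Microsoft's implementation: the function names, the placeholder model calls, and the way reviewer feedback is attached are all illustrative; only the two-role split and the three review dimensions come from the announcement.

```python
# Sketch of a generate-then-critique pipeline, assuming two model calls:
# one drafting model (GPT's role) and one reviewing model (Claude's role).

DIMENSIONS = ["factual accuracy", "citation quality", "completeness of coverage"]

def generate(query: str) -> str:
    """Placeholder for the drafting model's web/document research step."""
    return f"Draft answer to: {query}"

def critique(draft: str, dimension: str) -> list[str]:
    """Placeholder for the reviewing model.
    Returns a list of issues found along one dimension."""
    return []  # an empty list means the draft passes this check

def researcher(query: str) -> str:
    draft = generate(query)
    issues = [i for d in DIMENSIONS for i in critique(draft, d)]
    if issues:
        # In the real system the reviewer's feedback would presumably
        # drive a revision; here it is simply appended for illustration.
        draft += "\nReviewer notes: " + "; ".join(issues)
    return draft
```

The point of the pattern is in the last function: the draft never reaches the user without passing through a reviewer that is a different model from the one that wrote it.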
"This approach reduces hallucinations and improves source attribution," Microsoft told Reuters. "One model generates, another validates. We're not asking a single model to check its own work."
Microsoft announced model choice options back in September 2025, letting users pick between OpenAI and Anthropic for specific tasks. "Critique differs by using both models automatically on every Researcher query without user intervention," the company said in its blog post.
Better quality doesn't fix adoption problems. AI Business Weekly reported that only 3.3% of Microsoft 365's 400 million users pay the $30 per month premium for Copilot access, representing "the lowest paid adoption of any major AI platform relative to its addressable user base."
The research firm called it "genuine usage and trust challenges that limit active adoption among provisioned users," which translates to: companies buy licenses, employees don't use them.
NoJitter identified four barriers holding back adoption. "Confusing naming conventions across Microsoft's product line, uncertain return on investment, compliance concerns about data handling, and limited training on effective prompting techniques," the publication reported. Organizations pay for Copilot licenses but employees don't integrate the tool into daily workflows.
Microsoft bundled Sales, Service, and Finance Copilots into the core offering at no additional cost beginning October 2025, but the 3.3% conversion figure stayed flat.
Copilot Cowork breaks user goals into coordinated multi-step actions across Microsoft 365 apps without requiring explicit prompts for each step. The feature is in early access via Microsoft's Frontier program.
Cowork uses Claude to plan and execute task sequences, so asking it to prepare a presentation based on recent sales data triggers automatic retrieval from Excel, chart generation, slide drafting in PowerPoint, and summary emails via Outlook without additional prompts.
Current Copilot requires explicit user prompts for each step, while Cowork infers the full sequence and executes it.
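The difference can be sketched as a planning function that maps one user goal to an ordered action sequence. Everything here is hypothetical: the app names, step strings, and function signatures are illustrative stand-ins, not Microsoft's API; only the Excel-to-PowerPoint-to-Outlook example flow comes from the description above.

```python
# Sketch of goal-to-plan decomposition as described for Cowork:
# one goal in, a full (app, action) sequence out, no per-step prompting.

def plan(goal: str) -> list[tuple[str, str]]:
    """Stand-in for the planning model: infers the full task sequence."""
    if "presentation" in goal and "sales" in goal:
        return [
            ("Excel", "retrieve recent sales data"),
            ("Excel", "generate charts"),
            ("PowerPoint", "draft slides"),
            ("Outlook", "send summary email"),
        ]
    return [("Copilot", f"ask user to clarify: {goal}")]

def execute(goal: str) -> None:
    # Current Copilot would need one explicit prompt per step;
    # the Cowork model runs the inferred sequence end to end.
    for app, action in plan(goal):
        print(f"{app}: {action}")
```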
Critique is tactically sound engineering: using rival models in sequence hedges against single-model failure modes. GPT hallucinations get caught by Claude's review, while Claude's narrower context window is supplemented by GPT's broader coverage.
Quality improvements don't fix adoption crises at scale. Microsoft has spent billions integrating AI across its product stack, yet the 3.3% conversion rate suggests customers either don't see the value proposition or don't trust the output quality enough to justify spending.
Enterprise procurement budgets run on demonstrable ROI metrics: time saved per employee, error rates reduced, customer satisfaction improved. Microsoft hasn't published those operational improvement numbers at scale, making the $30/month premium a harder sell to finance departments. Benchmark scores matter to AI researchers publishing papers, not to CFOs.
Latency is the other concern: running two models in sequence takes longer than running one. Microsoft didn't disclose response times for Critique versus standard Researcher queries. Knowledge workers evaluating AI tools care about speed as much as accuracy.
Autonomous task planning removes real cognitive overhead. Most AI assistants require users to break complex tasks into discrete prompts, and Cowork eliminates that friction. The trust question: do enterprises already skeptical of AI tool reliability want agents modifying production documents and sending emails without human review at each step?
Microsoft's bet is that higher quality output measured on academic benchmarks plus more autonomous agents will convert the 96.7% of Microsoft 365 users currently not paying for Copilot. Adoption data suggests otherwise: marginal benchmark improvements don't persuade enterprises that need proof the tool changes how work gets done, not proof it scores better on a Perplexity AI test.
TLDR
Microsoft 365 Copilot's Researcher agent now uses two competing AI models in sequence. OpenAI's GPT drafts the response while Anthropic's Claude reviews it for accuracy, completeness, and citation quality. The combined system scored 57.4% on the DRACO research benchmark, higher than either model alone. Microsoft faces an adoption problem: only 3.3% of its 400 million Microsoft 365 users pay the $30/month premium for Copilot.
SOURCES & CITATIONS
- Introducing multi-model intelligence in Researcher, Microsoft staff, Microsoft Community Hub, 30 March 2026
- Microsoft unveils AI upgrades, rolls out Copilot Cowork to early-access customers, Reuters staff, Reuters, 30 March 2026
- Evaluating Deep Research Performance in the Wild with the DRACO Benchmark, Perplexity AI Research, Perplexity AI, March 2026
- Microsoft Copilot Statistics 2026: Users & Adoption, AI Business Weekly staff, AI Business Weekly, March 2026
- 4 obstacles impede paid Microsoft 365 Copilot adoption, NoJitter staff, NoJitter, 2026