Microsoft's Copilot Researcher now runs two competing AI models in sequence: OpenAI's GPT generates draft responses, and Anthropic's Claude reviews them for accuracy, completeness, and citation integrity before delivery to users. The company announced the feature March 30.
KEY TAKEAWAYS
The new Critique feature scored 57.4% on the DRACO benchmark, a standardized test developed by Perplexity AI that covers 100 complex research tasks across medicine, law, and technology. Claude Opus 4.6 running alone scored 42.7% on the same test, making the multi-model approach roughly 34% more effective.
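The "34% more effective" figure is the relative improvement implied by the two reported scores, which a quick calculation confirms:

```python
# Relative improvement of the combined system over Claude alone,
# using the DRACO scores reported by Microsoft.
combined = 57.4
claude_alone = 42.7
improvement = (combined - claude_alone) / claude_alone * 100
print(f"{improvement:.1f}%")  # → 34.4%
```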
In a blog post, Microsoft said the feature "separates generation from evaluation," with GPT optimizing for breadth of coverage while Claude checks for factual errors and citation problems that single models typically miss.
GPT searches the web and internal documents to generate an initial draft; Claude then evaluates that draft across three dimensions, factual accuracy, citation quality, and completeness of coverage, before anything is delivered to the user.
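In pseudocode, the pipeline is a simple generate-then-critique loop. This is a hypothetical sketch, not Microsoft's implementation: the function names, the placeholder model calls, and the way reviewer feedback is attached are all illustrative; only the two-role split and the three review dimensions come from the announcement.

```python
# Sketch of a generate-then-critique pipeline, assuming two model calls:
# one drafting model (GPT's role) and one reviewing model (Claude's role).

DIMENSIONS = ["factual accuracy", "citation quality", "completeness of coverage"]

def generate(query: str) -> str:
    """Placeholder for the drafting model's web/document research step."""
    return f"Draft answer to: {query}"

def critique(draft: str, dimension: str) -> list[str]:
    """Placeholder for the reviewing model.
    Returns a list of issues found along one dimension."""
    return []  # an empty list means the draft passes this check

def researcher(query: str) -> str:
    draft = generate(query)
    issues = [i for d in DIMENSIONS for i in critique(draft, d)]
    if issues:
        # In the real system the reviewer's feedback would presumably
        # drive a revision; here it is simply appended for illustration.
        draft += "\nReviewer notes: " + "; ".join(issues)
    return draft
```

The point of the pattern is in the last function: the draft never reaches the user without passing through a reviewer that is a different model from the one that wrote it.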
"This approach reduces hallucinations and improves source attribution," Microsoft told Reuters. "One model generates, another validates. We're not asking a single model to check its own work."
Microsoft announced model choice options back in September 2025, letting users pick between OpenAI and Anthropic for specific tasks. "Critique differs by using both models automatically on every Researcher query without user intervention," the company said in its blog post.
Better quality doesn't fix adoption problems. AI Business Weekly reported that only 3.3% of Microsoft 365's 400 million users pay the $30 per month premium for Copilot access, representing "the lowest paid adoption of any major AI platform relative to its addressable user base."
The research firm called it "genuine usage and trust challenges that limit active adoption among provisioned users," which translates to: companies buy licenses, employees don't use them.
NoJitter identified four barriers holding back adoption. "Confusing naming conventions across Microsoft's product line, uncertain return on investment, compliance concerns about data handling, and limited training on effective prompting techniques," the publication reported. Organizations pay for Copilot licenses but employees don't integrate the tool into daily workflows.
Microsoft bundled Sales, Service, and Finance Copilots into the core offering at no additional cost beginning October 2025, but the 3.3% conversion figure stayed flat.
Copilot Cowork breaks user goals into coordinated multi-step actions across Microsoft 365 apps without requiring explicit prompts for each step. The feature is in early access via Microsoft's Frontier program.
Cowork uses Claude to plan and execute task sequences, so asking it to prepare a presentation based on recent sales data triggers automatic retrieval from Excel, chart generation, slide drafting in PowerPoint, and summary emails via Outlook without additional prompts.
Current Copilot requires explicit user prompts for each step, while Cowork infers the full sequence and executes it.
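The difference can be sketched as a planning function that maps one user goal to an ordered action sequence. Everything here is hypothetical: the app names, step strings, and function signatures are illustrative stand-ins, not Microsoft's API; only the Excel-to-PowerPoint-to-Outlook example flow comes from the description above.

```python
# Sketch of goal-to-plan decomposition as described for Cowork:
# one goal in, a full (app, action) sequence out, no per-step prompting.

def plan(goal: str) -> list[tuple[str, str]]:
    """Stand-in for the planning model: infers the full task sequence."""
    if "presentation" in goal and "sales" in goal:
        return [
            ("Excel", "retrieve recent sales data"),
            ("Excel", "generate charts"),
            ("PowerPoint", "draft slides"),
            ("Outlook", "send summary email"),
        ]
    return [("Copilot", f"ask user to clarify: {goal}")]

def execute(goal: str) -> None:
    # Current Copilot would need one explicit prompt per step;
    # the Cowork model runs the inferred sequence end to end.
    for app, action in plan(goal):
        print(f"{app}: {action}")
```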
Critique is tactically sound engineering: using rival models in sequence hedges against single-model failure modes. GPT hallucinations get caught by Claude's review, while Claude's narrower context window is supplemented by GPT's broader coverage.
Quality improvements don't fix adoption crises at scale. Microsoft has spent billions integrating AI across its product stack, yet the 3.3% conversion rate suggests customers either don't see the value proposition or don't trust the output quality enough to justify spending.
Enterprise procurement budgets run on demonstrable ROI metrics: time saved per employee, error rates reduced, customer satisfaction improved. Microsoft hasn't published those operational improvement numbers at scale, making the $30/month premium a harder sell to finance departments. Benchmark scores matter to AI researchers publishing papers, not to CFOs.
Latency is the other concern: running two models in sequence takes longer than running one. Microsoft didn't disclose response times for Critique versus standard Researcher queries. Knowledge workers evaluating AI tools care about speed as much as accuracy.
Autonomous task planning removes real cognitive overhead. Most AI assistants require users to break complex tasks into discrete prompts, and Cowork eliminates that friction. The trust question: do enterprises already skeptical of AI tool reliability want agents modifying production documents and sending emails without human review at each step?
Microsoft's bet is that higher quality output measured on academic benchmarks plus more autonomous agents will convert the 96.7% of Microsoft 365 users currently not paying for Copilot. Adoption data suggests otherwise: marginal benchmark improvements don't persuade enterprises that need proof the tool changes how work gets done, not proof it scores better on a Perplexity AI test.
TLDR
Microsoft 365 Copilot's Researcher agent now uses two competing AI models in sequence. OpenAI's GPT drafts the response while Anthropic's Claude reviews it for accuracy, completeness, and citation quality. The combined system scored 57.4% on the DRACO research benchmark, higher than either model alone. Microsoft faces an adoption problem: only 3.3% of its 400 million Microsoft 365 users pay the $30/month premium for Copilot.
SOURCES & CITATIONS
- Introducing multi-model intelligence in Researcher, Microsoft staff, Microsoft Community Hub, 30 March 2026
- Microsoft unveils AI upgrades, rolls out Copilot Cowork to early-access customers, Reuters staff, Reuters, 30 March 2026
- Evaluating Deep Research Performance in the Wild with the DRACO Benchmark, Perplexity AI Research, Perplexity AI, March 2026
- Microsoft Copilot Statistics 2026: Users & Adoption, AI Business Weekly staff, AI Business Weekly, March 2026
- 4 obstacles impede paid Microsoft 365 Copilot adoption, NoJitter staff, NoJitter, 2026