It is 9:22 on a Tuesday morning. I have hit my usage limit on my go-to frontier LLM. I remember that the same model is included in my enterprise productivity suite’s chat application. I open it. In the model selector, I pick Opus 4.7. I paste in a 14-clause master services agreement with three exhibits and a four-page internal negotiation playbook. I ask for every deviation from the playbook, cited to the clause and the playbook section, with proposed redline language.
What comes back is a five-bullet summary. Two of the bullets reference clauses out of sequence. One cites a section of the playbook that does not exist. The redline proposals are generic. Nothing is table-formatted. Nothing is anchored to the source text.
I run the prompt again. It somehow gets worse. I stare blankly at the monitor.
Look how they massacred my boy.
This is the story of every company in 2026 that bought AI from a platform and assumed the model label on the dropdown was the product. It is happening right now at yours.
The model is not the model
Model quality is not a single variable. The quality you experience is a product of several inputs multiplied together.
The weights (the thing the lab trained) are one input. They are the one everybody names. The others are the system prompt the platform injects, the effective context window the platform allows, the tools the platform exposes and how they are defined, the reasoning budget the platform permits, the temperature and sampling parameters, the retry logic, the post-processing filter, and the router that may or may not send your query to the model you actually selected. [M: you don’t recognize the underperformance.]
Swap any of the non-weight inputs for worse versions and the weights do not save you. The model that scored 92 in the lab’s harness does not score 92 in your vendor’s harness. Nobody told you the number would drop. Nobody showed you the new number. There is no new number, because vendors do not publish benchmark results on their own harnesses. They publish the lab’s.1
Four specific things the harness controls
The system prompt. The lab trained the model with a specific system-prompt shape in mind. The vendor replaced it with their own, which is often longer, more restrictive, and tuned to the vendor’s user base, which is not you. The vendor’s prompt tells the model to be cautious, to avoid certain topics, to prefer short responses, to format in specific ways, to never do specific things. Those instructions override behaviors the lab spent millions of training dollars to instill. The model follows the vendor’s prompt because that is what it is trained to do. You get the vendor’s version of the model, not the lab’s.
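To make that concrete, here is the shape of such a prefix. It is invented for illustration; no real vendor's prompt looks exactly like this.

```python
# Invented for illustration: not any real vendor's prompt, just the shape of one.
vendor_system_prompt = (
    "You are the assistant inside AcmeSuite Chat. "
    "Keep answers under 300 words. "
    "Prefer short bullet summaries over tables. "
    "Do not provide legal advice or reproduce long passages from user documents."
)
# Every line is defensible for the vendor's average user, and every line is a
# reason the redline request at the top of this piece came back as a five-bullet
# summary with nothing anchored to the source text.
```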
The effective context window. The model card advertises 200,000 tokens. The vendor’s backend truncates to 32,000 to save compute cost. Your 14-clause MSA plus three exhibits plus playbook is 48,000 tokens. The platform drops your exhibits before the model sees them. The model answers the question without the exhibits it never received. The answer is worse. You do not see the truncation. You see a confidently wrong answer about documents you were sure the model had read.
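Here is the mechanism in miniature, with made-up document sizes and a crude four-characters-per-token estimate; the numbers are illustrative, not any vendor's.

```python
def rough_tokens(text: str) -> int:
    # Crude estimate (~4 characters per token); real tokenizers differ,
    # but the shape of the failure is the same.
    return len(text) // 4

def vendor_truncate(prompt: str, effective_window: int = 32_000) -> str:
    # What a cost-saving backend might do: keep the head, drop the tail, say nothing.
    return prompt[: effective_window * 4]

msa = "CLAUSE 1 ... " * 7_000          # stand-in for the 14-clause MSA
exhibits = "EXHIBIT A ... " * 4_500    # stand-in for the three exhibits
playbook = "PLAYBOOK 2.3 ... " * 2_200 # stand-in for the negotiation playbook

submitted = msa + exhibits + playbook
received = vendor_truncate(submitted)

print(f"you submitted ~{rough_tokens(submitted):,} tokens")   # ~48,000
print(f"the model received ~{rough_tokens(received):,} tokens")  # ~32,000
print("playbook made it to the model:", "PLAYBOOK" in received)  # False
```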
The tool definitions. Frontier models in 2026 are trained alongside a specific vocabulary of tools: how to read a file, how to execute code, how to produce a structured output, how to call a search index, how to chain calls. The lab drilled the model on those specific tool signatures for tens of thousands of examples. The vendor exposes a different, narrower set, with different names and different argument schemas. The model handles it, but worse. Benchmarks were run with the lab’s tool set. Your vendor ships a different one and keeps the same benchmark number on the marketing page.
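A sketch of the mismatch, with both schemas invented for illustration; the point is the drift, not the specific names.

```python
# Neither schema is any lab's or vendor's actual definition.

lab_style_tool = {
    "name": "read_file",
    "description": "Read a file from the workspace and return its contents.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}

vendor_style_tool = {
    "name": "DocFetchV2",
    "description": "Fetches doc.",
    "input_schema": {
        "type": "object",
        "properties": {
            "documentIdentifier": {"type": "string"},
            "tenantId": {"type": "string"},
            "includeMetadata": {"type": "boolean"},
        },
        "required": ["documentIdentifier", "tenantId"],
    },
}
# Same capability underneath. Different name, different arguments, a one-word
# description the model was never drilled on. The model copes, but it copes worse.
```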
The reasoning budget. Frontier models have an extended-thinking mode where the model reasons through a long chain before answering. The lab lets that chain run to tens of thousands of tokens on hard problems. The vendor caps it at two thousand tokens because reasoning tokens are expensive and the vendor does not charge you extra for them. The model’s best effort is now a shallow pass. You selected Opus 4.7. You got Opus 4.7 with its thinking switched off.
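The difference is a single configuration value. The field names below are illustrative, not any provider's actual API.

```python
lab_native_request = {
    "model": "opus-4.7",                # the label on the dropdown
    "max_output_tokens": 16_000,
    "reasoning_budget_tokens": 32_000,  # a long chain is allowed on hard problems
}

vendor_harness_request = {
    "model": "opus-4.7",                # same label
    "max_output_tokens": 4_000,
    "reasoning_budget_tokens": 2_000,   # reasoning tokens cost the vendor money
}
# Both requests say "opus-4.7". Only one of them lets the model think.
```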
Any one of these, applied silently, degrades output materially. Stacked, they turn a frontier model into something that performs like a midrange model from two generations back.
What the platform may also be doing
Beyond the four above, the things that are harder to catch.
Quantization. The “same model” may be a lower-precision version of the same weights, sold under the same label. Quality drops are small but measurable on hard tasks. The vendor does not disclose this.
Older checkpoints. The lab updates the model on a Tuesday. Your platform ships the six-week-old version because upgrading breaks a test suite. You are benchmarking the new one. You are using the old one.
Aggressive prompt caching. The platform caches common prompt prefixes for latency and cost. Your novel context gets concatenated with cached patterns. The model’s output regresses toward the cached examples, which were about other companies’ problems.
Post-processing filters. The model produced the right answer. The vendor’s safety or formatting layer truncated, reformatted, or stripped parts of it. You read the post-processed version and conclude the model is worse than it is.
Internal routing. You asked for Opus 4.7. The vendor’s router decided your query was “easy” and silently sent it to a smaller, cheaper model to save margin. You paid the Opus rate. You got the Haiku answer. [S: bait and switch with extra steps.]
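The routing logic does not need to be sophisticated. This sketch is invented, but nothing about it is exotic.

```python
# Invented for illustration: a margin-protecting router in a few lines.
def route(model_requested: str, query: str) -> str:
    looks_easy = len(query) < 2_000 and "exhibit" not in query.lower()
    return "small-cheap-model" if looks_easy else model_requested

print(route("opus-4.7", "Summarize this clause for me."))  # -> small-cheap-model
```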
Some vendors are transparent about some of this. Most are transparent about none of it.
Why frontier models degrade out of distribution
Frontier model training, at the current state of the art, includes heavy reinforcement learning against specific prompt patterns, tool-use patterns, and system-prompt shapes. Those training distributions are not arbitrary. The lab chose them because they represent the usage patterns the lab wants to optimize for, which are, overwhelmingly, the patterns in the lab’s own product.
Training transfers out of distribution, but it does not transfer perfectly. The further a vendor’s harness drifts from the training distribution, the more the model’s benchmark performance decays. Vendor harnesses are, by definition, out of distribution. How far depends on how much engineering the vendor put into matching the lab’s expected patterns. Most put in very little, because most do not know what those patterns are, because the lab does not publish them in full. [M: train one harness, ship another.]
This is not a conspiracy. It is an engineering reality nobody has strong incentives to tell you about.
If you’re paying a legal AI platform for frontier model access, the model version, the prompt restrictions, and what data flows back to the platform are all worth verifying.
Talk to a Talairis attorney →

To be fair
Not every vendor harness is a regression. Some add things a native lab UI does not: retrieval over your own library, citation and source tracking, audit logging, domain-tuned prompts, evaluations against your task set, workflow integration with tools that matter more to your work than to a benchmark. On a narrow legal task with the right harness, a midrange model can outperform a frontier model in bare chat. This is real.
And not every bad output is the harness. Sometimes the prompt was worse. Sometimes the task was genuinely hard. Sometimes the user expected magic. An honest comparison has to be a controlled one: same prompt, same inputs, same task, measured output against a rubric written before the test. [S: blaming the vendor is half right.]
The fair version of this argument is not “every vendor is worse than native.” It is “most vendors are worse than native on high-ceiling tasks, the gap is usually larger than vendors admit, the cause is harness rather than weights, and the vendor has no contractual obligation to disclose or fix it.” That is the version your procurement team needs.
What to do
The clearest answer: run the controlled A/B. Take the ten most important legal tasks your AI platform is doing (contract review, clause extraction, research memos, deposition prep, diligence) and run each one in the vendor’s platform and in the lab’s native product, with identical prompts and inputs. Measure the output against a rubric you wrote before the test. The gap is either small enough to ignore or large enough to act on. Without the data, you have no way to know what you are paying for.
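A minimal sketch of that A/B, assuming you can drive both products programmatically or by hand; the function names and the naive keyword check are placeholders for whatever grading process you actually use.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    prompt: str
    inputs: str          # the same MSA, exhibits, playbook, etc. for both sides
    rubric: list[str]    # pass/fail criteria written before the test

def score(output: str, rubric: list[str]) -> float:
    # Fraction of rubric criteria judged met. Here a naive keyword check;
    # in practice, a human grader or a separate grading pass per criterion.
    met = sum(1 for criterion in rubric if criterion.lower() in output.lower())
    return met / len(rubric)

def run_ab(tasks: list[Task], run_in_vendor_platform, run_in_native_product):
    # run_in_vendor_platform / run_in_native_product are whatever gets the same
    # prompt and inputs into each product and returns the raw output text.
    for task in tasks:
        full_prompt = task.prompt + "\n\n" + task.inputs
        vendor_out = run_in_vendor_platform(full_prompt)
        native_out = run_in_native_product(full_prompt)
        print(f"{task.name}: vendor={score(vendor_out, task.rubric):.0%} "
              f"native={score(native_out, task.rubric):.0%}")
```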
Beyond that, options.
Demand harness transparency in writing: the exact model checkpoint, the effective context window, the system prompt prefix, the tool definitions, the reasoning-token budget, and whether an internal router may downgrade your selected model. Which of these a vendor will put in writing tells you which vendors are serious. Benchmark the product end-to-end against your own task rubric, repeated quarterly; model labels are not a substitute for measurement. Route high-stakes work to the native path; keep the vendor platform for the volume where convenience is the point.2
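One way to keep those asks intact through procurement is to carry them as a literal checklist; the field names below are illustrative.

```python
# A sketch of the disclosure checklist as structured data, so the asks survive
# procurement intact. None of these fields are standard; they mirror the list above.
harness_disclosures = {
    "model_checkpoint": None,            # exact checkpoint / version string, in writing
    "effective_context_tokens": None,
    "system_prompt_prefix": None,        # the full injected prefix, not a summary
    "tool_definitions": None,            # names and argument schemas actually exposed
    "reasoning_token_budget": None,
    "router_may_downgrade_model": None,  # yes/no, and under what conditions
}

unanswered = [k for k, v in harness_disclosures.items() if v is None]
print("still undisclosed:", ", ".join(unanswered))
```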
Get counsel before the next procurement
The AI your vendor sold you does not say “Opus 4.7” at the weights-plus-harness level. It says some checkpoint, some system prompt, some truncation, some tool set, some router, some filter, in a composition that is almost never in your contract and almost never on the product page. The label on the dropdown is a marketing claim, not a specification.
Before the next AI procurement, before the next renewal, before the next rollout to a team that will depend on the tool for material work, get a counsel conversation. Your contract is the only instrument that can force the disclosures you need to judge what you are actually buying. Those clauses get drafted once, before you sign, or not at all.
A closing thought
The model is not the model. The harness is.
You did not buy weights. You bought a product that shipped some of those weights wrapped in a lot of other decisions, every one of them a place the vendor saved money and you lost performance. The label said frontier. What came back from the model was a shell of what it should have been.
In 2026, the shortest path to worse AI is to pay more for the same model, made worse.
- The vendor’s harness is the spec of the model you’re actually using. The vendor’s contract typically does not reference the harness. The label says frontier. The contract says nothing about what sits between the label and the output. ↩
- The vendor is not going to tell you, “hey, I don’t have enough context tokens for this and I’m going to cut it off and just do my best.” The truncation is silent. The output is worse. You don’t see why. ↩