Most companies’ data retention policies were written between 2012 and 2018. They were drafted by outside counsel during a compliance push, approved by a GC who had just read a Target breach post-mortem, and signed off on by a CEO who understood “delete more, sooner” as the correct answer to every question about corporate data.
That was the right answer then. It is the wrong answer now. [S: 2017-you was correct.]
What the old rule was designed to do
The traditional data retention policy had three goals, all defensive.
Reduce discovery exposure. Fewer emails to produce in litigation. Fewer old Slack threads a plaintiff’s lawyer could mine. Fewer messages where a sales rep wrote something stupid in 2014.
Reduce breach surface area. Data you don’t have can’t be exfiltrated, ransomed, or published on a dark web forum in a 2:00 a.m. news cycle.
Reduce reputational and regulatory risk. Fewer records subject to a data subject access request. Fewer items to inventory in a privacy audit. Less to explain to a state AG or a reporter.
Each of these is real. None has gone away. But none was ever the entire story, and the other side of the ledger has changed enormously in the last 36 months.
What changed
Data is not just risk anymore. It is a capital asset, and increasingly the most important one your company owns.
Every model your company trains, fine-tunes, or evaluates (a search ranker, a support classifier, a sales copilot, or a genuine foundation model) gets better with more of your data. Not public data. Not licensed data. Your data. The customer conversations, the support transcripts, the internal wiki, the historical decisions, the edge cases, the things only your company has seen.1
You cannot go back and recreate the last ten years of that. Once it is deleted, it is gone. The competitive advantage it would have conferred is gone with it. Your competitor who kept their data has a training set. You have a policy document. [S: try training a model on a memo.]
Companies building AI today are, in effect, paying for the retention choices their predecessors made a decade ago. The ones who kept more data are compounding. The ones who followed the conservative 2017 playbook are starting from scratch against competitors who didn’t.
There is also a capability that did not exist a decade ago. The reason so many companies deleted aggressively was practical, not only legal. Nobody knew what was in the old ticket queue, the email archive, the support library, the unlabeled S3 bucket. Hand-classifying it was impossible. The safe move was to throw it out. AI has changed that math. A competent pipeline can now map previously unmapped schema at a fraction of the old cost. Flag PII. Classify by sensitivity. Tag by topic. Identify legal-hold material. Redact what needs redacting. Surface what is actually valuable. That cuts both ways. It reduces the risk of keeping the data, because you know what you have and can defend it, scope a DSAR against it, or segregate the sensitive pieces on demand. And it increases the utility, because the same pipeline that makes retention defensible is the one that makes the data trainable.
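As a rough illustration of what such a pipeline does, here is a minimal rule-based sketch in Python. Everything in it is hypothetical: the `Record` shape, the patterns, and the tag names are illustrative, and a real pipeline would put an ML or LLM classifier behind the same interface rather than regexes.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Record:
    """Hypothetical unit of unstructured data (a ticket, an email, a wiki page)."""
    doc_id: str
    text: str
    tags: dict = field(default_factory=dict)

# Toy detectors standing in for real classifiers.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
LEGAL_HOLD_TERMS = ("subpoena", "litigation hold", "preservation notice")

def classify(record: Record) -> Record:
    """Flag PII, spot legal-hold material, and assign a sensitivity tier."""
    record.tags["pii"] = sorted(
        name for name, pat in PII_PATTERNS.items() if pat.search(record.text)
    )
    record.tags["legal_hold"] = any(
        term in record.text.lower() for term in LEGAL_HOLD_TERMS
    )
    record.tags["sensitivity"] = (
        "restricted" if record.tags["pii"] or record.tags["legal_hold"]
        else "internal"
    )
    return record

def redact(record: Record) -> str:
    """Return the text with flagged PII spans masked."""
    text = record.text
    for name, pat in PII_PATTERNS.items():
        text = pat.sub(f"[{name.upper()} REDACTED]", text)
    return text
```

The point of the sketch is the shape, not the detectors: once every record carries machine-written tags, the same inventory serves a DSAR scope, a legal-hold sweep, and a training-set filter.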
The stakes are not incremental. The agentic era, where software acts on your behalf, executes workflows across your business, and makes judgment calls against your operational history, runs on proprietary data. Public models plus a thin wrapper is not a moat. Your own data, across years of your own operations, is the moat. A company that deleted its moat because a 2017 memo told it to cannot simply start collecting again and catch up. The compounding has already happened on the other side. This is the kind of mistake that quietly removes a company from the set of companies that can still innovate and compete. Not in five years. In 18 months.
What this means for you
Three tensions most companies have not worked through.
The retention calendar is asymmetric. Privacy regulation in most jurisdictions imposes a ceiling. You cannot keep personal data longer than necessary for the purpose collected. Sector rules (financial, healthcare, employment, tax) impose a floor. You cannot delete certain records earlier than a set period. The window between the two is where your policy lives. Most policies drift toward the floor and call it a day. The correct posture is to understand the full ceiling and justify retention up to it wherever a real business purpose supports it, including AI training, if your notices and lawful basis are scoped to cover it.
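The floor/ceiling window can be made concrete with a small calculation. A minimal Python sketch, assuming illustrative periods only (a 7-year sector floor, a 10-year purpose-supported ceiling); the figures and the `retention_window` helper are hypothetical, not drawn from any statute:

```python
from datetime import date

def add_years(d: date, years: int) -> date:
    # Handles Feb 29 collection dates by falling back to Feb 28.
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)

def retention_window(collected: date, floor_years: int, ceiling_years: int):
    """Earliest lawful deletion date and latest defensible retention date.

    floor_years: sector minimum (e.g. a tax or employment record rule).
    ceiling_years: the longest period the stated purpose and notice support.
    Both inputs are illustrative; this is arithmetic, not legal advice.
    """
    if floor_years > ceiling_years:
        raise ValueError("floor exceeds ceiling: retention terms conflict")
    return add_years(collected, floor_years), add_years(collected, ceiling_years)
```

For a record collected March 1, 2019 with those illustrative periods, the window runs from March 1, 2026 (earliest deletion) to March 1, 2029 (latest defensible retention). Most 2017-era policies delete at the front edge of that window; the argument here is to justify holding to the back edge.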
The notices and lawful basis are load-bearing. You cannot retroactively decide your 2019 customer service transcripts are training data. If your privacy notice said “we retain this for 24 months to service your account,” you retain it for 24 months. If you want it to be training data in 2027, your notices, terms, and lawful basis under GDPR, CCPA, and the rest have to say so now, in the right language, for the right jurisdictions. This is the part almost everyone has gotten wrong. [M: good luck with retroactive consent.]
The defensibility math has flipped. The 2017 answer was “we delete aggressively because we don’t want to produce this in discovery.” The 2026 answer increasingly is “we retain this because it is a competitive asset, and we have the controls (access logs, encryption, segregation, audit) to protect it.” Litigation risk has not disappeared. It has repriced.
What to do
Three things, in order.
1. Inventory what you have and what you’ve been deleting. Most companies cannot answer either question honestly. Start there.
2. Rewrite your privacy notices and customer terms to cover AI training, model improvement, and analytics as specified purposes. Do this once, do it correctly, do it in the jurisdictions where you actually have users, not just the ones your GC is most familiar with. Then honor it.2
3. Extend retention to the outer bound of what your notices and lawful basis support, not the inner bound your old policy assumed. Put controls around it. Segregate training data from production. Log access. Encrypt it. Audit it.
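The control set in that last step can be sketched as a toy wrapper. All names here (`GovernedStore`, the zone labels) are illustrative assumptions; a real deployment delegates encryption to KMS or disk-level tooling and ships the log to an immutable audit store, but the surface is the same: segregation, logged access, integrity checks.

```python
import hashlib
import json
from datetime import datetime, timezone

class GovernedStore:
    """Toy store: zone segregation, append-only access log, content hashes.

    The sha256 digest stands in for encryption-at-rest and tamper
    evidence; it is an illustration of the control surface, not a design.
    """

    def __init__(self):
        self._zones = {"production": {}, "training": {}}
        self.access_log = []

    def put(self, zone: str, key: str, payload: dict) -> None:
        digest = hashlib.sha256(json.dumps(payload).encode()).hexdigest()
        self._zones[zone][key] = {"payload": payload, "sha256": digest}
        self._audit("put", zone, key)

    def get(self, zone: str, key: str, actor: str) -> dict:
        # Every read is attributed and logged before the data leaves.
        self._audit("get", zone, key, actor)
        return self._zones[zone][key]["payload"]

    def _audit(self, action: str, zone: str, key: str, actor: str = "system") -> None:
        self.access_log.append({
            "ts": datetime.now(timezone.utc).isoformat(),
            "action": action, "zone": zone, "key": key, "actor": actor,
        })
```

The defensibility argument in the section above depends on exactly this: being able to show, record by record, who touched what, when, and in which zone.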
The companies that do this in 2026 will have a training advantage in 2029 that the ones who don’t cannot catch up to.
If your data retention policy was last updated before LLM tools were in general use, it’s not addressing the actual retention surface area.
Talk to a Talairis attorney →

A closing thought
Why are you still deleting?
If the answer is “because that’s what the policy says,” you don’t have a policy. You have a 2017 artifact, and your competitors are happy about it.
1. The single most often-deleted category is customer service transcripts. Five years of detailed, edge-case-rich, context-heavy interactions with actual customers. Most companies throw them out at 12 to 24 months because the 2017 memo said so. Those transcripts are the training set the company won’t have when it tries to build the support copilot.
2. Privacy notices and lawful basis travel together. If the notice doesn’t say AI training, the lawful basis doesn’t reach AI training, and no amount of post-hoc rewriting fixes it for data already collected. The rewrite has to happen before the next batch of data is captured.