Can Your Cloud Provider Train AI on Your Files? Most Can. Here's the One Architecture That Can't

LinkedIn, Meta, and Atlassian all flipped stored content into AI training data by default. The only providers that structurally can't are the ones holding nothing but ciphertext.

The direct answer

Most mainstream cloud providers can read your files, so they can also feed those files to AI training — and several already do, on by default, with the opt-out buried in admin settings. The only architecture that structurally can't is zero-knowledge encryption: the provider holds ciphertext and never sees your keys, so there is no plaintext to train on. Not a promise. A property of the design.

This stopped being hypothetical in 2025

The "your data trains our AI" clause used to live in the part of the terms nobody reads. Now it's the default toggle, and the rollout has been relentless.

  • LinkedIn switched on a setting called Data for Generative AI Improvement, enabled by default. For members in the EU, EEA, Switzerland, Canada, and Hong Kong it took effect 3 November 2025; in the US it had already started. It covers your profile and public posts (private messages excluded), and opting out only stops collection going forward — anything already pulled into training stays.
  • Meta began training on Europeans' public Facebook and Instagram posts after an opt-out window closed on 27 May 2025, leaning on "legitimate interest" as its legal basis rather than asking for consent. The Austrian privacy group noyb sent a cease-and-desist over exactly that framing. Public posts and chats with Meta's assistant were in scope; private messages were not.
  • Atlassian is the one that should make anyone storing work in the cloud sit up. From 17 August 2026, in-app content from Jira and Confluence (work-item titles, descriptions, comments, Confluence page bodies) is collected by default to train Atlassian Intelligence across its roughly 300,000 cloud customers. The opt-out is tiered: Free, Standard, and Premium plans can't opt out of metadata collection at all, and only Enterprise has collection off by default. Privacy becomes a function of your invoice.

The pattern repeats every time. The capability already existed — the provider could read the data — so flipping it from "stored" to "training corpus" was a settings change and a paragraph in the terms, not an engineering project. That's the part worth sitting with.

Why "we won't train on your files" is a policy, not a guarantee

When Dropbox shipped AI search with a toggle that could route files to OpenAI, the toggle was on by default outside the EU, UK, and Canada. But the backlash wasn't really about whether it was on or off. It was about the toggle existing at all: a folder of tax returns and contracts was technically reachable, and the only thing between it and a third-party model was a default the company controlled. Dropbox said data is only sent when you actively use AI features, is never used to train its own models, and is deleted from OpenAI within 30 days. Probably true. But that's a promise that can be rewritten in a terms update, by a future executive, under a future business model. We dug into where Dropbox actually stands in our honest look at Dropbox alternatives.

A policy is a statement of current intent. It lasts exactly as long as the incentives behind it. Three years ago, "we'd never train AI on customer files" was an easy thing to promise, because there was no AI worth training. Today every one of these companies has a model to feed, a competitor shipping AI features, and shareholders asking why the data sitting on their servers isn't earning its keep. The promise didn't get less sincere. The pressure on it got enormous.

What zero-knowledge actually changes

Zero-knowledge isn't a stronger promise. It takes away the capability the promise is about. With end-to-end encrypted, zero-knowledge storage done correctly, encryption happens on your device before anything is uploaded. The provider receives an opaque blob and stores it. They never hold your keys.

At Beebeeb the chain looks like this. Your passphrase is stretched with Argon2id (256 MB memory, 4 iterations) into a key that never leaves your device. File contents are sealed with AES-256-GCM; sharing uses X25519 for key exchange. Login uses OPAQUE, so the server never even sees your password. Keys are zeroized from memory after use. The server's copy of your file is ciphertext and nothing else.
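
As a rough sketch of the first link in that chain, the snippet below stretches a passphrase into a 256-bit key entirely on the client. It is not Beebeeb's code. Python's standard library has no Argon2id, so `hashlib.scrypt` (a different memory-hard KDF) stands in for it here, and the helper name `derive_master_key`, the example passphrase, and the scrypt parameters are all assumptions for illustration.

```python
import hashlib
import os

# Illustrative stand-in parameters. The post describes Argon2id
# (256 MB memory, 4 iterations); scrypt is used only because it
# ships with Python. These values need roughly 16 MiB of memory.
SCRYPT_N = 2**14  # CPU/memory cost factor
SCRYPT_R = 8      # block size
SCRYPT_P = 1      # parallelism
KEY_LEN = 32      # 256 bits, the key size AES-256-GCM expects


def derive_master_key(passphrase: str, salt: bytes) -> bytes:
    """Stretch a passphrase into a 256-bit key (hypothetical helper)."""
    return hashlib.scrypt(
        passphrase.encode("utf-8"),
        salt=salt,
        n=SCRYPT_N,
        r=SCRYPT_R,
        p=SCRYPT_P,
        dklen=KEY_LEN,
    )


# The salt is random and not secret; the passphrase and the derived
# key are the parts that would stay on the device.
salt = os.urandom(16)
key = derive_master_key("correct horse battery staple", salt)
```

The same passphrase and salt always give the same key, and a wrong passphrase gives an unrelated one. In a design like the one described, only the salt and the ciphertext would ever need to be uploaded, so the server has nothing it could decrypt with.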

Here's the consequence policy can't replicate: to train a model on your files, you need the plaintext. To get the plaintext, you need the key. The key is on your device, derived from a passphrase the server has never seen. No admin setting, no terms update, no future executive, and no government request conjures plaintext out of ciphertext. The provider can't train on your files for the same reason they can't email them to your ex — they physically don't hold them in readable form. The full mechanism, threat model, and the things we explicitly cannot protect against are on our security page.

The trade-offs, said out loud

This isn't free, and we won't pretend it is. Because we can't read your files, we can't offer server-side AI search, content-aware previews, or "find that photo of a receipt" features that depend on the provider seeing your data. Those are genuinely useful, and providers that can read your files can build them. We give that capability up on purpose, and we'd rather tell you than imply you get everything plus privacy.

There's a harder edge most marketing skips: if you lose both your passphrase and your BIP39 recovery phrase, we can't get your data back. We can't reset what we can't read. That's the same property that stops us training AI on your files, seen from the other side. We say it plainly, because the honest version of zero-knowledge includes the part where it can bite you.
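
For a sense of what such a recovery phrase is, here is a minimal sketch of the standard BIP39 encoding: random entropy plus a short SHA-256 checksum, cut into 11-bit indices into a 2048-word list. The post doesn't say how many words Beebeeb's phrase has or how it feeds into the key chain, so the 128-bit example and the helper name `bip39_indices` are assumptions, and the word list itself is left out.

```python
import hashlib


def bip39_indices(entropy: bytes) -> list[int]:
    """Map entropy to BIP39 word indices in 0..2047 (hypothetical helper)."""
    ent_bits = len(entropy) * 8
    if ent_bits not in (128, 160, 192, 224, 256):
        raise ValueError("BIP39 entropy must be 128-256 bits in 32-bit steps")

    # The checksum is the first ENT/32 bits of SHA-256(entropy).
    checksum_bits = ent_bits // 32
    digest = hashlib.sha256(entropy).digest()
    checksum = digest[0] >> (8 - checksum_bits)

    # Append the checksum to the entropy as one big integer.
    value = (int.from_bytes(entropy, "big") << checksum_bits) | checksum

    # Read the result as consecutive 11-bit groups, most significant first.
    word_count = (ent_bits + checksum_bits) // 11
    return [
        (value >> (11 * (word_count - 1 - i))) & 0x7FF
        for i in range(word_count)
    ]


# 128 bits of entropy give 12 words; 256 bits give 24.
indices = bip39_indices(bytes(16))
```

With real random entropy instead of the all-zero input used here, the phrase carries at least 128 bits that exist only where you wrote them down. There is nothing on the server side from which it could be regenerated.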

What to check before you trust a provider

  1. Who holds the keys? If the provider can reset your password and hand you your files, they hold the keys — and so could a future AI feature.
  2. Is the AI-training setting opt-in or opt-out? Opt-out by default tells you where the company's instincts point.
  3. Can you opt out at all on your plan? Atlassian's tiering shows "opt-out available" can quietly mean "available if you pay enough."
  4. Is the client code open? Encryption claims you can't inspect are just claims. Beebeeb's product clients — the web app, CLI, and the core crypto library, with native mobile and desktop coming soon — are open source, so the encryption can be read by anyone. (Our server and this site are not, and we don't pretend otherwise.)

Zero-knowledge means the answer to "can you train AI on my files?" isn't "we promise not to." It's "we couldn't if we wanted to." If you'd rather test that property than take our word for it, Beebeeb's free 5 GB tier is zero-knowledge — the same architecture as every paid plan, no plaintext on our servers, nothing to train on.

Files only you can read

Beebeeb is end-to-end encrypted, zero-knowledge cloud storage, hosted in Falkenstein, Germany, with open-source clients and a 14-day free trial on every plan. Encryption happens on your device; we only ever hold ciphertext we can’t read.
