Imagine opening your inbox one morning to find your name in a vast, once-invisible ledger: a list of authors whose books were ingested by a large AI developer. That was me earlier this week. Two of my books — works I conceived, wrote, and released under proper copyright — are apparently among the several hundred thousand used by Anthropic to train its large language models.
That moment hits differently when you’ve spent years thinking about data, machine-learning pipelines, and intellectual property. It’s one thing to talk theoretically about the dangers of unlicensed scraping or dataset opacity. It’s quite another to find your own name on a list you never asked to be on.

The settlement: a corrective instrument — but by no means a full remedy
In September 2025, Anthropic agreed to pay USD 1.5 billion to settle a class-action lawsuit brought by authors who alleged the company used pirated copies of their books for training.
Key points under the deal:
- Anthropic must destroy the pirated-books libraries and any derivative copies used internally.
- The settlement covers past use only — it does not grant a blanket licence for future ingestion of authors’ works. Any new usage remains outside the settlement.
- The funds are distributed on a per-work, equal basis, subject to deductions (legal fees, administration). If there are multiple rightsholders (co-authors, publishers), the payment is split accordingly; a hypothetical worked example follows this list.
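To make the distribution mechanics concrete, here is a back-of-the-envelope sketch in Python. Apart from the headline fund size, every figure in it (fee rate, number of covered works, ownership split) is a hypothetical placeholder of mine, not a term of the actual settlement.

```python
# Hypothetical illustration of a per-work, equal-basis distribution.
# Only the headline fund size is real; all other numbers are assumptions.

GROSS_FUND = 1_500_000_000  # settlement fund in USD (reported figure)
FEE_RATE = 0.25             # assumed legal/administration deduction
NUM_WORKS = 500_000         # assumed number of covered works

net_fund = GROSS_FUND * (1 - FEE_RATE)
per_work = net_fund / NUM_WORKS

# Where a work has multiple rightsholders, the per-work amount is
# split by ownership share (an assumed 50/50 author/publisher split).
shares = {"author": 0.5, "publisher": 0.5}
payouts = {holder: per_work * share for holder, share in shares.items()}

print(f"Net fund: ${net_fund:,.0f}")
print(f"Per work, before splits: ${per_work:,.2f}")
for holder, amount in payouts.items():
    print(f"  {holder}: ${amount:,.2f}")
```

Even under generous assumptions, the per-work figure lands in the low thousands of dollars, a useful reality check on what “per-work, equal basis” means in practice.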
For me, and for many others, this represents a concrete — if modest — recognition. It is better than nothing. But the payout surely doesn’t reflect the true creative value of a book: its intellectual labour, the ecosystem of readers, the potential future royalties or derivative work.
What this episode reveals about the misuse, mis-attribution and dilution of creative work
• Creative labour becomes “data points”
Large volumes of written works — novels, non-fiction, technical books — were reportedly downloaded, stripped of metadata, and absorbed into generic “training corpora.” This transforms books from expressive creations into mere text blobs. The original authors vanish along with their ownership, context, and consent. In short: creative labour becomes raw fuel.
• Consent and control evaporate with scale
The speed and scale at which AI developers can scrape data make traditional author consent or licensing agreements impractical unless there is structural enforcement or legal incentive. This threatens to normalise the idea that creative works are “free for mining,” undermining authors’ control over how — or whether — their works are used. It is a systemic shift, not just a few bad actors.
• Market and value dilution — for authors and readers
Once an AI model is trained on thousands of books and can synthesise or summarise their ideas, the perceived need (and therefore market) for the original works may decrease. This risk is exacerbated for technical or niche works (like mine) whose value depends on depth, structure, and the author’s voice. The economics get skewed.
• Settlements are remediation — not restoration
The settlement may compensate authors financially for past unauthorised use. But it does nothing to restore authorship, attribution, control, or moral rights. Nor does it guarantee future respect for those rights. In the long run, the architecture of AI training needs rethinking.
For authors, AI practitioners and the community — what we must do now
Given my dual identity as both technologist and author, I see this not simply as a “gotcha” moment, but as a tipping point. Here’s what I think should happen next — and why all stakeholders should pay attention.
- Authors and rights-holders should assert proactive control. In jurisdictions like the UK, regulatory reforms are under discussion to give rights-holders meaningful control over whether their works can be used for AI training. Collective licensing frameworks, metadata protection, opt-out registries — these will matter.
- AI developers must embrace responsible data sourcing. Building models on legally licensed, consented, or public-domain corpora must become the norm. Scraping shadow libraries or ambiguity-laden datasets must no longer be an accepted shortcut.
- Transparency and traceability should be standard. Just as industrial data engineering tracks provenance, versioning and lineage, AI training pipelines should log data origin, rights status, and consent metadata; a minimal sketch of what that could look like follows this list. This isn’t just ethical — it’s essential for the long-term sustainability of both the AI and creative ecosystems.
- Policymakers and the public need to demand balance. The goal should not be to stifle innovation — AI can bring enormous value. But it must not do so at the expense of creators. Legal frameworks should evolve to support licensing, fair remuneration and respect for authorship, while allowing AI research to proceed under agreed terms.
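To make “provenance, versioning and lineage” less abstract, here is a minimal sketch of what a pre-ingestion gate could look like: each candidate document carries a rights status and a consent flag, is checked against an opt-out registry, and the admission decision is written to an append-only lineage log. The record schema, status values, and registry here are my own illustrative assumptions, not an existing standard or any vendor’s actual pipeline.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

# Hypothetical provenance record; field names and status values are
# illustrative assumptions, not an existing industry standard.
@dataclass
class SourceRecord:
    doc_id: str
    origin_url: str
    rights_status: str      # e.g. "licensed", "public_domain", "unknown"
    consent_obtained: bool  # explicit rightsholder consent on file

ALLOWED_STATUSES = {"licensed", "public_domain"}
OPT_OUT_REGISTRY = {"doc-042"}  # stand-in for a shared opt-out registry

def admit_to_corpus(record: SourceRecord) -> bool:
    """Gate one document before ingestion and log the lineage decision."""
    admitted = (
        record.doc_id not in OPT_OUT_REGISTRY
        and record.rights_status in ALLOWED_STATUSES
        and (record.rights_status == "public_domain" or record.consent_obtained)
    )
    # Append-only lineage entry: origin, rights status, decision, timestamp.
    entry = {
        **asdict(record),
        "admitted": admitted,
        "decided_at": datetime.now(timezone.utc).isoformat(),
    }
    print(json.dumps(entry))  # stand-in for a durable lineage store
    return admitted

# An unlicensed scrape is rejected; a licensed, consented work is admitted.
admit_to_corpus(SourceRecord("doc-001", "https://example.org/shadow-lib", "unknown", False))
admit_to_corpus(SourceRecord("doc-002", "https://example.org/licensed-set", "licensed", True))
```

The design point is that admission and logging happen in the same step, so every document in the corpus has an auditable answer to the questions “where did this come from?” and “who agreed to its use?”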
My personal stance — and why I write this
Having worked in data/ML for over two decades, and having published technical books myself, this situation lands uncomfortably close to both sides of the divide. My professional instinct understands the allure of large-scale textual datasets. My moral and creative instinct recoils at the idea that human labour — years of writing, revising, editing — can be bypassed, ingested, and re-used without notice, consent or attribution.
I don’t believe the settlement is the end of the story. Instead, it feels like a first step: a belated recognition, but also a warning shot. If we don’t act now — to demand consent, build transparent training pipelines, and ensure fair remuneration — creative work risks being commoditised into anonymous data bricks.
Because whether or not we accept that, we are never just building AI. We are building on the cultural, intellectual, and emotional labour of real people — and we should treat it with the respect it deserves.