Tim O'Reilly: How to Fix AI's "Original Sin"
[Post originally appeared on O'Reilly Radar... excerpt below]
EXCERPT FROM TIM O’REILLY’S POST
Last month, The New York Times claimed that tech giants OpenAI and Google have waded into a copyright gray area by transcribing the vast volume of YouTube videos and using that text as additional training data for their AI models, despite terms of service that prohibit such efforts and copyright law that, the Times argues, places those efforts in dispute. The Times also quoted Meta officials as saying that their models will not be able to keep up unless they follow OpenAI and Google’s lead. In conversation with reporter Cade Metz, who broke the story, on the New York Times podcast The Daily, host Michael Barbaro called copyright violation “AI’s Original Sin.”
At the very least, copyright appears to be one of the major fronts so far in the war over who gets to profit from generative AI. It’s not at all clear yet who is on the right side of the law. In the remarkable essay “Talkin’ ’Bout AI Generation: Copyright and the Generative-AI Supply Chain,” Cornell’s Katherine Lee and A. Feder Cooper and James Grimmelmann of Microsoft Research and Yale note:
Copyright law is notoriously complicated, and generative-AI systems manage to touch on a great many corners of it. They raise issues of authorship, similarity, direct and indirect liability, fair use, and licensing, among much else. These issues cannot be analyzed in isolation, because there are connections everywhere. Whether the output of a generative AI system is fair use can depend on how its training datasets were assembled. Whether the creator of a generative-AI system is secondarily liable can depend on the prompts that its users supply.
But it seems less important to get into the fine points of copyright law and arguments over liability for infringement than to explore the political economy of copyrighted content in the emerging world of AI services: Who will get what, and why? And rather than asking who has the market power to win the tug of war, we should be asking, What institutions and business models are needed to allocate the value that is created by the “generative AI supply chain” in proportion to the role that various parties play in creating it? And how do we create a virtuous circle of ongoing value creation, an ecosystem in which everyone benefits?
Publishers (including The New York Times itself, which has sued OpenAI for copyright violation) argue that works such as generative art and texts compete with the creators whose work the AI was trained on. In particular, the Times argues that AI-generated summaries of news articles are a substitute for the original articles and damage its business. They want to get paid for their work and preserve their existing business.
Meanwhile, the AI model developers, who have taken in massive amounts of capital, need to find a business model that will repay all that investment. Times reporter Cade Metz provides an apocalyptic framing of the stakes and a binary view of the possible outcomes. In his interview in The Daily, Metz opines:
a jury or a judge or a law ruling against OpenAI could fundamentally change the way this technology is built. The extreme case is these companies are no longer allowed to use copyrighted material in building these chatbots. And that means they have to start from scratch. They have to rebuild everything they’ve built. So this is something that not only imperils what they have today, it imperils what they want to build in the future.
And in his original reporting on the actions of OpenAI and Google and the internal debates at Meta, Metz quotes Sy Damle, a lawyer for Silicon Valley venture firm Andreessen Horowitz, who has claimed that “the only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license that data. The data needed is so massive that even collective licensing really can’t work.”
“The only practical way”? Really?
I propose instead…
(read the full post for Tim O’Reilly’s solution)