Everywhere you look right now, it’s impossible to avoid generative artificial intelligence (AI). From ChatGPT to image creators like Stable Diffusion, the industry has ballooned from almost nothing into a global super-industry. But not everyone is happy. In January 2023, image licensing company Getty Images started legal proceedings against Stability AI, the company behind image generation app Stable Diffusion, over alleged breaches of copyright law.
But these legal battles carry more than just the future of generative AI on their shoulders: they could shape the future of AI art and content creation, and our ability to control how our personal data is used.
The reasons for the court case are pretty simple on the surface. Getty Images, as an image licensing platform, charges a fee for users to access or use images. That system poses a major problem for generative AI systems like ChatGPT or Stable Diffusion, which are reliant on mass data scraping to train their systems on how to answer prompts.
“Training these generative AI models involves vast amounts of data,” says Laura Houston, an expert in copyright law and a partner at law firm Slaughter and May. “For example, in text to image models, you’ve got this need to feed it with hundreds of millions of data points to teach the model to find statistical relations between the words and images.”
Simply put – if an AI image creator wants to work out how to create a picture of, say, a chicken wearing a top hat – it needs to study as many images as it can of chickens and top hats. The sheer scale of the data it needs to learn that ability makes it impossible to meaningfully sift the copyrighted from the un-copyrighted images.
“You’ve got the intellectual property [IP] infringement risk that flows from use of that data to teach the AI model,” she says. “But then you’ve also got the question of what the AI model generates as a result, and whether by virtue of the data it’s trained on, the output of the model risks infringing the IP of that input data.”
This is not all just an intellectual exercise. Copyright law is what underpins the ability of all artists and content creators to be able to protect and control, and thus actually make money from, their work. If generative AI is able to cut straight through that, and use their work to train its systems, it could profit while decimating cultural industries worldwide.
But the legal and moral questions don’t stop with copyright laws. Generative AI and large language models have increasingly been falling foul of data protection regulators, too.
Already, the Italian data regulator has banned Replika, a chatbot built on OpenAI technology, from gathering data in the country.
“Publicly available data is still personal data under the GDPR [General Data Protection Regulation] and other data protection and privacy laws, so you still need a legal basis for processing it,” says Robert Bateman, a data protection expert. “The problem is, I don’t know how much these companies have thought about that… I think it’s a bit of a legal time bomb.”
The personal data breaches are often also pretty strange. Last month, FT journalist Dave Lee found out ChatGPT was giving out his Signal number (posted on his Twitter account) as the chatbot’s own number, and was subsequently inundated with random messages. Even that kind of publicly posted data falls under data protection laws, according to Bateman.
“There is such a thing as contextual privacy,” he says. “You might put your number up on Twitter, and not expect it to appear in a database in China. The same goes for you not [necessarily] expecting it to become the output of chatbots. Data accuracy is one of the principles of the GDPR. You are obliged to make sure personal data in your processes is accurate and up to date.
“But large language models hallucinate about 20% of the time, apparently. On that basis, there’s going to be a lot of inaccurate information about people being distributed.”
But for data protection and IP alike, a major difficulty is working out whether a generative AI system has actually broken the law. The sheer amount of data fed into these systems makes parsing what is and is not problematic an issue. Meanwhile, the output is never an exact copy of what was fed in, making a breach harder to prove than in most copyright cases, which usually turn on direct copying.
This is where large language models like ChatGPT and generative image AIs such as Stable Diffusion diverge. Distorted AI-generated images, more than text, often carry definitive clues to the data that helped create them. The Getty case, for example, overcomes many of the evidential challenges in this area simply because Getty’s own watermark has allegedly been appearing on a lot of Stable Diffusion’s output.
“I think it’s possibly no coincidence that many of these initial legal challenges are cropping up in the world of text-to-image AI models,” says Houston.
It is also likely no coincidence the case was filed in the UK. The US, unlike the UK, has a “fair use” defence for copyright infringement that would make things a lot more friendly to big AI developers.
Meanwhile, the UK has a specific text and data mining exception in its copyright law – but it does not extend to commercial uses, which is exactly what current generative AI systems are.
Nominally that would suggest that personal data and content created in the UK is safer – but parliament and the government’s Intellectual Property Office are already in discussions about whether to widen that law, removing the protections for the commercial exploitation of other people’s content.
Ultimately, the inescapable bind for courts and policymakers alike is the same: they must now choose whether to sacrifice the copyright protections of content creators (and the privacy protections of everyone) on the altar of the billions or even trillions of pounds of economic value the generative AI sector is likely to generate.
While Houston cites the case of Spotify, where “rights holders and tech players were able to eventually reach a landing”, there are some complications to working out a similar compromise here. Attribution – a common solution elsewhere in IP cases – is also a struggle.
“I think the big problem is with large datasets of images or text that they’ve got to use, and I’m unaware of a way the original artists could be attributed somewhere,” says Chen Zhu, an associate professor at Birmingham University’s law school, specialising in Intellectual Property Law.
Moreover, those Computer Weekly spoke to questioned whether any of this is feasible: if you are not even sure your personal data is being harvested, how can you ask for it to be published accurately, let alone ensure it isn’t used at all? Nor is it clear how companies could consult manually with every artist about the inclusion of their work in these systems.
Either way, we’re unlikely to see much movement any time soon. Almost all of those Computer Weekly spoke to agreed it would be two years at least before we see any headway in the legal cases filed by the likes of Getty, and by then, generative AI may have already become, as Bateman put it, “too big to fail”.
Indeed, the sector is already backed by some major finance. OpenAI is supported by Microsoft, for example, while Stability AI, the maker of Stable Diffusion, has already raised over $101m in venture capital and is now seeking a $4bn valuation.
Meanwhile, as Zhu notes, Napster – the file-sharing service that lost its copyright battles with the music industry in the early 2000s – was an industry “underdog” without institutional support or huge sums of venture capital. He cites cases such as when Google digitally copied millions of books for an online library without permission. By the end of the lengthy and costly legal fight with aggrieved authors, the tech giant emerged victorious. “My observation is that companies like Google have been invincible in relation to copyright litigation in the past and have never lost so far,” says Zhu.
Ultimately, the biggest difference between the Napster case and this new raft of cases, which will likely determine the outcome, is that the organisations being challenged this time have money.