Like so many innovations, generative artificial intelligence began its life in academic research labs. Long before ChatGPT was publicly released in 2022, large language models were being used for many kinds of scholarly research. ImageNet, the focus of one of the lawsuits we will discuss here, celebrated its 10th birthday in 2019, making it well into its teenage years today.
Of course, the public release of generative AI tools has led to many lawsuits over copyright issues. The ChatGPT is Eating the World website maintains a case tracker that lists 107 such cases in U.S. courts. Anyone interested in a broader discussion of the basis of some of these cases could look at my article on authorship, copyright and AI. The majority of those cases focus on AI tools that are publicly available. Recently, however, we have started to see lawsuits that reach back into the research and development of AI in academic laboratories.
In EVOX v. Stanford University, a case before the federal district court for the Northern District of California, it is the 17-year-old ImageNet database that is at issue. ImageNet was an important tool in the development of LLMs, playing a large role in research that showed how increasing the size of a training dataset enhances the performance of models. Interestingly, this case does not challenge the use of ImageNet in the training of generative AI models. In fact, that important research is referred to offhandedly in a single short paragraph of the complaint: “Based upon EVOX’s investigation, Stanford initially used the images it gathered to develop a computer program.” (Para. 17). Instead, the complaint focuses on subsequent uses of images from the ImageNet dataset, alleging that Stanford made a significant number of those images available to the public.
The term “dataset debt,” which I learned from this post about EVOX on Pascal’s Substack, names the risk that arises when datasets used in research linger and become temptations for other uses, potentially uses that are not as clearly fair use as the original one. It is a troubling concept and, in my opinion, puts heavy emphasis on the need for vigilance in the service of protecting open science and open data. If we are careless with data gathered for a fair use, we may discover that data in places where the fair use argument is harder to make (and I offer no opinion about whether or not that happened at Stanford).
The claim that AI training is itself a fair use of copyrighted materials is, of course, very controversial. A significant majority of those 107 copyright cases against AI companies challenge such use. One of those cases, Hendrix v. Apple, is especially interesting because it makes that challenge in a very direct and unadorned way. The Hendrix case does not involve allegations of memorization, that is, claims that the model can directly reproduce at least some copyrighted materials upon request. Instead, the authors who have filed this class action assert that the training itself is an infringement of their copyrights. This makes it an important case to follow because it will allow Apple to lay out a clean fair use argument. The issue of commerciality will be hotly debated; it has become especially controversial since the Supreme Court's Andy Warhol decision, and Apple is certainly a commercial entity. But the question of a pure research use, even by a for-profit company, can still be answered by fair use. Like EVOX, this case involves a large dataset (called OpenElm) used for LLM training, but here the dataset is not accessible to the public. It therefore promises to be one of the most significant bellwethers we will see about non-consumptive research using copyrighted materials.