Training generative models, such as those developed by OpenAI, is a complex process that uses large amounts of data to teach a model to “understand” existing data and generate new data. “Understand” is a strong word here because, as we know, generative models take a non-deterministic, probabilistic approach, so much so that they are often described as stochastic parrots.
In its article “How ChatGPT and Our Language Models Are Developed,” OpenAI provides details on the development of its generative models, including those that power ChatGPT, and explains how the company addresses copyright and handles personal information.
The company led by Sam Altman explains that its models are developed and updated using several sources: information publicly available on the Internet, information licensed from third parties, and information provided by users or by specialized personnel.
OpenAI adds that information publicly accessible online is useful for training the model, but that content protected by a paywall or published, for example, on the dark web is always avoided. During training, the model learns from the information it reads, improving its ability to predict the right words in a given context. Machine learning models are made up of weights, or parameters, which are adjusted during training. The model, OpenAI is keen to underline, does not store or copy the information; it uses it exclusively to adjust the model weights.
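To make the idea of "adjusting weights to predict words" concrete, here is a minimal sketch in Python. It is a toy model that predicts the next word from the previous word only, not OpenAI's method: the corpus, the function names and the learning rate are all assumptions chosen for illustration. It shows that training changes a fixed-size table of numbers so that likely next words receive higher probability.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_bigram(tokens, vocab, epochs=200, lr=0.5):
    """Fit one weight per (previous word, next word) pair by gradient descent.

    Hypothetical toy trainer: real language models use far larger networks.
    """
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    weights = [[0.0] * n for _ in range(n)]  # every parameter starts at zero
    pairs = [(index[a], index[b]) for a, b in zip(tokens, tokens[1:])]
    for _ in range(epochs):
        for prev, nxt in pairs:
            probs = softmax(weights[prev])
            # Cross-entropy gradient w.r.t. the scores: probs - one_hot(nxt)
            for j in range(n):
                grad = probs[j] - (1.0 if j == nxt else 0.0)
                weights[prev][j] -= lr * grad
    return weights, index

def next_word_probability(weights, index, prev, nxt):
    """Probability the model assigns to `nxt` following `prev`."""
    return softmax(weights[index[prev]])[index[nxt]]

corpus = "the cat sat on the mat the cat ate".split()  # made-up example text
vocab = sorted(set(corpus))
untrained = [[0.0] * len(vocab) for _ in vocab]

weights, index = train_bigram(corpus, vocab)
before = next_word_probability(untrained, index, "the", "cat")
after = next_word_probability(weights, index, "the", "cat")
print(f"P(cat | the) before training: {before:.2f}, after: {after:.2f}")
```

What training produces here is only the `weights` table, one number per word pair, and the probability of "cat" after "the" rises above its untrained value. The sketch illustrates what adjusting weights means; on its own it does not settle whether a large model can reproduce passages from its training data.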
OpenAI: training generative models is fair use
In an official document sent to the US Patent and Trademark Office (USPTO), which is looking into whether artificial intelligence models may be trained on copyrighted content published online, OpenAI states that such activities are expressly permitted by current law because they fall within the doctrine of fair use.
What the fair use principle establishes
Fair use is a US legal principle that permits the use of copyrighted works without the permission of the copyright holder in certain circumstances. A use is considered admissible only if it meets certain criteria, for example if it is for the purposes of criticism, comment, teaching or research. The principle is defined by the US Copyright Act, and the assessment rests on balancing several factors.
In the European Union there is no fair use doctrine like the one in the United States. Member States instead apply an approach based on exceptions and limitations to copyright protection. The “specific cases” allowed vary from country to country, but they generally cover uses for education, research, criticism and news reporting.
It must be said that in 2007 the European Parliament established that copying copyrighted material for the purposes of criticism, review, information, teaching, study or research should not be classified as a crime (article 3 of the IPRED2 directive at the time spoke of fair use).
Generative models learn from existing data – there is no copyright infringement
The thesis put forward by OpenAI is clear: since generative models “learn” during training from pre-existing data, for example data shared publicly on the Web, as a real person would, there is not and cannot be any copyright infringement.
The document prepared by Sam Altman’s company also notes that legal uncertainty over the copyright implications of training AI models tends to significantly increase costs for developers of AI-based solutions. It therefore hopes the issue can finally be resolved in an authoritative and incontrovertible way, so as to encourage development and innovation.
OpenAI cites a dispute between the Authors Guild and Google to support its thesis
Sifting through the document sent to the USPTO, we find that OpenAI cites the case “Authors Guild v. Google” as an example in support of its thesis. Google had scanned tens of millions of books without the authors’ permission to include them in a searchable online database. Ten years ago came the historic ruling that declared Google Books a legitimate service built on the principles of fair use.
The judges ruled that Google’s use of the original copyrighted works was “highly transformative” and that, as such, the snippets provided through the Google Books service could in no way replace the authors’ publications or harm the authors and the publishers themselves.
If Google prevailed outright, so OpenAI’s position goes, imagine how little ground for dispute there can be for generative models, which do not store any third-party content (there are no databases…).
Generative models do not store data in databases and exploit the knowledge acquired to generate new content
OpenAI also refers to similar disputes such as “Authors Guild v. HathiTrust”, in which the court ruled that scanning entire books into a searchable database constituted fair use. It also points to disputes over the large-scale use of digital images, such as “Perfect 10 v. Amazon.com” and “Kelly v. Arriba Soft”: here too, the use of thumbnails of the original images by search engines constitutes fair use.
In short, the leaders of OpenAI argue that training AI systems is even more “transformative” than the examples cited, since it goes well beyond merely preserving the individual content of the works: it builds advanced models from the entire training “corpus” and uses them to generate completely new content. The examples cited mainly deal with access to specific intellectual works, while AI systems go further, generating something new based on previously built learning models.