Module 5: Artificial Intelligence
Text and data mining
According to Directive (EU) 2019/790, "‘text and data mining’ means any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations". The Directive further provides a framework for the use of such techniques.
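To make the Directive's definition more concrete, the following is a minimal, illustrative Python sketch of an automated analytical step over text in digital form that surfaces simple patterns (term frequencies) and correlations (term co-occurrence). The toy corpus and tokenizer are invented for illustration only and do not correspond to any real dataset or TDM system.

```python
# Illustrative sketch only (not from the Directive): a toy automated analysis of
# text in digital form that surfaces patterns (term frequencies) and correlations
# (term co-occurrence). The corpus and tokenizer below are invented examples.
import re
from collections import Counter
from itertools import combinations

corpus = [
    "Copyright law governs the reproduction of protected works.",
    "Text and data mining extracts patterns from large collections of works.",
    "AI training relies on mining large collections of text and data.",
]

def tokenize(document):
    """Lowercase the document and split it into word tokens."""
    return re.findall(r"[a-z]+", document.lower())

# Pattern: which terms occur most often across the corpus?
term_counts = Counter(token for doc in corpus for token in tokenize(doc))

# Correlation (roughly): which pairs of terms tend to appear in the same document?
pair_counts = Counter(
    pair
    for doc in corpus
    for pair in combinations(sorted(set(tokenize(doc))), 2)
)

print("Most frequent terms:", term_counts.most_common(5))
print("Most frequent co-occurring term pairs:", pair_counts.most_common(5))
```

Real-world TDM pipelines operate over vastly larger corpora, but the legal questions discussed below arise from the same basic operations: copying works into a corpus and computationally analysing them.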
Brazil
Introduction
Brazil is currently implementing a national strategy on AI, a process it began in 2020. The strategy is holistic: Brazil is working on legislation, regulation, and AI governance with an eye to the ethical use of AI and alignment with international approaches. Brazil’s AI policy also focuses on education, workforce training, research, development and innovation, the application of AI in “productive sectors” of the economy and in public bodies, and public security. The overall policy is founded on the goal of stimulating research, innovation and development of AI solutions while maintaining conscious and ethical use of AI. This national strategy is captured in policy documents including the Brazilian Artificial Intelligence Strategy (EBIA), as well as the National Strategy for Intellectual Property (ENPI) and the National Policy of Innovation.
Pursuant to the Brazilian Artificial Intelligence Strategy, Brazil is working on specific legislation for AI. An initial draft AI Act was introduced in Brazil’s lower legislative chamber (the Chamber of Deputies) in 2020 (PL 21/2020) and was subsequently sent to the upper chamber (the Senate). The Senate convened a committee of experts to hold public hearings and collect public comments, the Comissão de Juristas responsável por subsidiar elaboração de substitutivo sobre inteligência artificial no Brasil (source). Because other relevant draft laws were also pending in the Senate, the Senate issued an order to combine, restructure and improve the text of these proposed laws pending input from the committee. The committee held meetings in 2022 and issued a final report in December 2022 with proposed changes to the text. According to a press release from August 2024, the Senate is still reviewing draft AI legislation. A new draft text incorporating the recommended language from the committee of experts has been assigned bill number PL 2338/23, and it is this bill that appears to be moving forward in the Senate. The Chamber of Deputies’ original draft text, PL 21/2020, has not moved forward, and several other draft laws relating to AI remain pending in the Senate. The Senate now has another commission debating the draft AI Act, which continues to meet regularly (source). Thus, it appears Brazil is getting closer to finalizing the draft legislation.
Text and Data Mining in Brazil
With the AI legislation in Brazil not yet finalized, from a copyright perspective Brazilian law currently does not authorize text and data mining (TDM) without the permission of the copyright holder. Although Brazil’s copyright act, Law No. 9.610 of February 19, 1998 (Law on Copyright and Neighboring Rights), does include some exceptions to the exclusive rights of a copyright holder, none of these exceptions (in Articles 46, 47, and 48) addresses text and data mining [source: https://www.wipo.int/wipolex/en/legislation/details/514]. Thus, the work that Brazil’s Senate is doing on the draft AI Act will be important for regulating text and data mining activities once the law is finalized and approved.
The most recent and active draft AI Act in Brazil, PL 2338/23, includes the following definition of text and data mining:
Art. 4. For the purposes of this Law, the following definitions are adopted:
[…] VIII – text and data mining: process of extracting and analyzing large amounts of data or partial or full excerpts of textual content, from which patterns and correlations are extracted that will generate relevant information for the development or use of artificial intelligence systems.
The draft law would regulate text and data mining as follows:
Art. 42. The automated use of works, such as extraction, reproduction, storage and transformation, in data and text mining processes in artificial intelligence systems, in activities carried out by research organizations and institutions, journalism and museums, archives and libraries, does not constitute an offense to copyright, provided that: I – it does not have as its objective the simple reproduction, exhibition or dissemination of the original work itself; II – the use occurs to the extent necessary for the objective to be achieved; III – it does not unjustifiably harm the economic interests of the owners; and IV – it does not compete with the normal exploitation of the works.
§ 1. Any reproductions of works for data mining activities will be kept under strict security conditions, and only for the time necessary to carry out the activity or for the specific purpose of verifying the results of scientific research.
§ 2. The provisions of the caput shall apply to the activity of data and text mining for other analytical activities in artificial intelligence systems, provided that the conditions of the items of the caput and § 1 are met, provided that the activities do not communicate the work to the public and that access to the works has been legitimate.
§ 3. The activity of text and data mining involving personal data shall be subject to the provisions of Law No. 13,709, of August 14, 2018 (General Personal Data Protection Law) (source) (source).
As noted above, this is only draft legislation, and it could change as the bill is debated further in Brazil’s congress.
The draft TDM definition and exception are notable for being fairly aligned with other jurisdictions such as the European Union. Brazil’s draft definition of TDM goes so far as to stipulate that “large amounts of data or partial or full excerpts of textual content” may be mined, which suggests, even at the level of the definition, an intent to provide an exception to the existing copyright regime. The definition also limits this use to the purpose of the “development or use of artificial intelligence systems.” Arguably, this language ensures that the exception will be limited to the AI field. The inclusion of TDM when AI systems are in “use” is an interesting addition, since it suggests that TDM might be permissible not only at the training stage, but also when AI is deployed.
Like other jurisdictions such as the EU, Brazil’s draft TDM exception in Article 42 provides for use of works by research organizations and cultural institutions, namely museums, archives, and libraries. However, it also notably extends the exception to journalism, alongside the scientific and cultural uses. It is also notable that draft Article 42 requires that any use of the TDM exception “does not unjustifiably harm the economic interests of the owners” and does not compete with the normal exploitation of the works. Like the EU, § 1 of draft Article 42 requires security conditions and limits how long the works may be kept.
Draft Article 42 § 2 appears to provide a broader TDM exception that is not limited to scientific research, journalism, or cultural institutions, with the additional requirements that entities relying on § 2 must not communicate the work to the public and must have had legitimate access to the works in the first place. This would seem to suggest that scraping is not allowed under draft § 2, but may be allowed for scientific research, journalistic purposes, or cultural purposes. Notably, this broad draft TDM exception in § 2 does not include any opt-out mechanism for rights holders who do not wish their works to be used in TDM activities.
Finally, draft Article 42 § 3 states that Brazil’s general data privacy law applies to TDM activities. Thus, if this proposed language becomes law, entities relying on the TDM exception will still have to consider data privacy.
China
We have not identified a definition of ‘text and data mining’ in Chinese legislation or policy, so we have relied on the definition from Directive (EU) 2019/790:
‘text and data mining’ means any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations.
This definition appears in European Union (EU) law regarding exceptions and limitations to copyright for text and data mining. Consequently, to consider the Chinese position regarding text and data mining, we first frame the copyright regime in China.
Copyright Law of the People’s Republic of China (CLC)
The Copyright Law of the People’s Republic of China is a national law first enacted in 1990 and last revised in 2020, with effect from 2021. The Copyright Law aims to protect the copyright of authors in their literary, artistic and scientific works and related rights and interests, to encourage the creation and dissemination of works that contribute to the construction of socialist spiritual and material civilization, and to promote the development and prosperity of socialist culture and science.
It applies to:
- works of Chinese citizens, legal entities or other organizations, whether published or not
- works of foreign or stateless persons that are eligible to enjoy copyright:
  - under an agreement concluded between the country to which the foreigner belongs and China, or
  - under an international agreement
- works of foreign or stateless persons first published in the territory of China
Fair Use Doctrine
China is a civil law country which adheres to early continental European traditions, following the concept of the European droit d’auteur system. The structure of China’s copyright exception model can be considered a closed list rather than an open-ended one (such as the U.S. fair use model).
Specifically, Article 22 of the Chinese Copyright Law, the so-called fair use clause, lists 13 situations in which the use of copyright material constitutes permitted fair use, but it is difficult to fit the use of copyright materials for AI training within these scenarios. Nevertheless, a judicial policy document of the Chinese Supreme People’s Court offers some flexibility in applying the fair use doctrine under special circumstances. It states that:
‘Under special circumstances as necessary for promoting technological innovation and business development, a use of works may be determined as reasonable after consideration of the nature and purposes of use, the nature of works used, the quantity and quality of the part of works used, impacts of use on potential markets or values, and other factors, provided that such use neither contravenes the normal use of the works nor results in unreasonable damage to the lawful interests of the author.’
Additionally, the Regulations for the Implementation of the CLC (RICL) elaborate on Article 22 CLC. According to Article 21 RICL, ‘the exploitation of a published work which may be exploited without permission from the copyright owner in accordance with the relevant provisions of the Copyright Law shall not impair the normal exploitation of the work concerned, nor unreasonably prejudice the legitimate interests of the copyright owner.’ This two-step test thus serves as a complement to the listed exceptions in Article 22 CLC: the extent of exempted uses must not violate the two-step test of Article 21 RICL.
Although the law was last revised in 2021, text and data mining is not specifically referenced as an exception to copyright. Nevertheless, a few of the listed exceptions may be relevant to text or data mining used for AI training.
Fair use exceptions
Article 22(1) CLC provides that ‘use of a published work for the purposes of the user’s own private study, research or self-entertainment’ can be exempted. However, this exception offers little safeguard, as most AI developers are corporations and the research they carry out cannot be deemed private.
Article 22(6) CLC provides that ‘translation, adaptation, compilation, playing, or reproduction in a small quantity of copies, of a published work for use by teachers or scientific researchers, in classroom teaching or scientific research of schools, provided that the translation or reproduction shall not be published or distributed’ can be exempted and may cover text and data mining. Still, the term ‘schools’ excludes commercial educational institutions and the commercial activities conducted by non-commercial institutions. Moreover, even if research institutions such as public universities may be covered by this exception, the reproduction of the published work must also be kept to a ‘small quantity’. This is a restrictive limitation for text and data mining, which may require huge amounts of copyrighted data.
Finally, it could be a reasonably arguable position that Article 22, paragraph 7 may empower the State to engage in text and data mining, provided the State:
- mentions the name of the author and the title of the work; for example, in a list referencing all the works that were subject to the mining; and
- engages in the mining as a State organ fulfilling official duties.
Article 22, paragraph 7, paraphrased: A work may be exploited without permission from, and without payment of remuneration to, the copyright owner, if the name of the author and the title of the work are mentioned and the other rights enjoyed by the copyright owner by virtue of this Law are not prejudiced, in the circumstance of a State organ using a published work, within proper scope, for the purpose of fulfilling its official duties.
Amendments to the Law
Regulation for Generative Artificial Intelligence Services
On April 11, 2023, the Cyberspace Administration of China released a draft of the Regulation for Generative Artificial Intelligence Services; the finalized regulation has been in force since August 2023. It includes 21 measures encompassing a wide array of subjects pertaining to the development and provision of generative AI services.
The regulation is ‘only applicable to services accessible to the general public within China’. It excludes from its scope AI services developed and used by enterprises, research and academic institutions, and other public entities (Article 2). Generative AI technologies are defined as models and related technologies that can generate content in the form of text, images, audio and video (Article 22).
Requirements for training data (Articles 7-8)
The Generative AI Measures require service providers to source data and foundation models from legitimate sources, respect intellectual property rights, and process personal information with appropriate consent or another legal basis under Chinese law. Providers must enhance the quality of training data, striving for ‘authenticity, accuracy, objectivity, and diversity’ (Article 7). The Regulation also places significant emphasis on safeguarding personal data and information by obliging providers to protect users’ input information and usage records during the provision of the service. During the development phase, providers must establish ‘clear, specific and practical’ labeling rules for the training data; conducting quality assessments of data labeling and sample verification to ensure the accuracy of labeled content is also required (Article 8). Additionally, illegal retention of input information that can be used to infer user identities, the creation of user profiles based on input information, and the disclosure of user input information to third parties are prohibited (Article 11).
AI security governance framework
On 10 September 2024, the State Council of China published an AI security governance framework, to come into effect on 1 January 2025. The Framework was put together by China's National Technical Committee 260 on Cybersecurity and seeks to promote responsible AI innovation. Although this framework is not a piece of legislation, compliance is suggested to be effectively mandatory.
OneTrust identifies the key areas of the framework as:
- Risk management and transparency
- Ethical responsibilities
- Accountability mechanisms
- Cross-border data, security, and privacy considerations
Relevant for consideration of copyright and text and data mining exceptions are:
- requirements for data integrity,
- recordkeeping obligations for data sources,
- systems that comply with China’s Data Security Law and Personal Information Protection Law
The Safety Guidelines of the Framework are significant and wide-reaching. The guidelines relevant to training AI models apply to developers of AI models (see section 6.1 of the Framework).
This framework also has a national security flavour, and seeks to ensure AI develops without bias or discriminatory results.
Hong Kong Amendment of Copyright Ordinance
The Hong Kong government recently launched a two-month public consultation aimed at enhancing the Copyright Ordinance to address the challenges brought by AI technology. To maintain a balance of interests between copyright owners and users, the Copyright Ordinance contains a number of copyright exceptions that permit reasonable use of copyright works under specific circumstances. However, each exception is confined to certain situations and purposes and is subject to its own conditions, and there is no specific copyright exception for text and data mining.
The consultation focuses on four areas:
- copyright protection of AI-generated works
- copyright infringement liability for AI-generated works
- introduction of a specific copyright exception for text and data mining activities
- other issues, e.g. deepfakes
The proposal defines text and data mining as ‘automated techniques to extract and conduct computational analysis of extensive collection of text, images, data and other types of information for generating valuable insights, patterns, trends and correlations that would likely be unattainable through manual efforts alone’.
Case law
Beijing Internet Court grants Copyright to AI-Generated Image
The first case to address (a) whether AI-generated works are protectable by copyright and (b) who owns the copyright in such works was decided by the Beijing Internet Court in November 2023.
Facts
The plaintiff, Mr. Li, used Stable Diffusion (an artificial intelligence image generator) to generate the image involved in the case and published it on the Xiaohongshu (Little Red Book) platform; the defendant, a blogger on Baijiahao, used the plaintiff’s AI-generated image to accompany an article, and the plaintiff sued.
Judgement
First, the Court considered the meaning of ‘works’ under the Copyright Law of China, which provides that works must be original and reflect intellectual achievement. In terms of intellectual achievement, the Court noted that the plaintiff did not merely use existing pictures returned by search engines or rearrange pre-designed elements when creating the picture. Instead, the plaintiff designed how the woman should look in the picture and entered the relevant prompts. The plaintiff inputted detailed prompts such as ‘japan idol’, ‘cool pose’, ‘viewing at camera’ and ‘film grain’, and then further adjusted the prompts based on the preliminary images generated by Stable Diffusion. According to the Court, these actions demonstrate the plaintiff's intellectual input.
Regarding the concept of originality, the Court noted that this means the work should be independently completed by the author and should reflect the author’s subjective expression. In the context of AI-generated images, originality should be assessed on a case-by-case basis. In this case, the plaintiff designed the different elements of the image by inputting and fine-tuning the prompts and adjusting the parameters, which ultimately demonstrates the plaintiff's original judgment. Thus, the Court held that the picture is protected by copyright as an original work. It recognized the picture as a ‘work of fine arts’ under Article 3 CLC.
Secondly, on the matter of copyright ownership, the Court held that the Copyright Law provides that copyright shall be owned by the author of the work, and an AI model cannot be an author because it is not a natural or legal person. Moreover, the designer of Stable Diffusion only created the AI model: it was not involved in the intellectual input leading to the creation of the picture, and its licensing agreement states that it does not claim rights in any output content.
Pending Case to address whether AI Training constitutes Copyright Infringement
In China, the first case to address whether AI training constitutes copyright infringement is pending before the Beijing Internet Court.
The case was brought by four Chinese illustrators against the developers of Trik AI, a generative AI painting app released by the well-known Chinese lifestyle platform Xiaohongshu, with a particular strength in generating traditional Chinese-style paintings. The illustrators claimed, among other things, that Xiaohongshu’s use of their work to train its AI model infringed their reproduction rights. Xiaohongshu, on the other hand, contends that using the plaintiffs’ works for AI training should be considered "fair use" under Article 24 of the Chinese Copyright Law. A judgment in this case has not yet been issued.
From a technical perspective, establishing copyright infringement in China depends on whether the AI model has made a ‘temporary’ or ‘permanent’ reproduction of a copyrighted work during the training process. Under the Chinese Copyright Law, the general rule is that it is only when a permanent copy is made that copyright infringement occurs.
India
The Indian Ministry of Commerce and Industry has stated that India’s existing regime of intellectual property rights (IPR) and copyright is sufficient to protect both rights holders of AI-generated works, and rights holders of content that may be used by Generative AI (e.g., training data). Consequently, the Ministry states there is no reason nor need to develop an additional regime specifically regulating AI and related innovations. As a result, there is no specific regulation to discuss relating to training data and scraping, also known as text and data mining (TDM), for AI training purposes. This paper thus focuses on aspects of the current regulatory regime that are relevant to the question of AI training data, and discusses potential limitations and criticisms of the reliance on the current regime.
It is important to note that India also lacks regulation of AI generally, not just regulation of TDM (whether in general or for the purposes of AI training). Despite this, the government has addressed the growing challenge of AI by publishing the National Strategy for Artificial Intelligence, which highlights numerous legal and societal challenges arising from AI. The Strategy does not explicitly mention copyright or legal issues arising from TDM, but it highlights privacy problems that may be relevant to TDM for training as part of the AI lifecycle.
The Indian Ministry of Commerce and Industry brings attention to a few specific aspects of the current IPR regime. Exclusive economic rights of a copyright owner are provided by the Copyright Act 1957. For generative AI to make use of copyrighted material under this act, the use must be covered under fair dealing exceptions in Section 52 of the Copyright Act 1957. Additionally, India’s personal data regulatory framework may be a relevant consideration for TDM.
1.1 The Copyright Act 1957
The Copyright Act 1957 is India’s main regulatory instrument regarding intellectual property and copyright. Section 13 of the Copyright Act 1957 sets out which works are granted copyright protection, namely (1) original literary, dramatic, musical and artistic works, (2) cinematographic films, and (3) sound recordings. Section 14 sets out the acts that are the exclusive right of the copyright owner. Most relevant is the inclusion of “the storing of [the copyrighted work] in any medium by electronic or other means” (emphasis added). This applies to literary, dramatic and musical works, computer programmes, artistic works, and cinematographic films, covering all types of works mentioned in the Copyright Act 1957.
1.1.1 Infringement of Copyright and Fair Dealing Exceptions
Chapter XI Copyright Act 1957 addresses infringements of copyright. Section 51 outlines when copyright will be deemed infringed, which includes when anything is done which is the exclusive right of the copyright holder (Section 51(a)(i)), and other distributions of the copyrighted work, including when it is “to such an extent to affect prejudicially the owner of the copyright” (Section 51(b)(ii)).
Section 52 outlines the acts that are not considered an infringement of copyright, commonly known as fair dealing exceptions. Section 52(a) excludes from infringement “a fair dealing with any work, not being a computer programme,” for the purposes of “(i) private or personal use, including research; (ii) criticism or review [...]; [and] (iii) the reporting of current events and current affairs.” An accompanying explanation states that the storing of any copyrighted work in an electronic medium for the aforementioned purposes shall not constitute an infringement. Section 52(aa) (sic) excludes from infringement certain acts in relation to copyrighted computer programmes, such as when the lawful possessor of a copy makes copies of the programme for certain purposes (Sections 52(aa)(i)-(ii), (ab), (ac), (ad), …). Section 52 is expansive and outlines numerous other exceptions.
1.1.2 Application of the Copyright Act 1957 to TDM
Generally, the Copyright Act 1957 is vague on these questions. The fair dealing exceptions in Section 52 are narrowly drawn and lack any specific reference to TDM or, more generally, to the use of copyrighted information to develop a product as such. The Copyright Act 1957 does reference electronic storage and recent technological changes, for example listing the circumvention of “an effective technological measure applied for the purpose of protecting any of the rights conferred by this Act” (Section 65A) as an offence. However, such mentions are limited and still revolve around the (illegal) act of creating a copy of a copyrighted work. Much of the Copyright Act 1957 is focused on the illegal replication of works.
The Copyright Act’s particular focus has resulted in scholarly opinion that the Act would actually not prohibit TDM, despite the lack of an explicit fair dealing exception. Dr. Arul George Scaria, Co-Director of the Centre for Innovation, Intellectual Property, and Competition (CIIPC) at National Law University, Delhi, has argued for the legality of TDM under Indian copyright law. He argues that Indian copyright law only protects the expression of ideas, not facts or ideas themselves. So when a researcher, for example, uses a copyrighted article in the course of TDM, this should be considered a non-expressive use of the work outside of copyright protection. He goes so far as to argue that it is thus “morally and legally impermissible to allow publishers to use End User Licence Agreements or other contractual tools to restrict researchers from using articles for TDM.”
Additionally, Dr. Scaria emphasises that India’s copyright regime privileges not only the rights of creators or copyright holders, but also the users of such works. He references the Delhi University Photocopy Shop (Single Bench) decision, wherein Justice Endlaw states that the rights of persons under Section 52, the fair dealing exception, should not be interpreted narrowly or strictly. The rights given to users should be interpreted on par with those given to creators. Lastly, Dr. Scaria argues that India’s fair dealing regime can provide the flexibility to apply to newer technological developments such as TDM. Whilst Dr. Scaria makes a compelling argument, part of its force derives from his discussion of TDM as done by researchers, rather than addressing, for example, the use of TDM to train AI for ultimately commercial purposes.
Another case addressed TDM in a different context: use by the state. Whilst this does not explicitly address the use of TDM to train AI, it still provides perspective on how TDM might be understood in the context of India’s regulatory framework. The Supreme Court of India, in Justice K.S. Puttaswamy (Retired) vs Union of India and Ors., recognised TDM’s importance for the state. The Court held that “[TDM] with the object of ensuring that resources are properly deployed to legitimate beneficiaries is a valid ground for the state to insist on the collection of authentic data.” However, this case addressed TDM primarily from a privacy law perspective rather than copyright; consequently, there is still no litigation that addresses TDM in a copyright context. This raises further concerns given the lack of an explicit TDM fair dealing exception in Section 52 of the Copyright Act 1957.
As a result, the likely means of justifying TDM is through Section 52(1)(a), which allows private and personal use of copyrighted works, including for research. This provision is nuanced: making copyrighted works publicly available to facilitate private and personal use, such as by public libraries, could be justified under it. However, the Calcutta High Court stated in 2013 that “commercial exploitation” does not fall under “private or personal use,” a position reaffirmed by the Bombay High Court in 2018. Thus, it seems that the main way in which TDM could be justified under India’s copyright regulation is by arguing, as Dr. Scaria does, that TDM does not infringe the expression of the idea or data and that copyright therefore does not apply.
1.2 Database protection
A newer form of intellectual property rights relates to database protection, which may be relevant to TDM. India does not have an explicit database right in its intellectual property regime, but database protection has recently been subsumed into its copyright framework. The Delhi High Court in Himalaya Drug Company v Sumit ruled that compilations of databases could be considered literary works and consequently covered by the Copyright Act 1957. This would mean that the use of compilations of databases in TDM contexts would have to fall under the fair dealing exceptions in Section 52 of the Copyright Act 1957.
1.3 Personal data mining
Another problematic aspect of data mining is the use of personal data. The source of data used for data mining can be essential, and if the data obtained is not public knowledge or obtained consensually, Indian personal data protection legislation might apply. Companies and organizations that do not obtain personal data with permission, or that discriminate against individuals based on age, sex, gender, race, or religion, will not be compliant with local or transnational laws. Specifically in India, data mining in such circumstances is regulated by the Information Technology Act, 2000 (IT Act) (e.g. Section 43A, which outlines rules on unauthorized access, sensitive personal data and information, and compensation for failure to protect data) and the Personal Data Protection Bill, 2019 (PDP Bill). However, some scholars suggest that privacy-preserving data mining techniques could circumvent data protection regulation. This approach is somewhat reflected in the National Strategy for Artificial Intelligence, wherein the Indian government recommends investing in privacy-preserving AI research methods to reduce the “risks of data exploitation and personal identification (from an anonymisation dataset).”
United States
Text and data mining in the US
Introduction to TDM
Text and data mining is a process used to develop AI models: data is fed into a model and algorithms are applied to the data. The term TDM is not defined in US law; the closest reference is to “machine and human-based inputs” under Title 15 (Commerce and Trade). Another reference can be found under machine learning, which mentions “the ability to automatically learn and improve on the basis of data or experience.”
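As a purely illustrative sketch of the process described above (not a description of any actual AI system), the toy Python example below feeds a small amount of text "data" into a minimal bigram model and then applies a simple algorithm to it. The training text and the model are invented for illustration.

```python
# Toy illustration only: "data is fed into a model and algorithms are applied".
# The training text and the bigram "model" are invented; real AI training ingests
# vastly larger corpora and uses far more sophisticated algorithms.
from collections import Counter, defaultdict

training_text = (
    "the court considered the purpose of the use "
    "the court considered the nature of the work"
)

# Ingest the data: for each word, count which words follow it in the training text.
model = defaultdict(Counter)
tokens = training_text.split()
for current_word, next_word in zip(tokens, tokens[1:]):
    model[current_word][next_word] += 1

def predict_next(word):
    """Apply the model: return the most frequently observed next word, if any."""
    followers = model.get(word)
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("the"))    # most frequent word observed after "the"
print(predict_next("court"))  # -> "considered"
```

Even in this toy form, the legal question posed below is visible: the training text is copied and processed in order to build the model.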
US definition of AI
The legal definition of AI originates from the National Artificial Intelligence Initiative Act of 2020, which regulates public bodies' use of AI:
"a machine based system that can, for a given set of human-defined objectives, make predictions, recommendations, or decisions influencing real or virtual environments. Artificial intelligence systems use machine- and human-based inputs to perceive real and virtual environments; abstract such perceptions into models through analysis in an automated manner; and use model inference to formulate options for information or action.”,
The legal question is: do any restrictions apply to the use of input such as publicly available text or data in digital form for the purpose of training AI models?
To the extent that the input may be protected by copyright, the Copyright Act of 1976 is applicable.
To the extent that the input pertains to data created by human behavior, other acts or constitutional rights may apply.
To the extent that the use of the data may be unfair or harmful to competition or consumer rights, FTC regulations may apply.
Legal framework
U.S. Constitution
Patent and Copyright Clause
(a.k.a. Patent Clause, the Copyright Clause, and the Progress Clause)
The Congress shall have Power ...to promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.
Federal Law
The Copyright Act of 1976
United States copyright law is contained in the United States Code, chapters 1 through 8 and 10 through 12 of Title 17. The Copyright Act of 1976 provides the basic framework for the current copyright law.
Copyright registration has been centralized in the Library of Congress since 1870 and is today administered by the US Copyright Office.
To the extent that “human-based input” and data can be a product of human work, they may be protected under the constitutional clause, which grants an exclusive right to the Author or Inventor.
Copyright protection is granted to original works of authorship that are fixed in a tangible medium of expression. The author of the work generally holds copyright ownership, a right which they may transfer or license to others.
Essential to copyright protection in the US is the fair use doctrine, which provides several exemptions under which limited use of copyrighted material is allowed without the author’s permission.
Notwithstanding the provisions of sections 106 and 106A, the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include—
(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
(2) the nature of the copyrighted work;
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
(4) the effect of the use upon the potential market for or value of the copyrighted work.
The legal discussion regarding TDM in the US revolves essentially around the application of the exceptions under the legal doctrine of fair use.
Following Executive Order 13859 announcing the American Artificial Intelligence Initiative, the United States Patent and Trademark Office (USPTO) released a report titled “Public Views on Artificial Intelligence and Intellectual Property Policy.” The report canvasses a wide variety of stakeholder views on the impact of artificial intelligence (AI) across the intellectual property (IP) landscape, including patent, trademark, copyright, and trade secret policy, as well as developing issues relating to database protection.
In this report, the USPTO stated: ‘The ingestion of copyrighted works for purposes of machine learning will almost by definition involve the reproduction of entire works or substantial portions thereof. Accordingly, whether this constitutes copyright infringement will generally be determined by considering the applicability of the fair use doctrine, an exception set forth in section 107 of the Copyright Act, 17 U.S.C. § 107. Fair use is applied on a case-by-case basis, requiring courts to weigh several statutory factors, and is highly fact-dependent.’
The report included the results of a poll of a broad array of public and private stakeholders on whether the existing legal framework has the instruments to tackle the challenges of AI. While the majority found the existing U.S. intellectual property laws (covering copyright, trademarks, trade secrets, and data) correctly calibrated to address the evolution of AI, some suggested a need for new classes of IP rights. Overall, many agreed that existing commercial law principles (e.g., contract law) might adequately fill any gaps left by IP law in the wake of advances in AI.
COPIED Act bill proposal
On July 11, 2024, a bipartisan bill was introduced in Congress:
Content Origin Protection and Integrity from Edited and Deepfaked Media Act of 2024
The bill proposal addresses new IP issues with respect to content by requiring transparency and content provenance information, with the aim of protecting artistic content.
The intention to establish NIST standards for provenance information appears to open the possibility of remuneration for rights holders. This would be aligned with the intent of the constitutional Patent and Copyright Clause.
Recent US Court cases regarding TDM
NEW YORK TIMES v. OPEN AI INC (2023)
The New York Times claimed that OpenAI and Microsoft, as creators of highly profitable generative AI models, did not give credit or remuneration to the New York Times despite relying on their works. The New York Times further alleges that this conduct not only infringes on their intellectual property rights but could also threaten their business model of providing news coverage, analysis and commentary on current events.
AUTHORS GUILD v. OPEN AI INC (2024)
Seventeen authors argue that OpenAI should have first obtained their permission to use their copyrighted works, and they now seek a permanent injunction against OpenAI to prevent these alleged harms from recurring. The Authors Guild is also seeking damages for the lost opportunity to license their works and asserts that OpenAI and Microsoft forced them into a position where they unknowingly aided their own market replacement.
Policy
Two executive orders have been released pertaining to the use of AI by public bodies.
Executive Order 14110 of October 30, 2023, Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence, mentions data as understood in section 1.1.
Regarding availability of data for training the Executive Order states:
(b) …This effort requires investments in AI-related education, training, development, research, and capacity, while simultaneously tackling novel intellectual property (IP) questions and other problems to protect inventors and creators.
(e) … The Federal Government will enforce existing consumer protection laws and principles and enact appropriate safeguards against fraud, unintended bias, discrimination, infringements on privacy, and other harms from AI. Such protections are especially important in critical fields like healthcare, financial services, education, housing, law, and transportation, where mistakes by or misuse of AI could harm patients, cost consumers or small businesses, or jeopardize safety or rights. At the same time, my Administration will promote responsible uses of AI that protect consumers, raise the quality of goods and services, lower their prices, or expand selection and availability.
(f) Americans’ privacy and civil liberties must be protected as AI continues advancing. Artificial Intelligence is making it easier to extract, re-identify, link, infer, and act on sensitive information about people’s identities, locations, habits, and desires. Artificial Intelligence’s capabilities in these areas can increase the risk that personal data could be exploited and exposed. To combat this risk, the Federal Government will ensure that the collection, use, and retention of data is lawful, is secure, and mitigates privacy and confidentiality risks. Agencies shall use available policy and technical tools, including privacy-enhancing technologies (PETs) where appropriate, to protect privacy and to combat the broader legal and societal risks—including the chilling of First Amendment rights—that result from the improper collection and use of people’s data.
The Executive Order stipulates the need to tackle “novel intellectual property (IP) questions”, to “enforce existing consumer protection laws and principles” and to “enact appropriate safeguards” against harms from AI, as AI “is making it easier to extract, re-identify, link, infer, and act on sensitive information about people’s identities, locations, habits, and desires.” The Executive Order acknowledges the need for new policies to protect not only copyright holders in the context of AI, but also citizens, whose behavioral data are being used to train AI models.
Enforcement
As stated in section 1.1, human-based data may be a broad term, the use of which may violate several laws and regulations. A certain responsibility to protect competition in the interest of consumers lies with the FTC. In the case of AI, it may be difficult to assess whether infringements or violations take place.
A special enforcement resolution was issued by the FTC in November 2023, “directing use of compulsory process in a non-public investigation of products and services that use or are produced using artificial intelligence or that purport to detect the use of artificial intelligence”. This action enables the FTC to open investigations into possible uses of AI in violation of Section 5 of the Federal Trade Commission Act, 15 U.S.C. § 45, in order to detect unfair, deceptive, anticompetitive, collusive, coercive, predatory, exploitative, or exclusionary acts or practices in or affecting commerce relating to products and services that use or are produced using artificial intelligence, including but not limited to machine-based systems that can, for a given set of defined objectives, make predictions, recommendations, or decisions influencing real or virtual environments, or that purport to use or detect the use of artificial intelligence.
The proposed COPIED Act may be instrumental in making the AI market more transparent and in enabling rights holders to receive remuneration based on standardized provenance mechanisms.