Published on January 9, 2024, 11:15 am

The Complex Debate: Content Usage in Training Generative AI Models

Generative AI, which is revolutionizing the world with its capabilities, has recently come under scrutiny for the types of content used in training these large language models (LLMs). OpenAI, a prominent tech firm, has argued that it would be “impossible” to conduct AI training without utilizing copyrighted material. This statement has sparked discussions about the nature of the content employed by tech firms in building their LLMs.

In its submission to the UK House of Lords Communications and Digital Committee, OpenAI highlighted that copyright law covers a vast range of human expression, including blog posts, photographs, software code, and government documents. Consequently, it is virtually impossible to train today’s leading AI models without incorporating copyrighted materials.

OpenAI emphasized that even if training data were limited to public-domain books and drawings created more than a century ago, the resulting models would not meet the needs of today's users. While the company maintains that copyright law does not prohibit training AI models, it offers an opt-out process for creators who wish to exclude their images from future datasets.

The three main sets of training data used by OpenAI for its LLMs consist of publicly available internet information, licensed third-party data, and data from users or human trainers. However, questions regarding the types of content utilized for generative AI training have been growing rapidly in recent months.

Criticism arose when it became apparent that tools trained on text and images scraped from the internet could reproduce and reinforce existing stereotypes, producing biased outputs. Complaints and lawsuits followed from artists, authors, and companies concerned about unauthorized use of their content.

As a result of these developments, there is now heightened interest in understanding what data is used to train generative AI models and who owns it. AI companies argue that using publicly available internet data falls under the fair use doctrine because the practice transforms the original content into something new. Nevertheless, media companies are becoming more aware of the potential value of their data and have implemented measures to block AI companies from crawling their sites.

To address these concerns, some publishers have chosen to collaborate with AI companies. For example, OpenAI signed a significant deal with Axel Springer in December 2023. However, other publishers have taken a different approach and filed lawsuits against AI companies like OpenAI and Microsoft for using copyrighted materials without permission or payment.

OpenAI, responding to the lawsuit filed by The New York Times, asserted that its training relies on publicly available internet materials, a practice it contends falls under fair use. The company also stressed the responsible use of its technology and said it expects users not to manipulate its models for improper purposes.

While generative AI tools garnered attention last year for their impressive outputs, this year there is increasing scrutiny about the training content used. Questions surrounding ownership, fairness, and copyright issues have come to the forefront of discussions in the AI community.

In conclusion, as generative AI continues to advance rapidly, it is crucial to navigate ethical and legal challenges related to content usage. The ongoing debate around training data highlights the need for continued dialogue and collaboration between tech companies, content creators, and legal authorities to ensure transparency, fairness, and innovation within the field of artificial intelligence.
