OpenAI Reportedly Transcribed YouTube Videos to Train Its AI Models

  • Editor
  • May 15, 2024
OpenAI reportedly transcribed more than one million hours of YouTube videos to gather data for training its GPT-4 model. The New York Times reported that OpenAI might have overlooked YouTube’s copyright rules, a concern for Google, the owner of YouTube.

To convert the videos’ audio into text, OpenAI used Whisper, a speech recognition tool it developed. The resulting text was then used to make ChatGPT better at conversation. Inside OpenAI, there were discussions about whether using YouTube’s videos this way was appropriate. Despite the potential policy issues, the company decided to proceed, driven by the need for diverse data to improve its AI.

Greg Brockman, OpenAI’s president, was reportedly involved in choosing the videos for this project. YouTube does not allow its videos to be used this way. However, OpenAI spokesperson Lindsay Held said the company uses various data sources, including public data and data obtained through partnerships, to teach its AI about the world.

Google’s response was cautious, with spokesperson Matt Bryant pointing out that YouTube’s rules don’t allow this kind of use of its content. Neal Mohan, YouTube’s CEO, also expressed concern, indicating that if OpenAI used YouTube videos to train Sora, another of its AI products, it would clearly break the platform’s policies.

The New York Times report also mentioned that Google might be doing something similar with its own AI model, Gemini, by using transcripts of YouTube videos for training.

The situation puts a spotlight on the complex issue of using online content to train AI technologies, raising important questions about copyright and the ethics of data use.

