Google DeepMind, an artificial intelligence research lab, has announced a groundbreaking tool called Video-to-Audio (V2A), designed to generate soundtracks, sound effects, and dialogue for videos.
We’re sharing progress on our video-to-audio (V2A) generative technology. 🎥
It can add sound to silent clips that match the acoustics of the scene, accompany on-screen action, and more.
Here are 4 examples – turn your sound on. 🧵🔊 https://t.co/VHpJ2cBr24 pic.twitter.com/S5m159Ye62
— Google DeepMind (@GoogleDeepMind) June 17, 2024
This new AI technology marks a major advancement in AI-powered content creation, enabling more immersive and synchronized audiovisual experiences.
V2A uses video pixels and optional text prompts to generate synchronized audio that matches the visual content of the video.
This includes sound effects, background music, and even dialogue. The tool automates the alignment of audio with video, reducing the manual effort required in traditional audio editing.
Users can provide positive or negative prompts to refine the audio output, offering greater creative control over the final product.
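V2A is not publicly available and has no published API, so the Python sketch below is purely hypothetical: every class, field, and value is invented to illustrate the inputs described above, namely the video itself plus optional positive and negative text prompts and the ability to request multiple variations.

```python
from dataclasses import dataclass

# Hypothetical request structure -- V2A has no public API, so these field
# names are invented purely to illustrate the inputs the article describes:
# the video whose pixels drive generation, plus optional text prompts.

@dataclass
class V2ARequest:
    video_path: str                      # the silent clip whose pixels drive generation
    positive_prompt: str | None = None   # sounds to steer the output toward
    negative_prompt: str | None = None   # sounds to steer the output away from
    num_variations: int = 3              # V2A can produce many soundtracks per video

# Example: the cyberpunk driving demo described later in the article
# (the negative prompt here is illustrative, not from the demo).
request = V2ARequest(
    video_path="cyberpunk_drive.mp4",
    positive_prompt="cars skidding, car engine throttling, angelic electronic music",
    negative_prompt="dialogue, crowd noise",
)
print(request)
```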
This is what people worldwide have to say about this new AI update!
This is amazing! When can we try it?
— Minh Do (@minhsmind) June 17, 2024
As DeepMind explores AI’s capability to generate soundtracks and dialogue for videos, it underscores the versatile impact of AI across all forms of media. Delve deeper into how AI is shaping the music industry in our article, “I made this Pop Song using free AI tools – AI music revolution is nigh”, where we share our personal journey of creating music with AI.
V2A can generate an unlimited number of soundtracks for any given video, providing a vast array of audio options. The tool is versatile and suitable for enhancing archival footage, silent films, and educational videos, among other uses.
V2A employs a diffusion model to refine random noise into high-fidelity audio, leveraging both visual and textual inputs.
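DeepMind has not published V2A’s architecture beyond the description above, so the following is only a schematic sketch of how conditional diffusion sampling works in general: random noise is refined step by step into an audio signal, conditioned on embeddings of the video and the text prompt. The toy denoiser and the dummy embeddings stand in for the learned networks a real system would use.

```python
import numpy as np

# Schematic sketch of conditional diffusion sampling, not DeepMind's actual
# model: start from pure noise and repeatedly refine it toward audio,
# conditioned on embeddings of the video frames and the text prompt.

rng = np.random.default_rng(0)

def toy_denoiser(noisy_audio, video_embedding, text_embedding, step):
    """Stand-in for the learned network that predicts a cleaner signal.
    A real model would be a large neural net trained on video/audio pairs."""
    conditioning = 0.5 * video_embedding.mean() + 0.5 * text_embedding.mean()
    return noisy_audio * 0.9 + conditioning * 0.1   # drift toward the conditioning

def sample_audio(video_embedding, text_embedding, num_samples=16000, steps=50):
    audio = rng.standard_normal(num_samples)         # start from random noise
    for step in reversed(range(steps)):              # iteratively refine the signal
        audio = toy_denoiser(audio, video_embedding, text_embedding, step)
    return audio

# Dummy conditioning vectors standing in for real video/text encoders.
video_emb = rng.standard_normal(512)
text_emb = rng.standard_normal(512)
waveform = sample_audio(video_emb, text_emb)
print(waveform.shape, waveform.mean())
```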
Runway and now Google both release on the same day?
It just keeps getting better
— Laurence Bremner (@LaurenceBrem) June 17, 2024
The model was trained on a diverse dataset comprising videos, audio, and detailed annotations to establish strong associations between specific sounds and visual events.
Despite its impressive capabilities, V2A has some limitations. The quality of the generated audio is highly dependent on the input video’s quality, and the tool struggles with lip synchronization for dialogue.
We’re gonna get audio for all these memes people have been animating … can’t wait
— Rohan (@rohanvisme) June 17, 2024
DeepMind acknowledges these challenges and is focused on ongoing research to improve these aspects.
The tool is not yet available to the public, as it will undergo rigorous safety assessments and testing to ensure its reliability and prevent misuse. All generated audio will carry Google’s SynthID watermark to identify it as AI-generated.
Not even a release date pic.twitter.com/3yCrkcSXI9
— Algorusty — (Christ/acc⏩✝️) (@algorusty) June 18, 2024
V2A is entering a competitive market with other AI-powered audio generation tools from companies like ElevenLabs and Stability AI.
However, DeepMind claims that V2A stands out due to its ability to automatically sync audio with visual content without requiring text prompts, which is a significant advancement over existing tools.
Really cool, but is this another Google AI product we can never try?
— Matti Paivike (@MattiPaivike) June 18, 2024
By combining video pixels and text prompts, DeepMind’s tool allows users to create scenes with dramatic scores, realistic sound effects, or dialogue that matches the characters and tone of a video.
Wow that’s cool I can’t wait to game in reality almost ,right on guys and ladies
— Shem Pullen (@ShemPullen) June 18, 2024
For example, in a demo of a car driving through a cyberpunk-esque cityscape, the prompt “cars skidding, car engine throttling, angelic electronic music” was used to generate audio that perfectly synced with the visuals.
Another example demonstrated an underwater soundscape using the prompt “jellyfish pulsating underwater, marine life, ocean.”
✍️ Prompt for audio: “Jellyfish pulsating under water, marine life, ocean.” pic.twitter.com/PftZPS7mgq
— Google DeepMind (@GoogleDeepMind) June 17, 2024
The tool can generate an endless stream of audio options, setting it apart from other AI tools like the sound effects generator from ElevenLabs.
Additionally, it can be paired with AI video generation tools such as DeepMind’s Veo, while rivals like OpenAI’s Sora plan to incorporate audio eventually.
V2A sounding are great..!! pic.twitter.com/37SexzsajY
— Andrewety (@cbk0649) June 18, 2024
The AI model powering V2A was trained on a combination of sounds, dialogue transcripts, and video clips, enabling it to match audio events with visual scenes effectively.
However, DeepMind is working to improve the tool’s ability to synchronize lip movements with dialogue and to maintain audio quality when dealing with grainy or distorted video inputs.
Hope this will be available to users.
— Amol Gargote (@amolgargote) June 17, 2024
DeepMind’s V2A represents a significant step forward in the integration of AI-generated audio and video, offering innovative solutions for content creators.
While promising, V2A still faces challenges, particularly in lip synchronization and audio quality with lower-quality videos.
It is of next level, very impressive I must say!!
— Aagarwal (@AaravAg81996317) June 18, 2024
Continued research and testing are essential before a public release. This development positions Google DeepMind at the forefront of AI technology in media production, potentially revolutionizing how soundtracks and audio effects are created and integrated into videos.
This will look good on the shelf.
Can’t wait for somebody to replicate this so we can have access to it.
— Ornias (@OrniasDMF) June 17, 2024
The tool’s ability to generate audio without requiring meticulous manual synchronization and its flexibility in creating various soundscapes make it a powerful asset for enhancing video content across different genres and applications.
For more news and insights, visit AI News on our website.