A deep-learning audiovisual model from Google could impact voice search, retail and creative production.
Announced on Google's Research Blog, the method can, according to the company, isolate spoken words in a video's audio and distinguish between speech in the foreground and background.
Applied to YouTube, the model could potentially eliminate the need for creators to manually transcribe and caption their content, a common practice for maximising both user enjoyment and search-engine optimisation.
Researchers behind the model believe it will have a range of applications, from speech enhancement and recognition in videos and voice search to videoconferencing and the ability to improve hearing aids.
"In the near term, this will streamline video production, especially valuable in mobile-first video, where lower audio quality makes clean mixing critical for comprehension," said Patrick Givens, vice-president of VaynerSmart at VaynerMedia. "Looking into the future, as we see more consumer attention migrating to audio-first channels, this will also ease the burden of audio production."
Advertisers and agencies scrambling to optimise for voice-based search also see promise.
"The tip of the iceberg in big data is the analysis, while data collection is below the surface," said Danish Ayub, chief executive of MWM Studioz. "Similarly, with voice-search optimisation, the part of the work you don't see is the hours of manpower that go into transcribing the video content to ensure searchability."
Ayub added that the technology could eliminate the need for both human transcribers and paid software that converts audio into text.
A version of this story was first published by Campaign Asia-Pacific.