Deep Neural Network-based dialogue enhancement


Difficulties in following speech on TV due to loud background sounds are a common issue in broadcasting.

Object-based audio (OBA) systems like MPEG-H Audio can address this problem by providing a personalized speech level.

To add customizable dialogue levels to material produced without OBA, deep neural networks (DNNs) can be applied to separate the dialogue from the music and effects in the final audio mix.

One of the technologies used for this is MPEG-H Dialog+, which has been adopted for the new “Clear Speech” service of the on-demand platform of public broadcasters ARTE and ARD. Dialog+ combines source separation with automatic remixing, using the specially developed “Adaptive Background Attenuation” (ABA) algorithm. The result is a new, easier-to-understand mix that enables barrier-free use while retaining the artistic intention as far as possible.
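To make the idea of remixing after separation concrete, the following is a minimal sketch in Python. It is not the ABA algorithm, whose details are not described here. It only assumes that a separator has already produced two equal-length stems (dialogue and background) and applies a fixed attenuation to the background before summing. The function names and the default gain of -6 dB are hypothetical and chosen for illustration only.

```python
def db_to_linear(gain_db):
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


def remix(dialogue, background, background_gain_db=-6.0):
    """Sum the dialogue stem with an attenuated background stem.

    Assumption: both stems are equal-length sequences of float samples
    produced by some source separation step (not shown here).
    The attenuation is static, unlike an adaptive scheme.
    """
    if len(dialogue) != len(background):
        raise ValueError("stems must have the same length")
    gain = db_to_linear(background_gain_db)
    # Sample-wise sum: dialogue stays untouched, background is scaled.
    return [d + gain * b for d, b in zip(dialogue, background)]
```

A gain of 0 dB reproduces the original mix, provided the two stems sum to it. More negative values lower the music and effects relative to the speech. An adaptive approach would vary this gain over time rather than keep it fixed.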

MPEG-H Dialog+ is developed for and with broadcasters and allows data-driven, fully automated workflow integration. Its neural architecture uses consecutive concatenations of local and global features, with the aim of performing well across a wide variety of conditions.

In addition to the Dialog+ technology, this talk presents an overview of source separation techniques and analyzes their trade-offs in audio quality, runtime performance, and suitability for offline and online speech enhancement applications.

Speakers/panelists:

  • Daniela Rieger (Fraunhofer IIS)
  • Philipp Grundhuber (Fraunhofer IIS)
  • Werner Bleisteiner (BR)
  • Daniele Airola (Rai)
