The EBU and several Members have released BenchmarkSTT v1.0, a tool for benchmarking Machine-Learning-based speech-to-text (STT) systems. Designed for ease of use and automation, it can be integrated into production workflows, for example to dynamically determine the best model for a specific transcription task.
Automatic speech-to-text algorithms can have enormous operational value where large amounts of audio-visual content are processed, such as creating subtitles, enabling new accessibility services, and making more efficient use of archives. BenchmarkSTT was designed to help choose the best implementation for a given use-case.
The tool was created in response to requests from the EBU Membership. Unlike many other tools in the area of Artificial Intelligence and Machine Learning, this command-line tool was designed to be usable by non-specialists. BenchmarkSTT was developed by the Flemish Institute for Archiving (VIAA), the BBC, France Télévisions, Italy's RAI, Swedish Radio, Switzerland's SRG and others gathered under the EBU umbrella. It is open source and available on GitHub, with extensive documentation and a detailed step-by-step tutorial.
Finding the most suitable STT engine for the job
Speech-to-text technology has improved enormously in recent years and is now one of the most common applications for Artificial Intelligence and Machine Learning in the media sector. Users can choose from a number of commercially available algorithms and value-added services or choose to implement and run their own. But every implementation and deployment has different strengths in terms of speed and accuracy for a given language and domain.
The first release of BenchmarkSTT helps to make such comparisons more objective by measuring the Word Error Rate (WER) of machine-generated transcripts against a reference transcript. It performs its core tasks out of the box, while expert users and developers can easily extend the code and add new features as required. Distributed through the PyPI software repository, BenchmarkSTT can be installed quickly with a single command.
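To make the metric concrete, the following is a minimal sketch of the standard WER definition, (substitutions + deletions + insertions) divided by the number of reference words. It is an illustration of the metric only, not BenchmarkSTT's own implementation, and the normalization (lower-casing, whitespace splitting) is a simplifying assumption.

```python
# Illustrative sketch of the Word Error Rate (WER) metric.
# WER = (substitutions + deletions + insertions) / reference word count.
# This is NOT BenchmarkSTT's code; normalization here is deliberately naive.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One word wrong out of four reference words gives a WER of 0.25.
print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```

A lower WER indicates a transcript closer to the human-made reference, which is what makes the measure useful for ranking different STT engines on the same material.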
For the next release, the EBU working group is defining and implementing new metrics for specific use-cases. One metric under development will focus on the error rate in recognizing named entities, such as the names of countries, organizations and people. These additions will complement the global WER metric, giving more detailed insight into performance on key vocabulary that matters for human comprehension or for further machine-based processing of a transcript.
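One possible shape for such an entity-focused metric is sketched below: the fraction of reference-transcript entity mentions that are missing from the hypothesis. This is purely a hypothetical illustration; the function name, the word-level matching, and the hand-supplied entity list are assumptions of this sketch, not the working group's actual specification.

```python
# Hypothetical sketch of a named-entity error rate: the share of entity
# mentions in the reference transcript that the STT output failed to produce.
# The entity list is supplied by hand here; a real metric would likely use
# a named-entity recognizer and more robust matching.
from collections import Counter

def entity_error_rate(reference: str, hypothesis: str, entities: set) -> float:
    ref_counts = Counter(w for w in reference.lower().split() if w in entities)
    hyp_counts = Counter(w for w in hypothesis.lower().split() if w in entities)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0  # no entities in the reference, nothing to miss
    missed = sum((ref_counts - hyp_counts).values())
    return missed / total

# "ebu" is misrecognized, "geneva" survives: 1 of 2 entity mentions missed.
rate = entity_error_rate(
    "the EBU is based in Geneva",
    "the ABU is based in Geneva",
    {"ebu", "geneva"},
)
print(rate)  # 0.5
```

A metric of this kind can expose failures that a global WER hides: a transcript with a low overall WER may still garble exactly the organization and place names a newsroom cares about most.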