Eyal Lavi (BBC), Alberto Messina (RAI) and Alexandre Rouxel (EBU)

Two important aspects of the public service media remit are accessibility and the preservation of language diversity. The burgeoning market for translation and transcription tools may not always sufficiently account for these needs, making it all the more important to pick the right tool for a given job.

Automatic speech-to-text (STT) algorithms can have enormous operational value where large amounts of audiovisual content are processed. Examples include metadata generation and the production of subtitles, which enable new accessibility services and more efficient use of archives.

Every language has its specificities. When it comes to using STT transcription services, special characters, accents, capital letters and punctuation can all impact the meaning of a transcript. The large number of commercial STT vendors and the rapid development of new algorithms make for a dynamic and complex landscape. BenchmarkSTT was developed as a simple and service-agnostic tool to help choose the best STT implementation for a given use case.

This production-ready tool, first released in April 2020, was created in response to requests from the EBU membership, where there is a huge diversity of native languages. Unlike other tools in the area of AI and machine learning, this command-line tool was designed to be used by non-specialists.

The word error rate (WER) is the standard metric for benchmarking automatic speech recognition models, but it can be a blunt tool. It treats all words as equally important, whereas in reality some words and phrases, such as proper nouns, are more significant than others.
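To make the limitation concrete, here is a minimal sketch of a textbook WER calculation: the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The function names and sample sentences are illustrative assumptions, not BenchmarkSTT's own implementation or API.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (illustrative helper)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Textbook WER: word-level edit distance over reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Hypothetical transcripts: each hypothesis contains exactly one wrong word.
reference = "the prime minister visited geneva today"
wrong_article = "a prime minister visited geneva today"
wrong_place = "the prime minister visited jenny today"

print(word_error_rate(reference, wrong_article))
print(word_error_rate(reference, wrong_place))
```

Both hypotheses receive the same score, although losing the place name is arguably the more damaging error for search or archive metadata.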

When these are recognized correctly by a model, they should be given more weight. Version 1.1 of BenchmarkSTT, recently released, thus introduced new metrics, namely the character error rate (CER) and – to address the problem just described – the ‘bag of entities’ error rate (BEER).
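The sketch below illustrates the idea behind both metrics under stated assumptions. The CER is computed as the character-level edit distance divided by the reference length. The entity score compares how often a chosen entity occurs in the reference and in the hypothesis, ignoring word order; this is a simplified reading of the 'bag of entities' idea, and the exact BEER definition, including any weighting of entities, should be taken from the BenchmarkSTT documentation. All function names and sample texts are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (illustrative helper)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Textbook CER: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def entity_error_rate(entity, reference, hypothesis):
    """Assumed simplification: relative difference in occurrence counts.

    Uses case-insensitive substring counting, so word order is ignored
    and an entity embedded in a longer word would also be counted.
    """
    ref_count = reference.lower().count(entity.lower())
    hyp_count = hypothesis.lower().count(entity.lower())
    if ref_count == 0:
        raise ValueError("entity does not occur in the reference")
    return abs(hyp_count - ref_count) / ref_count


# Hypothetical transcripts: the second mention of the city is misrecognized.
reference = "geneva hosts the ebu and geneva hosts cern"
hypothesis = "geneva hosts the ebu and jenny hosts cern"

print(char_error_rate(reference, hypothesis))
print(entity_error_rate("geneva", reference, hypothesis))
print(entity_error_rate("ebu", reference, hypothesis))
```

Scoring only a list of selected entities lets an evaluation focus on the words that matter for a given use case, such as names used by a search or recommendation engine.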

BenchmarkSTT is open source and available on both GitHub and PyPI, where there is extensive documentation and a detailed step-by-step tutorial. The team is now developing metrics to evaluate the quality of speaker diarization, showing ‘who spoke when’.

BenchmarkSTT was developed by the Flemish Institute for Archiving (VIAA), BBC (UK), Rai (Italy), Sveriges Radio (Sweden) and France Télévisions, under the umbrella of the EBU. Visit tech.ebu.ch/production and click on AI Benchmarking.

Who is using BenchmarkSTT?

Deutsche Welle, Germany

DW is evaluating transcription, translation and voice synthesis AI tools for the development of a cognitive platform. Subjective and objective evaluation metrics are combined, with BenchmarkSTT used for evaluating transcription.

TV 2, Norway

BenchmarkSTT allows TV 2 to compare progress in quality development for different vendors. The broadcaster’s backend AIHub obtains transcripts from different vendors for clips taken from an internal FTP video receiver tool and evaluates them using release 1.1 of BenchmarkSTT running on Docker.

Radio France

BenchmarkSTT is used to assess the added value of transcribing audio content into text to feed search and recommendation features. Already using the WER for a rough evaluation of transcription quality, Radio France plans to use the BEER feature to help improve performance with respect to the words selected by the recommendation engine. Finally, the broadcaster is trying to identify the content types for which automated transcription gives the best results (e.g. news content).

Sveriges Radio

The Swedish broadcaster contributes to the BenchmarkSTT project in order to have a better way to benchmark and evaluate different transcription engines. The WER and CER results feed into its ongoing evaluations of different vendors.

Rai – Radiotelevisione Italiana

In the context of a greater focus on integrating available AI cloud services in strategic production and publication processes, BenchmarkSTT is used for the evaluation of different STT vendors.


BenchmarkSTT was used to compare various STT algorithms used in the in-house transcription API as well as for a small-scale study of subtitles as ground-truth data. The study suggested that it may be possible to use subtitles, rather than expensive, manually created transcripts, as the reference for calculating WER.


This article was first published in issue 48 of tech-i magazine.
