ML Data Pool

Building collective datasets for Machine Learning

2020

  • Contact the EBU Members and identify their potential contributions (Q2 2020)
  • Identify any legal aspects and the project organisation (Q2 2020)
  • Build the project plan (Q2 2020)

Machine Learning (ML) projects require large amounts of data for training. Including ‘journalistic’ knowledge, such as lists of public personalities, helps with inference. However, individual broadcasters’ libraries are usually too small. By pooling their data and contextual knowledge, EBU Members can collectively build sufficiently large, and more diverse, datasets. This benefits all participants.

Goals

  • Building a collective dataset to train and to benchmark AI applications
  • Sharing resources and software for data cleaning
  • Sharing resources to generate ground truth materials (for instance from subtitles or other labelling methods)
  • Sharing the journalistic knowledge of the EBU Members to build inference datasets
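
As an illustration of the subtitle-based ground truth mentioned above, the following is a minimal sketch that turns SRT-style subtitles into timed text segments, which could serve as rough labels for speech-related training. It assumes SRT as the subtitle format; the `Segment` class and the `parse_srt` function are hypothetical names chosen for this example and are not part of any project software.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    """One subtitle cue: start and end in seconds, plus its text."""
    start: float
    end: float
    text: str


# Matches timestamps such as 00:01:02,500 (SRT) or 00:01:02.500.
_TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _to_seconds(stamp: str) -> float:
    match = _TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"unrecognised timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(srt_text: str) -> list[Segment]:
    """Parse SRT-style text into a list of timed segments (assumed format)."""
    segments = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        timing = next((i for i, ln in enumerate(lines) if "-->" in ln), None)
        if timing is None:
            continue  # not a cue block, skip it
        start, end = lines[timing].split("-->")
        # Ignore optional positioning data after the end timestamp.
        end = end.split()[0]
        text = " ".join(lines[timing + 1:])
        if text:
            segments.append(Segment(_to_seconds(start), _to_seconds(end), text))
    return segments
```

Subtitles are often paraphrased and loosely timed, so segments produced this way would probably need alignment or manual review before being treated as ground truth.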

EBU Members can join this group to contribute or keep in touch.

Contributions may include:

1. sharing data from your archives
2. helping to gather data in other ways, such as web scraping
3. contributing information, such as a list of local public personalities
4. helping to develop software to normalize and clean the collected data
5. helping to define the database’s architecture
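
To make the normalisation and cleaning step concrete, here is a minimal sketch of how lists of public personalities from several Members might be merged and de-duplicated. The function names, the member identifiers and the record layout are assumptions made for this example; the project's actual software and database architecture are not described here.

```python
import unicodedata


def normalise_name(name: str) -> str:
    """Build a comparison key: no accents, case-folded, single spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.casefold().split())


def merge_name_lists(contributions: dict[str, list[str]]) -> dict[str, dict]:
    """Merge per-member name lists, tracking which members supplied each name.

    `contributions` maps a (hypothetical) member identifier to its list of names.
    """
    merged: dict[str, dict] = {}
    for member, names in contributions.items():
        for raw in names:
            key = normalise_name(raw)
            if not key:
                continue  # skip empty or whitespace-only entries
            entry = merged.setdefault(
                key, {"display": " ".join(raw.split()), "sources": []}
            )
            if member not in entry["sources"]:
                entry["sources"].append(member)
    return merged
```

Matching on a normalised key only catches spelling variants such as case, accents and spacing. Distinct people who share a name would be merged incorrectly, so a real pipeline would likely need additional identifiers.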

Project organisation

This project is chaired by Swiss-French public service broadcaster RTS. The EBU hosts the project, data and software.

Related topics

AI benchmarking