This workshop is organised by the EBU AI Benchmarking Group as part of the data-related activities of the Smart Media Production Strategic Programme.
Large language models (LLMs), which underpin conversational tools, have attracted considerable attention since the commercial release of ChatGPT. Their evaluation and benchmarking remain open questions and are the subject of active research. The key point is that LLMs are trained on a pretext task, such as predicting hidden words in a text. Performance on this task, however, does not necessarily reflect performance on more complex tasks such as content annotation, summarisation and natural language generation.
In this workshop, we will discuss broadcasters' experiments with LLMs and focus on methodologies for evaluating these models.