Keynotes

Monojit Choudhury
Title: Meta-Cultural Competence: What LLMs Should Know About Culture to Serve the Next Billion Users
Abstract: Anthropologist Dan Sperber once remarked that “culture is the precipitate of cognition and communication within a human population.” Culture, in this sense, both shapes and is shaped by the flow of conversation. For conversational AI systems to truly engage users meaningfully, they must be capable of interpreting the user’s perspective—something that inherently involves cultural understanding.
Yet culture is notoriously hard to pin down: it resists fixed definitions and defies quantification. In this talk, I propose a computational framework for meta-cultural competence, one that treats cultural knowledge not as static content but as a dynamic prior. This prior informs conversational interpretation and generation, while being continuously refined through mechanisms of explication (surfacing implicit norms or values) and negotiation (adapting meaning in interaction). The goal is to move toward systems that are not just culturally sensitive, but culturally responsive—capable of evolving alongside the user’s own communicative context.
Bio: Monojit Choudhury is a professor of Natural Language Processing at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi. Prior to this, he was a principal scientist at Microsoft Research India and Microsoft Turing. Prof. Choudhury’s research interests lie at the intersection of NLP, the social and cultural aspects of technology use, and ethics. In particular, he has been working on the multilingual and multicultural aspects of large language models (LLMs), their use in low-resource languages, and making LLMs more inclusive and safer. Prof. Choudhury takes a keen interest in popularizing linguistics and AI through puzzle solving; he is the general chair of the Indian National Linguistics Olympiad, the founding co-chair of the Asia-Pacific Linguistics Olympiad, and a founding board member of the International AI Olympiad. He holds BTech and PhD degrees in Computer Science and Engineering from IIT Kharagpur.

Pedro Ortiz Suarez
Title: Expanding the Language and Cultural Coverage of Common Crawl
Abstract: The Common Crawl Foundation is a nonprofit organization that has been operating since 2007. Its mission is to preserve and freely share samples of the public Internet. Common Crawl is a key partner to the AI, ML, and NLP communities, as well as many other research communities. We thus believe that improving Common Crawl’s language diversity, as well as its cultural and community coverage, will directly benefit everyone from the AI to the crawling and archiving communities. In this talk, we present past and current efforts driven by the Common Crawl Foundation and its partners in the research community to improve the language and cultural coverage of web archives, especially for underserved languages and communities. We also present the challenges we have encountered, as well as the impact our efforts have already had on AI, ML, and NLP projects.
Bio: Pedro Ortiz Suarez is a senior research scientist at the Common Crawl Foundation, known for his work on building large-scale multilingual corpora; he will talk about this complex process and how to include very low-resource languages.

Imanol Schlag
Title: Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
Abstract: We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today’s open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.
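The Goldfish objective mentioned in the abstract suppresses memorization by excluding a deterministic subset of target tokens from the next-token loss, so the model is never trained to reproduce those exact tokens. The PyTorch-style sketch below is purely illustrative and not the Apertus implementation; the function name goldfish_loss, the drop rate k, and the position-based masking are assumptions made for brevity.

```python
import torch
import torch.nn.functional as F

def goldfish_loss(logits: torch.Tensor, labels: torch.Tensor, k: int = 4, seed: int = 0) -> torch.Tensor:
    """Next-token cross-entropy where roughly 1 in k target tokens is dropped from the loss.

    Dropped positions contribute no gradient, so the model is not pushed to reproduce
    those exact tokens verbatim (the intuition behind the Goldfish objective).
    NOTE: the published recipe derives the drop mask from hashed local token context so it
    is stable across duplicated documents; the position-based mask here is a simplification.
    """
    # Shift so the prediction at position t is scored against token t+1.
    logits = logits[:, :-1, :].contiguous()
    targets = labels[:, 1:].contiguous()

    # Deterministic pseudo-random mask: True marks positions excluded from the loss.
    gen = torch.Generator().manual_seed(seed)
    drop = (torch.randint(0, k, targets.shape, generator=gen) == 0).to(targets.device)

    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), reduction="none"
    )
    per_token = per_token.masked_fill(drop.view(-1), 0.0)
    return per_token.sum() / (~drop).sum().clamp(min=1)
```

With k = 4, roughly a quarter of the targets are excluded from the loss; as the abstract notes, this kind of suppression strongly reduces verbatim recall while retaining downstream task performance.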
Bio: Imanol Schlag is an AI Research Scientist at the ETH AI Center and co-lead of Apertus, developed as part of the Swiss AI Initiative. He began his career with an apprenticeship in informatics at a Swiss bank, followed by his military service. He then earned his BSc in Computer Science from FHNW and his MSc in Artificial Intelligence with distinction from the University of St Andrews, Scotland. He completed his PhD with distinction at USI/IDSIA under Prof. Jürgen Schmidhuber in 2023, focusing on the systematic generalisation of neural networks and fast weight programmers—scalable self-modifying neural architectures. During his PhD he was invited to join Meta FAIR, Google Research, and Microsoft Research for research internships, where he investigated foundational questions in neural computation, scalable neural network architectures, and LLMs. After his defense, he worked with Prof. Thomas Hofmann before moving to the ETH AI Center.

Shixiang Shane Gu
Abstract: TBD
Bio: Shixiang Shane Gu is a Senior Staff Research Scientist at Google DeepMind, where he leads the Multilinguality team in Gemini Post-Training. Previously, he was a researcher on the ChatGPT team at OpenAI, a Research Scientist at Google Research, Brain Team, and a Visiting Associate Professor (Adjunct Professor) at the University of Tokyo.

Julia Kreutzer
Title: Optimizing Multilinguality Post Training
Abstract: Is multilinguality determined in pretraining? Starting from insights gained while building Aya Expanse, this talk will provide a few examples of how we can optimize multilinguality after pre-training: with RL, via test-time scaling, and through data distillation or synthetic data generation. We will learn that techniques optimized for English sometimes disappoint, but that there are surprisingly simple and effective tricks to bootstrap multilingual performance, especially when focusing on open-ended tasks.
Bio: Julia Kreutzer is a Senior Research Scientist at Cohere Labs, where she conducts research on multilingual large language models and their evaluation. Previously, she worked on machine translation, both in her prior role at Google Translate and during her PhD at Heidelberg University, and in collaborations with grassroots NLP communities for lower-resource languages.