
Suchir Salhan




Research Interests

Small Language Models

My research is concerned with building data-efficient small Language Models. While industry-led efforts have built competitive LLMs that have fundamentally reshaped the work of the contemporary (academic) Natural Language Processing researcher, there remain several fundamental, open questions that are not obviously ancillary to those pursued by commercial AI research labs.

For small LMs, tight parameter budgets make each decision critical, yet researchers still lack systematic, scientific ways to test and refine new ideas. We introduce PicoLM, a lightweight, modular framework that enables systematic, hypothesis-driven research for small and medium-scale language model development. Pico consists of two libraries that together provide a practical sandbox where researchers can make targeted changes to a model's architecture or training procedures and directly observe their effects on the model's behaviour. To support reproducible experimentation, we also release a suite of baseline models, trained under standardised conditions and open-sourced for the community. Check out the YouTube Video put together by Zeb Goriely: Introducing PicoLM | YouTube.
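To make the workflow concrete, here is a minimal sketch of the kind of hypothesis-driven experiment Pico is designed to support: change one design choice, hold everything else fixed, and track a learning-dynamics signal across training. The model, names and metric below are illustrative stand-ins written in plain PyTorch, not Pico's actual API.

```python
# Illustrative only: a toy "targeted change" experiment in plain PyTorch.
# None of these names come from Pico; they sketch the workflow the framework
# aims to make systematic (vary one choice, observe its learning dynamics).
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, STEPS = 256, 32, 200

def make_model(d_model: int, tie_embeddings: bool) -> nn.Module:
    """A minimal byte-level LM; `tie_embeddings` is the single targeted change."""
    class TinyLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, d_model)
            self.rnn = nn.GRU(d_model, d_model, batch_first=True)
            self.head = nn.Linear(d_model, VOCAB, bias=False)
            if tie_embeddings:
                self.head.weight = self.embed.weight  # weight tying
        def forward(self, x):
            h, _ = self.rnn(self.embed(x))
            return self.head(h)
    return TinyLM()

def train_and_track(model: nn.Module) -> list[float]:
    """Train briefly on stand-in data and record loss at regular checkpoints."""
    opt = torch.optim.AdamW(model.parameters(), lr=3e-3)
    losses = []
    for step in range(STEPS):
        batch = torch.randint(0, VOCAB, (8, SEQ_LEN + 1))  # stand-in data
        logits = model(batch[:, :-1])
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1)
        )
        opt.zero_grad(); loss.backward(); opt.step()
        if step % 50 == 0:
            losses.append(loss.item())  # checkpointed learning-dynamics metric
    return losses

for tied in (True, False):
    print("tied" if tied else "untied", train_and_track(make_model(64, tied)))
```

Toggling weight tying while holding everything else fixed is exactly the sort of single-variable comparison the framework is intended to make routine and reproducible.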

Developmental Interpretability (DevInterp) and Language Model Learning Dynamics Research are fundamental to developing small LMs. Small LMs are crucial for low-resourced and democratised Language Modelling, and should still be a serious focus of hypothesis-driven scientific exploration.

Cognitively-Inspired AI

The second strand of my research focuses on cognitively-inspired AI. The scientific study of the human capacity for Language demands a multi-model approach: one that combines a characterisation of the universal and language- or speaker-specific substance of grammars with the inductive biases that support their emergence in language acquisition, and an account of how these conditions interact with domain-general cognition. This is why I am, in part, drawn to the work of the BabyLM Workshop. Much of my work aligns with the goals of the Strict, Interaction, Multimodal (and future Multilingual!) Tracks. However, researchers interested in cognitively-inspired AI should be judicious in disentangling their rationale and motivations when working with BabyLMs (see my recent position paper for further discussion).

In joint work, we introduce ByteSpan, a dynamic tokenisation scheme that groups predictable bytes rather than pooling their representations. It was inspired by recent dynamic tokenisation methods (e.g., Meta's Byte Latent Transformer) that operate directly on bytes and pool their latent representations into patches, and by their similarity to computational models of word segmentation that determine lexical boundaries using spikes in an autoregressive model's prediction error.
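As a rough illustration of the underlying intuition (not the published ByteSpan algorithm), the sketch below groups bytes into patches by opening a new patch wherever per-byte surprisal spikes. In practice the surprisal values would come from an autoregressive byte-level model; the function name, threshold and toy values here are invented for this example.

```python
# Illustrative only: group bytes into patches using spikes in an autoregressive
# model's per-byte surprisal (prediction error). This sketches the intuition
# behind surprisal-based segmentation, not the actual ByteSpan method.
def segment_by_surprisal(byte_seq: bytes, surprisals: list[float],
                         threshold: float = 1.0) -> list[bytes]:
    """Open a new patch whenever surprisal jumps by more than `threshold`
    relative to the previous byte; otherwise extend the current patch."""
    assert len(byte_seq) == len(surprisals)
    patches, current = [], bytearray(byte_seq[:1])
    for i in range(1, len(byte_seq)):
        if surprisals[i] - surprisals[i - 1] > threshold:  # prediction-error spike
            patches.append(bytes(current))
            current = bytearray()
        current.append(byte_seq[i])
    patches.append(bytes(current))
    return patches

# Toy usage with hand-written surprisal values (in practice these would come
# from a byte-level LM): spikes at word-initial bytes start new patches,
# so predictable word-internal bytes stay grouped together.
text = b"the cat sat"
surp = [3.0, 0.5, 0.4, 2.8, 2.5, 0.6, 0.5, 2.9, 2.4, 0.7, 0.5]
print(segment_by_surprisal(text, surp))  # [b'the', b' cat', b' sat']
```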

Small Language Models offer opportunities for more precise characterisation of the diversity of language learners in multilingual language scenarios, and the simulation of learner characteristics offers plenty of opportunities for improving the quality, personalisation and explainability of language learning tools. I actively explore this with local collaborators in the Computer Lab.

I also actively pursue research interests in Linguistics and Cognitive Science. My work aims to cross-cut modern Generativist approaches – specifically, so-called neo-emergentist approaches pursued by several Cambridge-based theoretical linguists (see my previous work for a representative computational cognitive model) – and information-theoretic work traditionally pursued by functionalists, which has been born out of the fruitful mentorship of Dr Fermín Moscoso del Prado Martín. See our ACL 2025 Poster (Main Conference) and ACL Presentation Slides for examples of work on parsing and language processing/acquisition (born out of the Cambridge L95 Parsing and Syntax course, for which I was a Teaching Assistant and Guest Lecturer). More recently, we have been working on information-theoretic and probabilistic models of cross-lingual phonemic distributions (see the slides from an invited keynote talk at "The 13th International Conference on the Mental Lexicon" in Montréal, Québec, Canada here) – stay tuned for more updates on this front!

Research News

Measuring Grammatical Diversity from Small Corpora: Derivational Entropy Rates, Mean Length of Utterances, and Annotation Invariance (Fermín Moscoso del Prado Martín, Suchir Salhan) @ ACL Main Conference (Computational Linguistics Journal Accepted Paper) (Vienna, Austria), July 2025.

ByteSpan @ ICML Tokenisation Workshop (Vancouver, Canada), July 2025

July - August 2025 – I am very excited to announce that I will be co-supervising two Undergraduate Research Opportunities Programme (UROP) students working on Small Language Models, based on Pico, our learning dynamics framework. Stay tuned for more information about the UROPs and the application process, or email sas245@cam.ac.uk or pjb48@cam.ac.uk if you have any questions.

July - August 2025 - I am a Google DeepMind Research Ready Mentor with Richard Diehl Martinez and Prof Paula Buttery, supported by Google DeepMind, the Hg Foundation and the Royal Academy of Engineering.

June 2025 - Dr Fermín Moscoso del Prado Martín & Suchir Salhan. Invited Keynote Talk at "The 13th International Conference on the Mental Lexicon" in Montréal, Québec, Canada: "The Distribution of Phonemes across Languages: Chance, costs, and integration across linguistic tiers".

June 2025 - Accepted Position Paper on BabyLMs and Linguistic Theory in the Cambridge Occasional Papers in Linguistics (CoPiL): Linguistics in the Age of Language Models: What Can Cognitively-Inspired Language Models Offer to Linguistic Theory?

June 2025 - Posters at the Cambridge Learning & Human Intelligence (LHI) Expo (Department of Computer Science & Technology), Cambridge Centre for Human-Inspired AI.

In Easter 2025, Dr Weiwei Sun and I are co-organising a reading group on Computational Models of Language. We meet weekly in the Department, with hybrid options for those joining online, to discuss Neural Grammar Induction, Computational Models of Acquisition, Change and Language Evolution. To join or express interest, email sas245@cam.ac.uk or ws390@cam.ac.uk.

March 2025 - We released PicoLM, the Cambridge Small Language Model & Learning Dynamics Framework. Check out the YouTube Video put together by Zeb Goriely: Introducing PicoLM | YouTube.

March 2025 - I attended and presented a poster at HumanCLAIM, kindly organised by Prof Lisa Beinborn (Göttingen, Germany).

March 2025 - I delivered a presentation on "Human Validated Grammar Profiles for Language Models" in a colloquium organised by Prof Detmar Meurers (Tübingen, Germany).

In Lent 2025, I am the Teaching Assistant (a new role equivalent to Lead Demonstrator) for CST IA Machine Learning & Real World Data and am a supervisor for Machine Learning & Bayesian Inference (MLBI) [CST Part II].

I organise the Natural Language & Information Processing (NLIP) Seminars in the Department of Computer Science & Technology, University of Cambridge. These are weekly seminars with speakers presenting their research on Language Models, Machine Learning, Cognitive Modelling and more traditional topics in Computational Linguistics.

November 2024 - I delivered my first guest lecture, for an MPhil course at the University of Cambridge with Prof Buttery and Dr Fermín Moscoso del Prado Martín, on Language Model Evaluation – at 22, this was a great opportunity and privilege so early in my "formal" academic career.

November 2024 - I presented my MEng Thesis @ The 2nd BabyLM Workshop, co-located with CoNLL at EMNLP (Miami, Florida, U.S.). See the ACL paper here: https://aclanthology.org/2024.conll-babylm.15/. Thanks to my Part III Supervisors (Zeb, Richard, Andrew and Paula) for all their help!

Team & Collaborators

Collaborators

I am very grateful to a number of academics, researchers and PhD students with whom I am currently working or have previously worked:

  • The Cambridge PicoLM Team: Richard Diehl Martinez (and his MPhil/Part III students: Yuval Weiss and David Demitri Africa), Ryan Daniels (a Machine Learning Engineer with the Accelerate Programme for Scientific Discovery), and Prof Paula Buttery.
  • Zeb Goriely, Pietro Lesci and Julius Cheng.
  • XFACT (KAIST AI) and NAVER Cloud: Jiwoo Hong (KAIST), Dr James Thorne, Jeonghoon Kim (NAVER Cloud, KAIST AI) and Woojin Chung.
  • Dr Fermín Moscoso del Prado Martín – we are working on information-theoretic models of diachronic phonological typology. Paul Siewert and I have also been thinking about Category Theoretic approaches in Linguistics.
  • Yuan Gao (CST), Mila Marcheva (CST) and Nuria Bosch Masip (TAL).
  • Gabrielle Gaudeau, Dr Diana Galvan Sosa, Dr Donya Rooin (MilaNLP, Bocconi University), Dr Zheng Yuan and Hongyi (Sam) Gu (KCL/NetMind.AI).
  • Dr Konstantinos Voudouris (Institute for Human-Centered AI at Helmholtz Munich).
  • MPhil Students

I am always open to discussing research ideas that might (or might not!) be related to these directions!

Being acknowledged for providing feedback on a paper draft is a very kind and generous token of appreciation! Here is a running list of papers I have been acknowledged in:

Cambridge Lecture Notes and Supervision Sheets

Here is my lecture handout on Language Model Evaluation for the L95 Module for the MPhil in Advanced Computer Science: L95 Lecture | Michaelmas 2024

I am currently in the process of collating a collection of lecture notes, handouts and supervision materials for courses I am supervising (largely in the Computer Science Tripos) or have previously taken.

I supervise several courses for Cambridge Computer Science undergraduates:
