
Suchir Salhan




Research Interests

Small Language Models

My research is concerned with building data-efficient small Language Models. While industry-led efforts have built competitive LLMs that have fundamentally reshaped the work of the contemporary (academic) Natural Language Processing researcher, there remain several fundamental, open questions that are not obviously ancillary to those pursued by commercial AI research labs.

For small LMs, tight parameter budgets make each decision critical, yet researchers still lack systematic, scientific ways to test and refine new ideas. We introduce PicoLM, a lightweight, modular framework that enables systematic, hypothesis-driven research for small and medium-scale language model development. Pico consists of two libraries that together provide a practical sandbox where researchers can make targeted changes to a model's architecture or training procedures and directly observe their effects on the model's behaviour. To support reproducible experimentation, we also release a suite of baseline models, trained under standardised conditions and open-sourced for the community. Check out the YouTube Video put together by Zeb Goriely: Introducing PicoLM | YouTube.
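To make the workflow concrete, here is a minimal sketch of the kind of hypothesis-driven experiment Pico is designed to support: change one design choice, hold everything else fixed, and track a learning-dynamics signal across training. The model, names and metric below are illustrative stand-ins written in plain PyTorch, not Pico's actual API.

```python
# Illustrative only: a toy "targeted change" experiment in plain PyTorch.
# None of these names come from Pico; they sketch the workflow the framework
# aims to make systematic (vary one choice, observe its learning dynamics).
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, STEPS = 256, 32, 200

def make_model(d_model: int, tie_embeddings: bool) -> nn.Module:
    """A minimal byte-level LM; `tie_embeddings` is the single targeted change."""
    class TinyLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, d_model)
            self.rnn = nn.GRU(d_model, d_model, batch_first=True)
            self.head = nn.Linear(d_model, VOCAB, bias=False)
            if tie_embeddings:
                self.head.weight = self.embed.weight  # weight tying
        def forward(self, x):
            h, _ = self.rnn(self.embed(x))
            return self.head(h)
    return TinyLM()

def train_and_track(model: nn.Module) -> list[float]:
    """Train briefly on stand-in data and record loss at regular checkpoints."""
    opt = torch.optim.AdamW(model.parameters(), lr=3e-3)
    losses = []
    for step in range(STEPS):
        batch = torch.randint(0, VOCAB, (8, SEQ_LEN + 1))  # stand-in data
        logits = model(batch[:, :-1])
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1)
        )
        opt.zero_grad(); loss.backward(); opt.step()
        if step % 50 == 0:
            losses.append(loss.item())  # checkpointed learning-dynamics metric
    return losses

for tied in (True, False):
    print("tied" if tied else "untied", train_and_track(make_model(64, tied)))
```

Toggling weight tying while holding everything else fixed is exactly the sort of single-variable comparison the framework is intended to make routine and reproducible.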

Developmental Interpretability (DevInterp) and Language Model Learning Dynamics Research are fundamental to developing small LMs. Small LMs are crucial for low-resourced and democratised Language Modelling, and should still be a serious focus of hypothesis-driven scientific exploration.

Cognitively-Inspired AI

The second strand of my research focuses on cognitively-inspired AI. The scientific study of the human capacity for Language demands a multi-model approach: one that combines a characterisation of the universal and language- or speaker-specific substance of grammars with the inductive biases that support their emergence in language acquisition, and an account of how these conditions interact with domain-general cognition. This is why I am, in part, drawn to the work of the BabyLM Workshop. Much of my work aligns with the goals of the Strict, Interaction, Multimodal (and future Multilingual!) Tracks. However, researchers interested in cognitively-inspired AI should be judicious in disentangling their rationale and motivations when working with BabyLMs (see my recent position paper for further discussion).

In joint work, we introduce ByteSpan, a dynamic tokenisation scheme that groups predictable bytes rather than pooling their representations. It was inspired by recent dynamic tokenisation methods (e.g., Meta's Byte Latent Transformer) that operate directly on bytes and pool their latent representations into patches, and by their similarity to computational models of word segmentation that determine lexical boundaries using spikes in an autoregressive model's prediction error.
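As a rough illustration of the underlying intuition (not the published ByteSpan algorithm), the sketch below groups bytes into patches by opening a new patch wherever per-byte surprisal spikes. In practice the surprisal values would come from an autoregressive byte-level model; the function name, threshold and toy values here are invented for this example.

```python
# Illustrative only: group bytes into patches using spikes in an autoregressive
# model's per-byte surprisal (prediction error). This sketches the intuition
# behind surprisal-based segmentation, not the actual ByteSpan method.
def segment_by_surprisal(byte_seq: bytes, surprisals: list[float],
                         threshold: float = 1.0) -> list[bytes]:
    """Open a new patch whenever surprisal jumps by more than `threshold`
    relative to the previous byte; otherwise extend the current patch."""
    assert len(byte_seq) == len(surprisals)
    patches, current = [], bytearray(byte_seq[:1])
    for i in range(1, len(byte_seq)):
        if surprisals[i] - surprisals[i - 1] > threshold:  # prediction-error spike
            patches.append(bytes(current))
            current = bytearray()
        current.append(byte_seq[i])
    patches.append(bytes(current))
    return patches

# Toy usage with hand-written surprisal values (in practice these would come
# from a byte-level LM): spikes at word-initial bytes start new patches,
# so predictable word-internal bytes stay grouped together.
text = b"the cat sat"
surp = [3.0, 0.5, 0.4, 2.8, 2.5, 0.6, 0.5, 2.9, 2.4, 0.7, 0.5]
print(segment_by_surprisal(text, surp))  # [b'the', b' cat', b' sat']
```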

Small Language Models offer opportunities for more precise characterisation of the diversity of language learners in multilingual language scenarios, and the simulation of learner characteristics offers plenty of opportunities for improving the quality, personalisation and explainability of language learning tools. I actively explore this with local collaborators in the Computer Lab.

I also actively pursue research interests in Linguistics and Cognitive Science. My work aims to cross-cut modern Generativist approaches – specifically, so-called neo-emergentist approaches pursued by several Cambridge-based theoretical linguists (see my previous work for a representative computational cognitive model) – and information-theoretic work traditionally pursued by functionalists, which has been born out of the fruitful mentorship of Dr Fermín Moscoso del Prado Martín. See our ACL 2025 Poster (Main Conference) and ACL Presentation Slides for examples of work on parsing and language processing/acquisition (born out of the Cambridge L95 Parsing and Syntax course, for which I was a Teaching Assistant and Guest Lecturer). More recently, we have been working on information-theoretic and probabilistic models of cross-lingual phonemic distributions (see the slides from an invited keynote talk at "The 13th International Conference on the Mental Lexicon" in Montréal, Québec, Canada here) – stay tuned for more updates on this front!

Research News

Measuring Grammatical Diversity from Small Corpora: Derivational Entropy Rates, Mean Length of Utterances, and Annotation Invariance (Fermín Moscoso del Prado Martín, Suchir Salhan) @ ACL Main Conference (Computational Linguistics Journal Accepted Paper) (Vienna, Austria), July 2025.

ByteSpan @ ICML Tokenisation Workshop (Vancouver, Canada), July 2025

July - August 2025 – I am very excited to announce that I will be co-supervising two Undergraduate Research Opportunities Programme (UROP) students working on Small Language Models, based on Pico, our learning dynamics framework. Stay tuned for more information about the UROPs and the application process, or email sas245@cam.ac.uk or pjb48@cam.ac.uk if you have any questions.

July - August 2025 - I am a Google DeepMind Research Ready Mentor with Richard Diehl Martinez and Prof Paula Buttery, supported by Google DeepMind, the Hg Foundation and the Royal Academy of Engineering.

June 2025 - Dr Fermín Moscoso del Prado Martín & Suchir Salhan. Invited Keynote Talk at "The 13th International Conference on the Mental Lexicon" in Montréal, Québec, Canada: "The Distribution of Phonemes across Languages: Chance, costs, and integration across linguistic tiers".

June 2025 - Accepted Position Paper on BabyLMs and Linguistic Theory in the Cambridge Occasional Papers in Linguistics (CoPiL): Linguistics in the Age of Language Models: What Can Cognitively-Inspired Language Models Offer to Linguistic Theory?

June 2025 - Posters at the Cambridge Learning & Human Intelligence (LHI) Expo (Department of Computer Science & Technology), Cambridge Centre for Human-Inspired AI.

In Easter 2025, Dr Weiwei Sun and I are co-organising a reading group on Computational Models of Language. We meet weekly in the Department, with hybrid options for those joining online, to discuss Neural Grammar Induction, Computational Models of Acquisition, Change and Language Evolution. To join or express interest, email sas245@cam.ac.uk or ws390@cam.ac.uk.

March 2025 - We released PicoLM, the Cambridge Small Language Model & Learning Dynamics Framework. Check out the YouTube Video put together by Zeb Goriely: Introducing PicoLM | YouTube.

March 2025 - I attended and presented a poster at HumanCLAIM, kindly organised by Prof Lisa Beinborn (Göttingen, Germany).

March 2025 - I delivered a presentation on "Human Validated Grammar Profiles for Language Models" in a colloquium organised by Prof Detmar Meurers (Tübingen, Germany).

In Lent 2025, I am the Teaching Assistant (a new role equivalent to Lead Demonstrator) for CST IA Machine Learning & Real World Data and am a supervisor for Machine Learning & Bayesian Inference (MLBI) [CST Part II].

I organise the Natural Language & Information Processing (NLIP) Seminars in the Department of Computer Science & Technology, University of Cambridge. These are weekly seminars with speakers presenting their research on Language Models, Machine Learning, Cognitive Modelling and more traditional topics in Computational Linguistics.

November 2024 - I delivered my first guest lecture, for an MPhil course at the University of Cambridge with Prof Buttery and Dr Fermín Moscoso del Prado Martín, on Language Model Evaluation – at 22, this was a great opportunity and privilege so early in my "formal" academic career.

November 2024 - I presented my MEng Thesis @ The 2nd BabyLM Workshop, co-located with CoNLL at EMNLP (Miami, Florida, U.S.). See the ACL paper here: https://aclanthology.org/2024.conll-babylm.15/. Thanks to my Part III Supervisors (Zeb, Richard, Andrew and Paula) for all their help!

Team & Collaborators

Collaborators

I am very grateful to a number of academics, researchers and PhD students with whom I am currently working or have previously worked:

  • The Cambridge PicoLM Team: Richard Diehl Martinez (and his MPhil/Part III students: Yuval Weiss and David Demitri Africa), Ryan Daniels (a Machine Learning Engineer with the Accelerate Programme for Scientific Discovery), and Prof Paula Buttery.
  • Zeb Goriely, Pietro Lesci and Julius Cheng.
  • XFACT (KAIST AI) and NAVER Cloud: Jiwoo Hong (KAIST), Dr James Thorne, Jeonghoon Kim (NAVER Cloud, KAIST AI) and Woojin Chung.
  • Dr Fermín Moscoso del Prado Martín – we are working on information-theoretic models of diachronic phonological typology. Paul Siewert and I have also been thinking about Category Theoretic approaches in Linguistics.
  • Yuan Gao (CST), Mila Marcheva (CST) and Nuria Bosch Masip (TAL).
  • Gabrielle Gaudeau, Dr Diana Galvan Sosa, Dr Donya Rooin (MilaNLP, Bocconi University), Dr Zheng Yuan and Hongyi (Sam) Gu (KCL/NetMind.AI).
  • Dr Konstantinos Voudouris (Institute for Human-Centered AI at Helmholtz Munich).
  • MPhil Students

I am always open to discussing research ideas that might (or might not!) be related to these directions!

Being acknowledged for providing feedback on a paper draft is a very kind and generous token of appreciation! Here is a running list of papers I have been acknowledged in:

Cambridge Lecture Notes and Supervision Sheets

Here is my lecture handout on Language Model Evaluation for the L95 Module for the MPhil in Advanced Computer Science: L95 Lecture | Michaelmas 2024

I am currently in the process of collating a collection of lecture notes, handouts and supervision materials for courses I am supervising (largely in the Computer Science Tripos) or have previously taken.

I supervise several courses for Cambridge Computer Science undergraduates:
