
Suchir Salhan

Suchir Salhan is a Computer Science PhD candidate at the University of Cambridge, working on Small Language Models under the supervision of Prof Paula Buttery. He is the Founder of Per Capita Media, Cambridge University's newest independent media organisation.

To contact me, email sas245@cam.ac.uk.

Research Interests

Small Language Models

My research is concerned with building data-efficient small Language Models. While industry-led efforts have built competitive LLMs that have fundamentally shifted the job of the contemporary academic Natural Language Processing researcher in various ways, there are still several fundamental, open questions to address that are not obviously ancillary to those pursued by commercial AI research labs.

For small LMs, tight parameter budgets make each decision critical, yet researchers still lack systematic, scientific ways to test and refine new ideas. We introduce PicoLM, a lightweight, modular framework that enables systematic, hypothesis-driven research for small and medium-scale language model development. Pico consists of two libraries that together provide a practical sandbox where researchers can make targeted changes to a model's architecture or training procedures and directly observe their effects on the model's behaviour. To support reproducible experimentation, we also release a suite of baseline models, trained under standardised conditions and open-sourced for the community. Check out the YouTube Video put together by Zeb Goriely: Introducing PicoLM | YouTube.
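
To make the "targeted change, observe effect" idea concrete, below is a minimal, self-contained sketch in plain PyTorch. It is illustrative only: the names and structure are mine for this page, not Pico's actual API. Two otherwise identical tiny language models differ in a single architectural choice, and their checkpointed losses are recorded:

    # Illustrative sketch, not Pico's actual API: a hypothesis-driven A/B run
    # that varies exactly one architectural choice and logs learning dynamics.
    import torch
    import torch.nn as nn

    VOCAB, DIM, SEQ, STEPS = 256, 64, 32, 200

    class TinyLM(nn.Module):
        def __init__(self, act: nn.Module):
            super().__init__()
            self.emb = nn.Embedding(VOCAB, DIM)
            self.body = nn.Sequential(nn.Linear(DIM, DIM), act, nn.Linear(DIM, VOCAB))

        def forward(self, x):  # x: (batch, seq) byte ids -> next-byte logits
            return self.body(self.emb(x))

    def run(act: nn.Module, seed: int = 0) -> list[float]:
        torch.manual_seed(seed)  # identical init and data up to the intervention
        model, losses = TinyLM(act), []
        opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
        for step in range(STEPS):
            x = torch.randint(0, VOCAB, (8, SEQ))  # stand-in for a real corpus
            logits = model(x[:, :-1])
            loss = nn.functional.cross_entropy(
                logits.reshape(-1, VOCAB), x[:, 1:].reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
            if step % 50 == 0:
                losses.append(loss.item())  # checkpointed learning dynamics
        return losses

    # The single targeted change: GELU vs ReLU, everything else held fixed.
    print("gelu:", run(nn.GELU()))
    print("relu:", run(nn.ReLU()))

Because the seed and data stream are held fixed, any divergence between the two loss trajectories is attributable to the single intervention.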

Developmental Interpretability (DevInterp) and Language Model Learning Dynamics Research are fundamental to developing small LMs. Small LMs are crucial for low-resourced and democratised Language Modelling, and should still be a serious focus of scientific exploration.

Cognitively-Inspired AI

The second strand of my research focuses on cognitively-inspired AI. The scientific study of the human capacity for Language demands a multi-model approach: one that combines a characterisation of the universal and language- or speaker-specific substance of grammars, the inductive biases that support their emergence in language acquisition, and an account of how these conditions interact with domain-general cognition. This is why I am, in part, drawn to the work of the BabyLM Workshop. Much of my work aligns with the goals of the Strict, Interaction, Multimodal (and future Multilingual!) Tracks. However, researchers interested in cognitively-inspired AI should be judicious in disentangling their rationale and motivations when working with BabyLMs (see my recent position paper for further discussion).

In joint work, we introduce ByteSpan, a dynamic tokenisation scheme that groups predictable bytes together, rather than pooling their representations. It was inspired by recent dynamic tokenisation methods (e.g., Meta's Byte Latent Transformer) that operate directly on bytes and pool their latent representations into patches, and by their similarities to computational models of word segmentation that determine lexical boundaries using spikes in an autoregressive model's prediction error.
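
As a rough illustration of that segmentation intuition (this is not the ByteSpan implementation; the spike-threshold rule and the numbers below are assumptions for exposition), the following Python sketch places a patch boundary wherever an autoregressive model's per-byte surprisal jumps:

    # Minimal sketch of the boundary intuition, not the ByteSpan implementation:
    # start a new patch wherever per-byte surprisal spikes, so that runs of
    # predictable bytes are grouped into the same patch.

    def segment(byte_ids, surprisal, threshold=1.0):
        """Split a byte sequence into patches at surprisal spikes.

        surprisal[i] = -log p(byte_i | bytes_<i), supplied by any autoregressive
        byte model. A rise of more than `threshold` over the previous byte marks
        the onset of a new, harder-to-predict unit, mirroring prediction-error
        models of word segmentation. The threshold rule is an assumption here.
        """
        patches, current = [], [byte_ids[0]]
        for i in range(1, len(byte_ids)):
            if surprisal[i] - surprisal[i - 1] > threshold:
                patches.append(current)  # close the predictable span
                current = []
            current.append(byte_ids[i])
        patches.append(current)
        return patches

    # Toy example: "the cat", with hypothetical surprisal spiking at word onsets.
    text = b"the cat"
    surp = [3.0, 0.4, 0.3, 0.2, 2.8, 0.5, 0.3]
    print([bytes(p) for p in segment(list(text), surp)])  # [b'the ', b'cat']

On the toy input, the spike at the onset of "cat" separates the two words into their own patches, in the spirit of prediction-error accounts of word segmentation.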

Small Language Models offer opportunities for more precise characterisation of the diversity of language learners in multilingual language scenarios, and the simulation of learner characteristics offers plenty of opportunities for improving the quality, personalisation and explainability of language learning tools. I actively explore this with local collaborators in the Computer Lab.

I also actively pursue research interests in Linguistics and Cognitive Science. My work aims to cross-cut modern Generativist approaches (specifically, the so-called neo-emergentist approaches pursued by several Cambridge-based theoretical linguists; see my previous work for a representative computational cognitive model) and information-theoretic work traditionally pursued by functionalists. This strand has grown out of the fruitful mentorship of Dr Fermín Moscoso del Prado Martín.

Research News

ByteSpan @ ICML Tokenisation Workshop (Vancouver, Canada)

I am very excited to announce that I will be co-supervising two Undergraduate Research Opportunity (UROP) students working on Small Language Models, based on Pico, our learning dynamics framework. Stay tuned for more information about the UROPs and the application process, or email sas245@cam.ac.uk or pjb48@cam.ac.uk if you have any questions.

In Easter 2025, Dr Weiwei Sun and I are co-organising a reading group on Computational Models of Language. We meet weekly in the Department, with hybrid options for those joining online, to discuss Neural Grammar Induction, Computational Models of Acquisition, Change and Language Evolution. To join or express interest, email sas245@cam.ac.uk or ws390@cam.ac.uk.

March 2025 - We released PicoLM, the Cambridge Small Language Model & Learning Dynamics Framework (see the video linked above).

March 2025 - I attended and presented a poster at HumanCLAIM in Göttingen, Germany, kindly organised by Prof Lisa Beinborn.

March 2025 - I delivered a presentation in Tübingen on “Human Validated Grammar Profiles for Language Models” in a colloquium organised by Prof Detmar Meurers.

In Lent 2025, I am the Teaching Assistant (a new role, equivalent to Lead Demonstrator) for CST IA Machine Learning & Real World Data, and a supervisor for Machine Learning & Bayesian Inference (MLBI) [CST Part II].

I organise the Natural Language & Information Processing (NLIP) Seminars in the Department of Computer Science & Technology, University of Cambridge. These are weekly seminars with speakers presenting their research on Language Models, Machine Learning, Cognitive Modelling and more traditional topics in Computational Linguistics.

November 2024 - I delivered my first guest lecture, for an MPhil course at the University of Cambridge with Prof Buttery and Dr Fermín Moscoso del Prado Martín on Language Model Evaluation. Aged 22, this was a great opportunity and privilege so early in my “formal” academic career.

November 2024 - I presented my MEng Thesis at EMNLP in Miami. See the ACL paper here: https://aclanthology.org/2024.conll-babylm.15/. Thanks to my Part III Supervisors (Zeb, Richard, Andrew and Paula) for all their help!

Team & Collaborators

I am very grateful to a number of academics, researchers and PhD students with whom I am currently working or have previously worked:

I am always open to discussing research ideas that might (or might not!) be related to these directions!

Being acknowledged for providing feedback on a paper draft is a very kind and generous token of appreciation! Here is a running list of papers I have been acknowledged in:

Cambridge Lecture Notes and Supervision Sheets

Here is my lecture handout on Language Model Evaluation for the L95 Module for the MPhil in Advanced Computer Science: L95 Lecture | Michaelmas 2024

I am currently in the process of collating a collection of lecture notes, handouts and supervision materials for courses I am supervising (largely in the Computer Science Tripos) or have previously taken.

I supervise several courses for Cambridge Computer Science undergraduates:
