Cambridge University UROPs 2025: Small Language Models Research Projects
| Week | Topic | Course Materials |
|---|---|---|
| 1 (14/07) | BabyLMs and Multilingual Evaluation | Slides \| Notes \| Code<br>Required Readings: Findings of the 1st BabyLM Challenge; Findings of the 2nd BabyLM Challenge; Handout on Language Model Evaluation<br>Recommended Additional Readings: CLIMB – Curriculum Learning for Infant-inspired Model Building; Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies<br>Practical Tasks: TO DO |
| 2 (21/07) | Tokenisation and Interpretability; Bilingual BabyLM Training and Evaluation | Slides \| Notes \| Code<br>Required Readings: The Linear Representation Hypothesis and the Geometry of Large Language Models; Slides from Arthur Conmy (Google DeepMind)<br>Recommended Additional Readings: Universal Dependencies; On the Acquisition of Shared Grammatical Representations in Bilingual Language Models; MultiBLiMP 1.0<br>Practical Tasks: The TransformerLens Library |
| 3 (28/07) | Pretraining Language Models: The Pico Framework | Slides \| Notes \| Code<br>Required Readings: Pico Train Tutorial; Pico Analyze Tutorial<br>Recommended Additional Readings: TO DO<br>Practical Tasks: TO DO |
| 4 (04/08) | BabyLM Architectures & Feedback (ALTA CST) | Slides \| Notes \| Code<br>Required Readings: BabyLlama; TO DO<br>Practical Tasks: TO DO |
| 5 (11/08) | Mechanistic and Developmental Interpretability | Slides \| Notes \| Code<br>Required Readings:<br>Additional Readings:<br>Practical Tasks: Interim Project Presentation |
| 6 (18/08) | Train Your Own BabyLM From Scratch | Slides \| Notes \| Code \| Interaction Track \| Multimodal Track<br>Required Readings: TO DO<br>Practical Tasks: TO DO |
| 7 (25/08) | Small Language Models – Frontier Problems | Slides \| Notes \| Code<br>Required Readings: The Linear Representation Hypothesis and the Geometry of Large Language Models<br>Additional Readings: Slides from Arthur Conmy (Google DeepMind)<br>Practical Tasks: TO DO |
| 8 (01/09) | Small Language Models – Frontier Problems (Architectures) | Slides \| Notes \| Code<br>Required Readings: The Linear Representation Hypothesis and the Geometry of Large Language Models<br>Additional Readings: Slides from Arthur Conmy (Google DeepMind)<br>Practical Tasks: Final Project Presentations and UROP Project Reports (due Friday 5 September 2025, 5pm) |
Language Model Primers
- CS336 Language Models from Scratch
- CS224U Contextual Word Representations
- CS224U In-Context Learning
Small Language Models
- A Comprehensive Survey of Small Language Models in the Era of Large Language Models
- Small Language Models are the Future of Agentic AI
Find Out More!
Tokenisers
Non-Autoregressive Language Models
- CS224U Diffusion Models for Text
- What are Diffusion Language Models? – blog post by Xiaochen Zhu (PhD student, NLIP Group)
Useful Links (Raven Access Required)
See Dr Andrew Caines's page for lots of useful advice, particularly about access to compute resources.
Dr Russell Moore previously developed his ML Commando Course for the 2021 ALTA UROPs, which you might find helpful.