Cambridge University UROPs 2025: Small Language Models Research Projects

Week Topic Course Materials
1 (14/07) BabyLMs and Multilingual Evaluation Slides | Notes | Code
Required Readings:
Findings of the 1st BabyLM Challenge
Findings of the 2nd BabyLM Challenge
Handout on Language Model Evaluation
Recommended Additional Readings:
CLIMB – Curriculum Learning for Infant-inspired Model Building
Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies
Practical Tasks:
TO DO
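As background for the evaluation handout listed above: perplexity is the standard intrinsic metric for language models, defined as the exponentiated average negative log-probability assigned to held-out tokens. A minimal sketch (the per-token probabilities below are invented purely for illustration):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-probability per token."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Hypothetical per-token log-probabilities from some language model.
log_probs = [math.log(0.25), math.log(0.5), math.log(0.125)]
print(perplexity(log_probs))  # ≈ 4.0 (inverse geometric mean of the probabilities)
```

Lower perplexity means the model finds the held-out text less surprising; BabyLM-style evaluations typically pair it with targeted syntactic benchmarks rather than relying on it alone.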
2 (21/07) Tokenisation and Interpretability; Bilingual BabyLM Training and Evaluation Slides | Notes | Code
Required Readings:
The Linear Representation Hypothesis and the Geometry of Large Language Models
Slides from Arthur Conmy (Google DeepMind)
Recommended Additional Readings:
Universal Dependencies
On the Acquisition of Shared Grammatical Representations in Bilingual Language Models
MultiBLiMP 1.0
Practical Tasks:
The TransformerLens Library
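Benchmarks in the BLiMP family, including MultiBLiMP above, score a model by whether it assigns higher probability to the grammatical member of each minimal pair. A toy sketch of that scoring loop, using an invented word-frequency scorer in place of a trained BabyLM's log-probabilities:

```python
import math

def minimal_pair_accuracy(pairs, score):
    """Fraction of (grammatical, ungrammatical) pairs on which the model
    prefers the grammatical sentence (higher score = more likely)."""
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)

# Hypothetical scorer: a tiny unigram log-probability model over made-up counts.
freq = {"the": 5, "cat": 3, "cats": 2, "sleeps": 2, "sleep": 3}
def toy_score(sentence):
    return sum(math.log(freq.get(w, 1)) for w in sentence.split())

pairs = [("the cat sleeps", "the cat sleep"),
         ("the cats sleep", "the cats sleeps")]
# A unigram model is blind to agreement, so it lands at chance here.
print(minimal_pair_accuracy(pairs, toy_score))
```

The chance-level result is the point of the toy: minimal pairs isolate exactly the grammatical contrasts that frequency alone cannot capture, which is why they are a useful probe of what small models have actually acquired.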
3 (28/07) Pretraining Language Models: The Pico Framework Slides | Notes | Code
Required Readings:
Pico Train Tutorial
Pico Analyze Tutorial
Recommended Additional Readings:
TO DO
Practical Tasks:
TO DO
4 (04/08) BabyLM Architectures & Feedback (ALTA CST) Slides | Notes | Code
Required Readings:
BabyLlama
TO DO
Practical Tasks:
TO DO
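BabyLlama trains a small student model by distillation: the student is pushed toward the teachers' temperature-softened next-token distributions rather than one-hot targets. A minimal sketch of the soft cross-entropy term in plain Python, with invented toy numbers (a real run would use the framework's tensor ops):

```python
import math

def soft_cross_entropy(teacher_probs, student_logits, temperature=2.0):
    """Cross-entropy between a teacher distribution and the student's
    temperature-softened softmax: the core distillation loss term."""
    scaled = [z / temperature for z in student_logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    student_log_probs = [math.log(e / total) for e in exps]
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))

# Toy example: teacher mass over three tokens, invented student logits.
loss = soft_cross_entropy([0.7, 0.2, 0.1], [2.0, 0.5, -1.0])
print(round(loss, 3))
```

In practice this term is mixed with the ordinary hard-label cross-entropy, and the temperature controls how much of the teachers' probability over non-target tokens the student is asked to imitate.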
5 (11/08) Mechanistic and Developmental Interpretability Slides | Notes | Code
Required Readings:
TO DO
Recommended Additional Readings:
TO DO
Practical Tasks:
Interim Project Presentation
6 (18/08) Train Your Own BabyLM From Scratch Slides | Notes | Code
Interaction Track
Multimodal Track
Required Readings:
TO DO
Practical Tasks:
TO DO
7 (25/08) Small Language Models – Frontier Problems Slides | Notes | Code
Required Readings:
The Linear Representation Hypothesis and the Geometry of Large Language Models
Recommended Additional Readings:
Slides from Arthur Conmy (Google DeepMind)
Practical Tasks:
TO DO
8 (01/09) Small Language Models – Frontier Problems (Architectures) Slides | Notes | Code
Required Readings:
The Linear Representation Hypothesis and the Geometry of Large Language Models
Recommended Additional Readings:
Slides from Arthur Conmy (Google DeepMind)
Practical Tasks:
Final Project Presentations and UROP Project Reports (due 5pm, Friday 5 September 2025)

Language Model Primers

Small Language Models

Find Out More!

Tokenisers
Non-Autoregressive Language Models
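Most tokenisers used in practice are byte-pair-encoding (BPE) variants: starting from characters, the most frequent adjacent symbol pair is repeatedly merged into a new vocabulary item. A minimal sketch of the training loop on a toy corpus (the word list is invented for illustration):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    # Represent each word as a tuple of symbols, starting from characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(bpe_merges(["low", "lower", "lowest", "low"], 2))
# → [('l', 'o'), ('lo', 'w')]
```

The learned merge list is the whole model: encoding new text just replays the merges in order, which is why BPE vocabularies transfer cheaply between experiments.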

Useful Links (Raven Access Required)

See Dr Andrew Caines's page for lots of useful advice, particularly about access to compute resources.
Dr Russell Moore developed his ML Commando Course for the 2021 ALTA UROPs; you may find it helpful.