Sanyal et al., COLM 2024, NeurIPS 2023 WANT
• What: We scale up LAWA (see below) to large models.
• Why: Large model training -> large batch sizes -> large LRs -> LAWA makes (even more) sense.
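• Sketch: a minimal version of latest weight averaging (LAWA): keep the k most recent checkpoints and evaluate their uniform average. The training-loop names (train_one_step, num_steps, save_every) are placeholders, not the paper's code.

    import copy
    from collections import deque

    import torch

    def lawa_average(checkpoints):
        """Uniformly average a list of state dicts (ignoring non-float buffers for brevity)."""
        avg = copy.deepcopy(checkpoints[0])
        for key in avg:
            avg[key] = torch.stack([c[key].float() for c in checkpoints]).mean(dim=0)
        return avg

    k = 5
    buffer = deque(maxlen=k)
    for step in range(num_steps):                      # num_steps etc. assumed defined
        train_one_step(model, optimizer)               # placeholder train step
        if step % save_every == 0:
            buffer.append(copy.deepcopy(model.state_dict()))

    # Evaluate a copy of the model with the averaged weights.
    eval_model = copy.deepcopy(model)
    eval_model.load_state_dict(lawa_average(list(buffer)))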
Kaddour et al., NeurIPS 2023
• What: A simple budget-aware LR scheduler outperforms most fancy efficient training methods.
• Why: Every day, there's a new efficient training algorithm; the ones we tried weren't that effective.
• Trivia: We started by trying some ideas that never outperformed our baseline; then we realized that our baseline was quite competitive.
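• Sketch: one concrete budget-aware schedule: fix the step budget up front and decay the LR to zero exactly when the budget runs out. Linear warmup plus cosine decay here; the exact shape is an assumption, not necessarily the paper's.

    import math

    def budget_aware_lr(step, budget_steps, peak_lr, warmup_steps=100):
        """LR that warms up, then hits zero exactly at the training budget."""
        if step < warmup_steps:
            return peak_lr * step / warmup_steps
        progress = (step - warmup_steps) / max(1, budget_steps - warmup_steps)
        return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))

    lrs = [budget_aware_lr(s, budget_steps=10_000, peak_lr=3e-4) for s in range(10_000)]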
Kaddour, arXiv 2023
• What: Using embeddings and k-means, I construct a small and clean yet diverse pretraining corpus.
• Why: The Pile is too large for GPU-poor academics.
• Trivia: I reviewed examples of each k-means cluster during my daily tube commute.
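• Sketch: roughly how the clustering step could look, assuming the documents are already in memory (the embedder and cluster count here are illustrative choices, not necessarily the paper's):

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    docs = ["first document ...", "second document ..."]  # stand-in for the raw corpus
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = embedder.encode(docs, normalize_embeddings=True)

    # Cluster the corpus; manual review then keeps or drops whole clusters,
    # and the kept clusters are subsampled into the final small corpus.
    kmeans = KMeans(n_clusters=2, random_state=0).fit(embeddings)  # many more clusters in practice
    for cluster_id in range(kmeans.n_clusters):
        members = [d for d, c in zip(docs, kmeans.labels_) if c == cluster_id]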
Kaddour et al., NeurIPS 2022
• What: We can find even flatter minima than SAM by adding weight averaging.
• Why: SAM finds flat basins; WA finds flat points inside those basins.
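• Sketch: a minimal version of SAM plus a running weight average (an EMA here for brevity; the paper's averaging scheme may differ, and model/loader/optimizer/loss_fn are placeholders):

    import copy
    import torch

    rho, decay = 0.05, 0.999
    avg_model = copy.deepcopy(model)               # the averaged "flat point"

    for inputs, targets in loader:
        # SAM step 1: ascend to the worst-case point within an L2 ball of radius rho.
        loss_fn(model(inputs), targets).backward()
        params = [p for p in model.parameters() if p.grad is not None]
        with torch.no_grad():
            grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
            eps = [rho * p.grad / (grad_norm + 1e-12) for p in params]
            for p, e in zip(params, eps):
                p.add_(e)
        model.zero_grad()
        # SAM step 2: gradient at the perturbed point, then undo the perturbation.
        loss_fn(model(inputs), targets).backward()
        with torch.no_grad():
            for p, e in zip(params, eps):
                p.sub_(e)
        optimizer.step()
        model.zero_grad()
        # Weight averaging on top: track a slow average of the SAM iterates.
        with torch.no_grad():
            for pa, p in zip(avg_model.parameters(), model.parameters()):
                pa.mul_(decay).add_(p, alpha=1 - decay)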
Kaddour, NeurIPS 2022 HITY Workshop
• What: Weight averaging = implicit LR decay.
• Why: We can evaluate intermediate checkpoints pre-LR decay, which is much cheaper.
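• In symbols: a back-of-the-envelope for the claim above, treating the gradients g_j as fixed vectors (which ignores that later gradients depend on earlier iterates):

    % SGD with constant LR \eta:  \theta_i = \theta_0 - \eta \sum_{j=0}^{i-1} g_j
    % Uniformly averaging the first k iterates:
    \bar{\theta}_k = \frac{1}{k}\sum_{i=1}^{k} \theta_i
                   = \theta_0 - \eta \sum_{j=0}^{k-1} \frac{k-j}{k}\, g_j
    % i.e., gradient g_j enters with effective step size \eta (k-j)/k: a linearly
    % decaying schedule, without ever touching the actual LR.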
Stojanovski et al., NeurIPS 2025 (Spotlight, Top 2%)
• What: 100+ RL envs across 8 domains with configurable complexity.
• Why: RL is so back thanks to DeepSeek-R1. More envs, more data, more RL.
Tyukin et al., arXiv 2024
• What: We can remove up to 33% of the attention layers in Llama2 with negligible performance loss.
• Why: Removing attention layers makes inference faster and cheaper.
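• Sketch: one way to drop attention sublayers in a Hugging Face Llama-style model: swap self_attn for a no-op so the residual stream passes through and only the MLPs remain. Which layers to drop and the wrapper's return arity are my assumptions (the tuple shape varies across transformers versions).

    import torch
    from transformers import AutoModelForCausalLM

    class SkipAttention(torch.nn.Module):
        """Returns a zero update, so `residual + attn_out` leaves activations unchanged."""
        def forward(self, hidden_states, **kwargs):
            # Adjust the tuple arity to your transformers version.
            return torch.zeros_like(hidden_states), None, None

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    layers = model.model.layers
    for i in range(len(layers) // 3):        # e.g., prune the deepest third
        layers[-(i + 1)].self_attn = SkipAttention()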
Key et al., NeurIPS 2023 WANT
• What: A method for fine-tuning an arbitrarily large model chunk by chunk (in isolation).
• Why: Allowing the GPU-poor to fine-tune some LLMs too.
• Trivia: Inspired by distributed training techniques, adapted for single-GPU fine-tuning.
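• Sketch: one way the chunk-by-chunk idea could look in practice: freeze everything, then unfreeze and train one contiguous slice of blocks at a time. A generic sketch under my own assumptions, not the paper's exact procedure; train_for_a_while, model, and loader are placeholders.

    import torch

    def iter_chunks(layers, chunk_size):
        for start in range(0, len(layers), chunk_size):
            yield layers[start:start + chunk_size]

    for p in model.parameters():
        p.requires_grad_(False)

    for chunk in iter_chunks(model.model.layers, chunk_size=4):
        chunk_params = [p for layer in chunk for p in layer.parameters()]
        for p in chunk_params:
            p.requires_grad_(True)
        opt = torch.optim.AdamW(chunk_params, lr=1e-4)
        train_for_a_while(model, opt, loader)          # placeholder inner loop
        for p in chunk_params:
            p.requires_grad_(False)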
Kaddour and Liu, arXiv 2023
• What: Knowledge distillation via synthetic data generation after fine-tuning the teacher.
• Why: Teachers are more sample-efficient; by fine-tuning them, we can generate synthetic data for students.
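• Sketch: the pipeline in miniature with Hugging Face transformers (the model name and prompt are stand-ins; assume the teacher was already fine-tuned on the target task):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2-xl")           # stand-in teacher
    teacher = AutoModelForCausalLM.from_pretrained("gpt2-xl")

    prompts = ["Review: This movie was"]                     # task-shaped prompts
    inputs = tok(prompts, return_tensors="pt")
    out = teacher.generate(**inputs, max_new_tokens=64,
                           do_sample=True, top_p=0.95)
    synthetic = tok.batch_decode(out, skip_special_tokens=True)
    # Mix `synthetic` into the student's fine-tuning set alongside the real data.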
Phan et al., arXiv 2025
• What: A really hard multiple-choice science benchmark for LLMs.
• Why: Previous benchmarks got hill-climbed quickly, but this one will remain the last one standing (trust me, bro).
Zhuo et al., ICLR 2025 (Oral, Top 2%)
• What: 1k+ diverse, multi-tool-use programming tasks in Python.
• Why: Other code benchmarks were too monotonous (e.g., Django) and lacked tool calls.
Gema et al., NAACL 2025
• What: We expose serious flaws in MMLU and release a smaller and cleaner version, MMLU-Redux.
• Why: MMLU is one of the most popular LLM benchmarks; better benchmarks, better models.
Wang et al., NeurIPS 2023
• What: A probing suite to profile molecular graph embeddings.
• Why: Downstream-only evaluations can be misleading; better probes yield more faithful assessments.
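• Sketch: the probing recipe in its simplest form: freeze the encoder, fit a linear model on its embeddings to predict a property, and read off the probe's accuracy (the random arrays stand in for real embeddings and labels):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X = np.random.randn(1000, 256)       # stand-in: frozen molecular-graph embeddings
    y = np.random.randint(0, 2, 1000)    # stand-in: a binary property to probe for

    probe = LogisticRegression(max_iter=1000)
    scores = cross_val_score(probe, X, y, cv=5)
    print(f"probe accuracy: {scores.mean():.3f}")  # high => property is linearly decodable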
Lynch et al., ICLR 2025 SCSL
• What: A vision dataset of cute dogs with spurious correlations between dog breeds and backgrounds.
• Why: Spurious correlations harm the reliability of vision models; previous benchmarks were too easy.
Kaddour et al., Github (5.6k stars)
• What: A Python package with a UI for building and debugging agents. Used by several enterprises.
• Why: Debugging long-running agents in a terminal gets cumbersome.
• Trivia: Building this taught me a lot about frontend and TypeScript.
Kaddour et al., arXiv 2023
• What: An opinionated review of 16 challenges for LLMs.
• Why: The field is moving fast; it's hard to keep up with what's worth solving.
• Trivia: This doc started as notes I took during an internship to teach myself about LLMs.
Yin et al., arXiv 2023
• What: We generate synthetic training data for vision classification models.
• Why: You can think of it as knowledge distillation from generative to discriminative models.
• Trivia: This is sort of the training-equivalent of Spawrious.
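• Sketch: the generative-to-discriminative handoff in miniature, with an off-the-shelf diffusion model as the generator (model choice, prompts, and sample counts are mine):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

    classes = ["golden retriever", "tabby cat"]
    dataset = []
    for label, name in enumerate(classes):
        for _ in range(100):
            img = pipe(f"a photo of a {name}").images[0]
            dataset.append((img, label))
    # Then train any standard image classifier on `dataset`.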
Kaddour et al., Foundations and Trends in Optimization, 2022
• What: A survey of how causality can be applied to ML problems.
• Why: Causality lets you state your assumptions about the data-generating process explicitly.
• Trivia: 3 years later, I'm surprised how far we've come with LLMs without any causality.
Kaddour et al., NeurIPS 2021
• What: We generalize the Robinson decomposition to treatment embeddings.
• Why: We can now use, e.g., an embedding of a drug's molecular graph.
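• In symbols: the classic Robinson residualization and the rough shape of the generalization (my notation; φ(T) is a learned treatment embedding, e.g., of a molecular graph):

    % Robinson (scalar treatment):
    Y - \mathbb{E}[Y \mid X] = \theta(X)\,\big(T - \mathbb{E}[T \mid X]\big) + \varepsilon
    % Generalized (structured treatment via an embedding \varphi):
    Y - \mathbb{E}[Y \mid X] = \big\langle \theta(X),\, \varphi(T) - \mathbb{E}[\varphi(T) \mid X] \big\rangle + \varepsilon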
Kaddour et al., NeurIPS 2020
• What: We make meta-learning more sample-efficient by letting the model guide the task selection.
• Why: Acquiring datasets can be expensive and slow. Let's make sure we make it worth it.
Notes on game design
How to make outputs more diverse and creative
Less criticizing, more construction.