Jean Kaddour

Currently, I'm developing PySpur, a graph-based editor for AI agents. At some point, I will graduate with my PhD in LLMs supervised by Ricardo Silva and Matt Kusner at UCL. I am based in London, UK.

Publications

[0] REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Stojanovski et al., arXiv 2025

What: 100+ RL envs across 8 domains with configurable complexity.

Why: RL is so back thanks to R1. More envs, more data, more RL.

[1] PySpur: A visual playground for AI Agents
Kaddour et al., GitHub 2025

What: A Python package with a UI for building and debugging agent scaffolds. Used by several enterprises.

Why: Debugging long-running agents in a terminal gets cumbersome.

[2] Humanity's Last Exam
Phan et al., arXiv 2025

What: A really hard multiple-choice science benchmark for LLMs.

Why: Previous benchmarks got hill-climbed quickly, but this one will remain the last one standing (trust me, bro).

[3] BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Zhuo et al., ICLR 2025 (Oral)

What: 1k+ diverse, multi-tool-use programming tasks in Python.

Why: Other code benchmarks were too monotonous (e.g., Django) and lacked tool calls.

[4] Are We Done with MMLU?
Gema et al., NAACL 2025

What: We expose serious flaws in MMLU and release a smaller and cleaner version, MMLU-Redux.

Why: MMLU is one of the most popular LLM benchmarks; better benchmarks, better models.

[5] Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models
Tyukin et al., arXiv 2024

What: We can remove up to 33% of the attention layers in Llama2 with negligible performance loss.

Why: Removing attention layers makes inference faster and cheaper.
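
For intuition, here's a minimal PyTorch sketch of the idea (not the paper's exact procedure): a toy pre-norm transformer in which the attention sub-layer of selected blocks is skipped while the MLP sub-layer is kept. The layer sizes and the choice to drop the deepest third are placeholder assumptions.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """A simplified pre-norm transformer block."""
    def __init__(self, d_model, n_heads, drop_attention=False):
        super().__init__()
        self.drop_attention = drop_attention
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        if not self.drop_attention:            # skip the attention sub-layer entirely
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.ln2(x))       # the MLP sub-layer is always kept

# Toy 12-layer stack with attention dropped in the deepest third of the blocks.
n_layers, d_model, n_heads = 12, 256, 8
blocks = nn.ModuleList(
    Block(d_model, n_heads, drop_attention=(i >= 2 * n_layers // 3))
    for i in range(n_layers)
)
x = torch.randn(2, 16, d_model)                # (batch, sequence, hidden)
for block in blocks:
    x = block(x)
print(x.shape)                                 # torch.Size([2, 16, 256])
```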

[6] Challenges and Applications of Large Language Models
Kaddour et al., arXiv 2023

What: An opinionated review of 16 challenges for LLMs.

Why: The field is moving fast; it's hard to keep up with what's worth solving.

Trivia: This doc started as notes I took during an internship to teach myself about LLMs.

[7] Early Weight Averaging meets High Learning Rates for LLM Pre-training
Sanyal et al., COLM 2024, NeurIPS 2023 WANT

What: We scale up LAWA (see below) to large models.

Why: Large model training -> large batch sizes -> large LRs -> LAWA makes (even more) sense.

[8] Local LoRA: Memory-Efficient Fine-Tuning of Large Language Models
Key et al., NeurIPS 2023 WANT

What: A method for fine-tuning an arbitrarily large model chunk by chunk (in isolation).

Why: Allowing the GPU-poor to fine-tune some LLMs too.

[9] Synthetic Data Generation in Low-Resource Settings via Fine-Tuning of Large Language Models
Kaddour and Liu, arXiv 2023

What: Knowledge distillation via synthetic data generation after fine-tuning the teacher.

Why: Teachers are more sample-efficient; by fine-tuning them, we can generate synthetic data for students.

[10] No Train No Gain: Revisiting Efficient Training Algorithms for Transformer-Based Language Models
Kaddour et al., NeurIPS 2023

What: A simple budget-aware LR scheduler outperforms most fancy efficient training methods.

Why: Every day, there's a new efficient training algorithm; the ones we tried weren't that effective.

Trivia: This started with us trying some novel ideas that never outperformed our baseline; then we realized the baseline itself was quite competitive.
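
As a rough illustration of what "budget-aware" means here, a minimal PyTorch sketch (placeholder numbers and schedule shape, not the paper's exact configuration): warm up, then decay the learning rate so it reaches zero exactly at the number of steps you can afford.

```python
import math
import torch

budget_steps, warmup_steps, base_lr = 10_000, 200, 3e-4   # hypothetical budget

def budget_aware_lr(step):
    """Warm up, then decay so the LR hits zero exactly at the budget."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, budget_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

model = torch.nn.Linear(10, 10)                # stand-in for a language model
opt = torch.optim.AdamW(model.parameters(), lr=base_lr)
sched = torch.optim.lr_scheduler.LambdaLR(opt, budget_aware_lr)

for step in range(budget_steps):
    opt.step()                                 # (forward/backward omitted in this sketch)
    sched.step()
```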

[11] Evaluating Self-Supervised Learning for Molecular Graph Embeddings
Wang et al., NeurIPS 2023

What: A probing suite to profile molecular graph embeddings and evaluate GSSL methods.

Why: Downstream-only evaluations can be misleading; better probes yield more faithful assessments.

[12] TTIDA: Controllable Generative Data Augmentation via Text-to-Text and Text-to-Image Models
Yin et al., arXiv 2023

What: We generate synthetic training data for vision classification models.

Why: You can think of it as knowledge distillation from generative to discriminative models.

Trivia: This is sort of the training-equivalent of Spawrious (see below).

[13] MiniPile: A Challenge for Data-Efficient Language Models
Kaddour, arXiv 2023

What: Using embeddings and k-means, I construct a small and clean yet diverse pretraining corpus.

Why: The Pile is too large for GPU-poor academics.

Trivia: I reviewed examples of each k-means cluster during my daily tube commute.
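
A minimal sketch of the pipeline, with placeholder documents and random vectors standing in for real embeddings: cluster the corpus with k-means, eyeball a few examples per cluster, exclude the clusters flagged as low-quality, and keep the rest.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder corpus and embeddings; in practice, use real documents and an
# off-the-shelf sentence embedder.
rng = np.random.default_rng(0)
docs = [f"document {i}" for i in range(1_000)]
embeddings = rng.normal(size=(len(docs), 384))

# 1) Cluster the corpus in embedding space.
k = 20                                          # placeholder; a real corpus needs many more
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)

# 2) Inspect a few examples per cluster and flag low-quality clusters by hand.
for c in range(k):
    idx = np.where(km.labels_ == c)[0][:3]
    print(c, [docs[i][:60] for i in idx])
excluded = {3, 7}                               # hypothetical clusters to drop

# 3) Keep documents from the remaining clusters as the smaller, cleaner corpus.
kept = [doc for doc, label in zip(docs, km.labels_) if label not in excluded]
print(len(kept), "documents kept")
```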

[14] Spawrious: A Benchmark for Fine Control of Spurious Correlation Biases
Lynch et al., ICLR 2025 SCSL

What: A vision dataset of cute dogs with spurious correlations between dog breeds and backgrounds.

Why: Spurious correlations harm the reliability of vision models; previous benchmarks were too easy.

[15] When Do Flat Minima Optimizers Work?
Kaddour et al., NeurIPS 2022

What: We can find even flatter minima than SAM by adding weight averaging.

Why: SAM finds flat basins; WA finds flat points inside those basins.
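
A minimal sketch of the combination, assuming a toy model and made-up data: a SAM-style two-step update (perturb the weights along the gradient, then step using the gradient at the perturbed point), paired with PyTorch's AveragedModel to keep a running weight average. The radius rho and the averaging frequency are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel

model = nn.Linear(20, 2)                        # stand-in for a real network
base_opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
avg_model = AveragedModel(model)                # running average of the iterates
loss_fn = nn.CrossEntropyLoss()
rho = 0.05                                      # SAM neighborhood radius (hypothetical)

def sam_step(x, y):
    # 1) Gradient at the current weights.
    loss_fn(model(x), y).backward()
    grads = [p.grad for p in model.parameters()]
    norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2) + 1e-12
    # 2) Ascend to the (approximate) worst-case point in an L2 ball of radius rho.
    with torch.no_grad():
        eps = [rho * g / norm for g in grads]
        for p, e in zip(model.parameters(), eps):
            p.add_(e)
    base_opt.zero_grad()
    # 3) Gradient at the perturbed point, then undo the perturbation.
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)
    # 4) Base optimizer step using the SAM gradient.
    base_opt.step()
    base_opt.zero_grad()

for step in range(100):
    x, y = torch.randn(32, 20), torch.randint(0, 2, (32,))
    sam_step(x, y)
    if step % 10 == 0:                          # fold the current weights into the average
        avg_model.update_parameters(model)
```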

[16] Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging
Kaddour, NeurIPS 2022 HITY Workshop

What: Surprisingly, LAtest Weight Averaging (LAWA), i.e., SWA in a FIFO way, is almost identical to decaying the LR.

Why: Training runs can last for months; wouldn't it be nice to make better use of intermediate checkpoints?

Trivia: NeurIPS folks thought I had a bug in my code, until the result was confirmed by several other works.
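
A minimal sketch of the FIFO idea, with a toy model and the forward/backward pass omitted: keep only the k most recent checkpoints in a deque and evaluate their uniform average. The checkpoint interval and window size are placeholders.

```python
import copy
from collections import deque

import torch
import torch.nn as nn

model = nn.Linear(10, 10)                       # stand-in for the training model
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
window = deque(maxlen=5)                        # FIFO: only the k most recent checkpoints

def lawa_average(checkpoints):
    """Uniformly average a list of (float-tensor) state dicts."""
    avg = copy.deepcopy(checkpoints[0])
    for key in avg:
        avg[key] = torch.stack([ckpt[key] for ckpt in checkpoints]).mean(dim=0)
    return avg

for step in range(1, 101):
    opt.step()                                  # (forward/backward omitted in this sketch)
    if step % 10 == 0:                          # snapshot every N steps
        window.append(copy.deepcopy(model.state_dict()))

eval_model = nn.Linear(10, 10)
eval_model.load_state_dict(lawa_average(list(window)))   # evaluate the averaged weights
```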

[17] Causal Machine Learning: A Survey and Open Problems
Kaddour et al., Foundations and Trends in Optimization 2022

What: A survey of how causality can be applied to ML problems.

Why: Causality lets you build assumptions about the data-generating process into your model.

[18] Causal Effect Inference for Structured Treatments
Kaddour et al., NeurIPS 2021

What: We generalize the Robinson decomposition to continuous vector treatments.

Why: In medicine or economics, we often deal with continuous, multivariate treatments.

Trivia: Made me realize that causal inference research lacks meaningful benchmarks.
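
For reference, the scalar-treatment Robinson decomposition that the paper generalizes (notation mine): in the partially linear model Y = τ(X)·T + g(X) + ε with E[ε | X, T] = 0, subtracting the conditional means gives

```latex
\[
  Y - \underbrace{\mathbb{E}[Y \mid X]}_{m(X)}
  \;=\; \tau(X)\,\bigl(T - \underbrace{\mathbb{E}[T \mid X]}_{e(X)}\bigr) + \varepsilon ,
\]
% so the effect function \tau can be fit by regressing outcome residuals on treatment residuals.
```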

[19] Probabilistic Active Meta-Learning
Kaddour et al., NeurIPS 2020

What: We make meta-learning more sample-efficient by letting the model guide the task selection.

Why: Acquiring task-specific datasets can be expensive and slow. Let's make sure it's worth it.

Trivia: This was my Master's thesis while studying at the wonderful Imperial College London.
