Currently, I'm developing PySpur, a Graph-Based Editor for AI agents. At some point, I will graduate with my PhD in LLMs supervised by Ricardo Silva and Matt Kusner at UCL. I am based in London, UK.
[0] REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Stojanovski et al., arXiv 2025
• What: 100+ RL envs across 8 domains with configurable complexity.
• Why: RL is so back thanks to R1. More envs, more data, more RL.
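To make the "verifiable rewards" part concrete, here is a toy sketch in the same spirit (the function names and the arithmetic task are my own illustration, not the package's actual API): each environment procedurally generates tasks at a configurable difficulty and ships an exact checker, so the RL reward is computed rather than learned.

    import random

    def make_arithmetic_task(difficulty: int, seed: int):
        # Procedurally generate a task whose ground truth is known exactly.
        rng = random.Random(seed)
        terms = [rng.randint(1, 10 ** difficulty) for _ in range(difficulty + 1)]
        return " + ".join(map(str, terms)) + " = ?", sum(terms)

    def verifiable_reward(model_output: str, answer: int) -> float:
        # Binary reward from an exact-match check against the known answer.
        try:
            return 1.0 if int(model_output.strip()) == answer else 0.0
        except ValueError:
            return 0.0

    prompt, answer = make_arithmetic_task(difficulty=2, seed=0)
    print(prompt, verifiable_reward(str(answer), answer))  # reward 1.0 for the correct answer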
[1] PySpur: A visual playground for AI Agents
Kaddour et al., Github 2025
• What: A Python package with a UI for building and debugging agent scaffolds. Used by several enterprises.
• Why: Debugging long-running agents in a terminal gets cumbersome.
[2] Humanity's Last Exam
Phan et al., arXiv 2025
• What: A really hard multiple-choice science benchmark for LLMs.
• Why: Previous benchmarks got hill-climbed quickly, but this one will remain the last one standing (trust me, bro).
[3] BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Zhuo et al., ICLR 2025 (Oral)
• What: 1k+ diverse, multi-tool-use programming tasks in Python.
• Why: Other code benchmarks were too monotonous (e.g., Django) and lacked tool calls.
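Benchmarks like this are typically scored by executing the model's completion against the task's unit tests; a simplified harness (my own sketch, not BigCodeBench's evaluation code) looks roughly like this:

    import os, subprocess, tempfile

    def passes_tests(candidate_code: str, test_code: str, timeout: int = 10) -> bool:
        # Write the model's completion plus the task's unit tests to a file and
        # run them in a separate process; pass/fail becomes the benchmark score.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(candidate_code + "\n\n" + test_code)
            path = f.name
        try:
            result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
            return result.returncode == 0
        except subprocess.TimeoutExpired:
            return False
        finally:
            os.remove(path)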
[4] Are We Done with MMLU?
Gema et al., NAACL 2025
• What: We expose serious flaws in MMLU and release a smaller and cleaner version, MMLU-Redux.
• Why: MMLU is one of the most popular LLM benchmarks; better benchmarks, better models.
[5] Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models
Tyukin et al., arXiv 2024
• What: We can remove up to 33% of the attention layers in Llama2 with negligible performance loss.
• Why: Removing attention layers makes inference faster and cheaper.
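The idea, sketched on a generic pre-norm transformer block rather than the paper's Llama 2 code: removing an attention layer simply means skipping its residual branch, which shortens the forward pass; which blocks tolerate this is determined empirically.

    import torch.nn as nn

    class Block(nn.Module):
        # Pre-norm transformer block whose attention sub-layer can be disabled.
        def __init__(self, d_model: int, n_heads: int, use_attention: bool = True):
            super().__init__()
            self.use_attention = use_attention
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))

        def forward(self, x):
            if self.use_attention:  # dropping the layer = skipping this branch
                h = self.ln1(x)
                x = x + self.attn(h, h, h, need_weights=False)[0]
            return x + self.mlp(self.ln2(x))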
[6] Challenges and applications of large language models
Kaddour et al., arXiv 2023
• What: An opinionated review of 16 challenges for LLMs.
• Why: The field is moving fast, hard to keep up with what's worth solving.
• Trivia: This doc started as notes I took during an internship to teach myself about LLMs.
[7] Early Weight Averaging meets High Learning Rates for LLM Pre-training
Sanyal et al., COLM 2024, NeurIPS 2023 WANT
• What: We scale up LAWA (see below) to large models.
• Why: Large model training -> large batch sizes -> large LRs -> LAWA makes (even more) sense.
[8] Local LoRA: Memory-Efficient Fine-Tuning of Large Language Models
Key et al., NeurIPS 2023 WANT
• What: A method for fine-tuning an arbitrarily large model chunk by chunk (in isolation).
• Why: Allowing the GPU-poor to fine-tune some LLMs too.
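A rough sketch of the two ingredients as I read the abstract (not the paper's actual code): LoRA-style low-rank adapters, unfrozen for only one chunk of the network at a time, so the GPU only ever holds gradients and optimizer state for that chunk.

    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Frozen base weight plus a low-rank trainable update: W x + B A x.
        def __init__(self, base: nn.Linear, rank: int = 8):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False
            self.A = nn.Linear(base.in_features, rank, bias=False)
            self.B = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.B.weight)  # adapter starts as a no-op

        def forward(self, x):
            return self.base(x) + self.B(self.A(x))

    def set_trainable_chunk(chunks, idx):
        # Freeze everything, then unfreeze only the adapters of chunk `idx`, so
        # gradients and optimizer state exist for one chunk of layers at a time.
        for chunk in chunks:
            for p in chunk.parameters():
                p.requires_grad = False
        for m in chunks[idx].modules():
            if isinstance(m, LoRALinear):
                for p in list(m.A.parameters()) + list(m.B.parameters()):
                    p.requires_grad = True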
[9] Synthetic Data Generation in Low-Resource Settings via Fine-Tuning of Large Language Models
Kaddour and Liu, arXiv 2023
• What: Knowledge distillation via synthetic data generation after fine-tuning the teacher.
• Why: Teachers are more sample-efficient; by fine-tuning them, we can generate synthetic data for students.
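A hedged sketch of the recipe (the task, prompt, and model path below are illustrative placeholders, not the paper's setup): fine-tune the teacher on the small labelled set first, then have it synthesize additional labelled examples for training the student.

    from transformers import pipeline

    # Assume the teacher was already fine-tuned on the small labelled set;
    # "path/to/finetuned-teacher" is a placeholder for that checkpoint.
    teacher = pipeline("text-generation", model="path/to/finetuned-teacher")

    def generate_synthetic_examples(label: str, n: int = 8):
        # Prompt the fine-tuned teacher for new labelled examples; the student
        # is then trained on the real + synthetic mixture as usual.
        prompt = f"Write a short customer review with sentiment '{label}':\n"
        outputs = teacher(prompt, max_new_tokens=60, num_return_sequences=n,
                          do_sample=True, temperature=0.9)
        return [(o["generated_text"][len(prompt):].strip(), label) for o in outputs]

    synthetic = generate_synthetic_examples("positive") + generate_synthetic_examples("negative")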
[10] No train no gain: Revisiting efficient training algorithms for transformer-based language models
Kaddour et al., NeurIPS 2023
• What: A simple budget-aware LR scheduler outperforms most fancy efficient training methods.
• Why: Every day, there's a new efficient training algorithm; the ones we tried weren't that effective.
• Trivia: This started with trying some novel ideas that never outperformed our baseline, and then realizing that the baseline itself was quite competitive.
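The winning baseline is essentially a decay schedule whose horizon matches the actual compute budget; my own paraphrase with illustrative defaults (the paper's exact schedule and constants may differ):

    import math

    def budget_aware_lr(step: int, budget_steps: int, peak_lr: float,
                        warmup_steps: int = 1000, final_frac: float = 0.1):
        # Cosine decay whose horizon is the *actual* training budget, so the LR
        # hits its floor exactly when the budget runs out; a shorter budget means
        # a faster decay, rather than reusing a schedule tuned for a longer run.
        if step < warmup_steps:
            return peak_lr * step / max(1, warmup_steps)
        progress = min(1.0, (step - warmup_steps) / max(1, budget_steps - warmup_steps))
        return peak_lr * (final_frac + (1 - final_frac) * 0.5 * (1 + math.cos(math.pi * progress)))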
[11] Evaluating Self-Supervised Learning for Molecular Graph Embeddings
Wang et al., NeurIPS 2023
• What: A probing suite to profile molecular graph embeddings and evaluate graph self-supervised learning (GSSL) methods.
• Why: Downstream-only evaluations can be misleading; better probes yield more faithful assessments.
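The probing idea in one function (a generic linear probe, not the paper's full suite of probe tasks): train a simple classifier on frozen embeddings to check whether a given property is actually encoded in them, independently of any downstream pipeline.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def linear_probe_score(embeddings: np.ndarray, labels: np.ndarray) -> float:
        # If a linear classifier on frozen embeddings predicts the property well,
        # that property is (linearly) encoded in the representation.
        probe = LogisticRegression(max_iter=1000)
        return cross_val_score(probe, embeddings, labels, cv=5).mean()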
[12] Ttida: Controllable generative data augmentation via text-to-text and text-to-image models
Yin et al., arXiv 2023
• What: We generate synthetic training data for vision classification models.
• Why: You can think of it as knowledge distillation from generative to discriminative models.
• Trivia: This is sort of the training-equivalent of Spawrious (see below).
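Illustration of the text-to-image half only (the model id and prompt template are placeholders; TTIDA additionally uses a text-to-text model to diversify the prompts):

    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

    def synthesize_images(class_name: str, n: int = 4):
        # Generate labelled synthetic images to augment the classifier's training set.
        prompt = f"a photo of a {class_name}"
        return [pipe(prompt).images[0] for _ in range(n)]

    fake_dogs = synthesize_images("golden retriever")  # all carry the label "golden retriever"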
[13] Minipile: A Challenge for Data-Efficient Language Models
Kaddour, arXiv 2023
• What: Using embeddings and k-means, I construct a small and clean yet diverse pretraining corpus.
• Why: The Pile is too large for GPU-poor academics.
• Trivia: I reviewed examples of each k-means cluster during my daily tube commute.
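The pipeline in a few lines (the embedding model, cluster count, and excluded cluster ids below are placeholders, not the paper's exact choices): embed every document, cluster the embeddings, manually inspect a few examples per cluster, and drop the clusters that are clearly low-quality.

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    docs = ["first raw document ...", "second raw document ..."]  # stand-in for Pile documents

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    X = embedder.encode(docs, batch_size=256)
    clusters = KMeans(n_clusters=2, random_state=0).fit_predict(X)  # use many more clusters on the real corpus

    excluded = {1}  # clusters judged (by manual inspection) to be spam/boilerplate
    minipile = [d for d, c in zip(docs, clusters) if c not in excluded]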
[14] Spawrious: A Benchmark for Fine Control of Spurious Correlation Biases
Lynch et al., ICLR 2025 SCSL
• What: A vision dataset of cute dogs with spurious correlations between dog breeds and backgrounds.
• Why: Spurious correlations harm the reliability of vision models; previous benchmarks were too easy.
[15] When Do Flat Minima Optimizers Work?
Kaddour et al., NeurIPS 2022
• What: We can find even flatter minima than SAM by adding weight averaging.
• Why: SAM finds flat basins; WA finds flat points inside those basins.
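Sketch of the combination (a simplified SAM update plus a running weight average; not the paper's exact implementation, and rho is an illustrative default):

    import torch

    def sam_step(model, loss_fn, batch, base_opt, rho=0.05):
        params = [p for p in model.parameters() if p.requires_grad]
        # 1) the gradient at the current weights defines the adversarial perturbation
        loss_fn(model, batch).backward()
        grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
        with torch.no_grad():
            eps = [rho * p.grad / (grad_norm + 1e-12) for p in params]
            for p, e in zip(params, eps):  # climb to the worst-case nearby point
                p.add_(e)
        model.zero_grad()
        # 2) the gradient at the perturbed weights drives the actual update
        loss_fn(model, batch).backward()
        with torch.no_grad():
            for p, e in zip(params, eps):  # undo the perturbation
                p.sub_(e)
        base_opt.step()
        model.zero_grad()

    @torch.no_grad()
    def update_weight_average(avg_params, model, t):
        # Running average of the SAM iterates; evaluate the averaged weights at the end.
        for a, p in zip(avg_params, model.parameters()):
            a.mul_(t / (t + 1)).add_(p, alpha=1.0 / (t + 1))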
[16] Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging
Kaddour, NeurIPS 2022 HITY Workshop
• What: Surprisingly, LAtest Weight Averaging (LAWA), i.e., SWA in a FIFO fashion, is almost identical to decaying the LR.
• Why: Training runs can last for months; wouldn't it be nice to make better use of intermediate checkpoints?
• Trivia: NeurIPS folks thought I had a bug in my code, until it got confirmed by several other works.
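LAWA in a few lines (k and the checkpointing interval are hyperparameters; this is a paraphrase, not the paper's code): keep only the k most recent checkpoints in a FIFO buffer and average them, instead of averaging over the whole trajectory as in classic SWA.

    from collections import deque
    import torch

    class LAWA:
        def __init__(self, k: int = 5):
            self.buffer = deque(maxlen=k)  # FIFO: only the k latest checkpoints survive

        def update(self, model):
            self.buffer.append({name: t.detach().clone()
                                for name, t in model.state_dict().items()})

        def averaged_state_dict(self):
            # Uniform average over the buffered checkpoints; load this into a
            # copy of the model for evaluation.
            return {name: torch.stack([ckpt[name].float() for ckpt in self.buffer]).mean(0)
                    for name in self.buffer[0]}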
[17] Causal Machine Learning: A Survey and Open Problems
Kaddour et al., Foundations and Trends in Optimization 2022
• What: A survey of how causality can be applied to ML problems.
• Why: Causality lets you encode assumptions about the data-generating process in your model.
[18] Causal Effect Inference for Structured Treatments
Kaddour et al., NeurIPS 2021
• What: We generalize the Robinson decomposition to continuous vector treatments.
• Why: In medicine or economics, treatments are often continuous and multivariate.
• Trivia: Made me realize that causal inference research lacks meaningful benchmarks.
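For context, the scalar-treatment Robinson decomposition being generalized, with m(x) = E[Y | X = x] and e(x) = E[T | X = x]; the paper replaces the scalar effect term with one built from a representation of the structured treatment (see the paper for the exact form):

    % Partially linear model:  Y = \tau(X)\,T + g(X) + \varepsilon.
    % Subtracting the conditional means gives the Robinson decomposition
    Y - m(X) = \tau(X)\,\bigl(T - e(X)\bigr) + \varepsilon ,
    % which motivates the residual-on-residual ("R-learner") objective
    \hat{\tau} = \arg\min_{\tau} \sum_i \Bigl[ \bigl(Y_i - \hat{m}(X_i)\bigr)
        - \tau(X_i)\,\bigl(T_i - \hat{e}(X_i)\bigr) \Bigr]^2 .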
[19] Probabilistic Active Meta-Learning
Kaddour et al., NeurIPS 2020
• What: We make meta-learning more sample-efficient by letting the model guide the task selection.
• Why: Acquiring task-specific datasets can be expensive and slow. Let's make sure we make it worth it.
• Trivia: This was my Master's thesis while studying at the wonderful Imperial College London.
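A generic active task-selection loop in the same spirit (the distance-based acquisition below is only a stand-in; PAML's actual criterion operates on learned latent task variables):

    import numpy as np

    def select_next_task(candidate_embeddings: np.ndarray, acquired_embeddings: list) -> int:
        # Pick the candidate task that looks most novel relative to the tasks we
        # have already paid to collect data for (farthest from its nearest
        # acquired neighbour).
        if not acquired_embeddings:
            return 0
        acquired = np.asarray(acquired_embeddings)
        dists = np.linalg.norm(candidate_embeddings[:, None, :] - acquired[None, :, :], axis=-1)
        return int(np.argmax(dists.min(axis=1)))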
Bottom-up vs Top-down
Quick Intro to FlashMLA, DeepEP, DeepGEMM, DualPipe, EPLB, 3FS and Smallpond