Teaching LLMs to Teach

compute-efficient, pedagogically grounded math tutoring

January–March 2026

PyTorch · NLP · LLMs · Fine-tuning

Project poster: Adaptive Test-Time Compute for Pedagogically Grounded Reasoning in LLMs, with teammates Medhya Goel and Jason Chon

With two teammates, I worked on making large language models better, cheaper math tutors. Our paper, Adaptive Test-Time Compute for Pedagogically Grounded Reasoning in LLMs, started from a tension: the models that explain math well are usually too large and expensive for widespread educational use. We studied compute-efficient ways to raise the pedagogical quality of smaller open-source models.

We framed it as a teacher–student setup on the MATH dataset: a stronger model writes an explanation, and a weaker “student” model tries to solve the problem using it. That let us measure explanations by whether they actually help someone reach the right answer, not just whether the explanation itself is correct. It is a close cousin of knowledge distillation, where a stronger model’s outputs are used to improve a weaker one, except here we distill for explanation quality and downstream learning rather than raw accuracy.
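To make the measurement concrete, here is a minimal sketch of the evaluation loop described above. The `Teacher` and `Student` callables, the `student_accuracy` helper, and exact-match grading are all assumptions made for illustration, not the project's actual code; real MATH answers would likely need normalization before comparison.

```python
from typing import Callable, Optional, Sequence

# Hypothetical interfaces (assumed for this sketch): a teacher maps a problem
# to an explanation; a student maps (problem, optional explanation) to an answer.
Teacher = Callable[[str], str]
Student = Callable[[str, Optional[str]], str]


def student_accuracy(
    problems: Sequence[str],
    gold_answers: Sequence[str],
    student: Student,
    teacher: Optional[Teacher] = None,
) -> float:
    """Fraction of problems the student solves, with or without a teacher."""
    if len(problems) != len(gold_answers) or not problems:
        raise ValueError("need a non-empty, aligned list of problems and answers")
    correct = 0
    for problem, gold in zip(problems, gold_answers):
        # With teacher=None this measures the no-explanation baseline.
        explanation = teacher(problem) if teacher is not None else None
        prediction = student(problem, explanation)
        # Assumption: exact string match after trimming whitespace.
        correct += int(prediction.strip() == gold.strip())
    return correct / len(problems)
```

Calling `student_accuracy` once with a teacher and once without gives the two numbers being compared: the explanation is credited only for the change in the student's results, not for being correct on its own.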

Our first approach used token-level entropy as a signal of the teacher’s uncertainty. Instead of spending a fixed compute budget on every problem, we regenerated an explanation only when the model looked unsure, allocating test-time compute where it was actually needed.
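The gating idea can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the `generate` callable, the mean-entropy aggregation, the `threshold` value, the `max_attempts` cap, and the choice to keep the lowest-entropy draft are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical generator (assumed): returns the explanation text plus the
# next-token probability distribution at each generated position.
Generator = Callable[[str], Tuple[str, List[Sequence[float]]]]


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(distributions: Sequence[Sequence[float]]) -> float:
    """Average per-token entropy over a generated explanation."""
    if not distributions:
        raise ValueError("explanation has no tokens")
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def entropy_gated_explanation(
    problem: str,
    generate: Generator,
    threshold: float,
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Regenerate only while the teacher looks unsure; return (text, attempts)."""
    best_text, best_entropy = "", float("inf")
    attempts = 0
    for attempts in range(1, max_attempts + 1):
        text, distributions = generate(problem)
        entropy = mean_token_entropy(distributions)
        # Assumption: keep the most confident draft seen so far.
        if entropy < best_entropy:
            best_text, best_entropy = text, entropy
        # Confident enough: stop and spend no further compute on this problem.
        if entropy <= threshold:
            break
    return best_text, attempts
```

Problems where the first draft is already low-entropy cost a single generation, so the extra budget goes only to the uncertain cases.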

Student accuracy across the entropy-gated strategies

Teacher explanations significantly improved student accuracy over a no-explanation baseline, and the entropy-gated approach still improved results while spending compute selectively rather than uniformly.

Our second approach fine-tuned a verifier: a model that predicts whether an explanation will lead to downstream student success, optimizing for pedagogical usefulness rather than just answer correctness. This avoided the complex, expensive training setups verifiers usually require.
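One way to picture the verifier's training signal is below. This is a hedged sketch: the `VerifierExample` record, the `build_verifier_examples` and `verifier_accuracy` helpers, and the 0.5 decision threshold are illustrative assumptions rather than the project's actual data pipeline. The point it shows is that the label comes from the student's outcome, not from whether the explanation is itself correct.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple


@dataclass
class VerifierExample:
    problem: str
    explanation: str
    label: int  # 1 if the student answered correctly after reading the explanation


def build_verifier_examples(
    records: Iterable[Tuple[str, str, str, str]],
) -> List[VerifierExample]:
    """Turn (problem, explanation, student_answer, gold_answer) into labeled data."""
    examples = []
    for problem, explanation, student_answer, gold_answer in records:
        # Assumption: exact string match decides student success.
        helped = int(student_answer.strip() == gold_answer.strip())
        examples.append(VerifierExample(problem, explanation, helped))
    return examples


def verifier_accuracy(
    predicted_probs: Sequence[float],
    labels: Sequence[int],
    threshold: float = 0.5,
) -> float:
    """Share of examples where the thresholded prediction matches the label."""
    if len(predicted_probs) != len(labels) or not labels:
        raise ValueError("need a non-empty, aligned list of predictions and labels")
    hits = sum(int((p >= threshold) == bool(y)) for p, y in zip(predicted_probs, labels))
    return hits / len(labels)
```

A verifier fine-tuned on examples like these would output the probabilities that `verifier_accuracy` scores against held-out student outcomes.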

Metrics related to the predictions of the final verifier: verifier training curves under different input settings

Verifier training curves under different input settings: how well the verifier predicts which explanations will actually help

The fine-tuned verifier reached 71.88% test accuracy, substantially outperforming a prompted baseline. Together, adaptive compute and pedagogically grounded verification point toward a way to get better explanations out of smaller, cheaper models, which is what it would take to make good tutoring genuinely accessible.
