Abstract
We present EEGym, a standardized benchmark aggregating N400 ERP data from three independent labs (~46k trials across 98 subjects). We compare representations from GPT-2, Pythia, Qwen3, and a Sentence Gestalt model on their ability to predict human neural responses to language.
Key findings
- LLM surprisal strongly predicts N400 amplitude, especially when computed from the final layer or from mid-to-late layers
- Pre-trained base models outperform instruction-tuned models
- Predictivity increases with model size
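For readers new to the predictor named in these findings: surprisal is the negative log-probability a language model assigns to the word that actually occurred. The following is a minimal sketch, assuming next-token logits are already available. The function name, the choice of bits as the unit, and the toy logits are illustrative assumptions, not EEGym's actual pipeline.

```python
import math

def surprisal_bits(logits, target_index):
    """Return -log2 p(target) given unnormalized next-token logits.

    Hypothetical helper for illustration; units (bits) are an assumption.
    """
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_p = logits[target_index] - log_z  # natural-log probability of target
    return -log_p / math.log(2)           # convert nats to bits

# Toy example: four equally likely tokens, so the target has p = 1/4.
print(surprisal_bits([0.0, 0.0, 0.0, 0.0], 2))
```

For words split into several sub-word tokens, a common convention is to sum the token surprisals; whether the benchmark does this is not stated here.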
Figure 1 — Predictor comparison (ΔAIC vs. null model)
Higher ΔAIC indicates stronger prediction of N400 amplitude. Significance stars are shown on hover; use the tabs to switch between predictors.
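To make the metric concrete, ΔAIC can be read as the AIC of the null model minus the AIC of the model that adds the predictor, so positive values favor the predictor. The sketch below is a deliberately simplified illustration using ordinary least squares with a single predictor and an intercept-only null. The benchmark's actual null model and regression structure are not specified in this section, and the function names are hypothetical.

```python
import math

def gaussian_aic(y, y_hat, n_params):
    """AIC = 2k - 2 log L for a Gaussian model with ML variance estimate."""
    n = len(y)
    rss = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    sigma2 = rss / n  # maximum-likelihood residual variance (must be > 0)
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return 2 * n_params - 2 * log_lik

def delta_aic(x, y):
    """AIC(null) - AIC(null + x); assumed null model is intercept-only."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Null model: intercept + residual variance = 2 parameters.
    aic_null = gaussian_aic(y, [mean_y] * n, n_params=2)
    # Full model: closed-form simple linear regression, 3 parameters.
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * a for a in x]
    aic_full = gaussian_aic(y, fitted, n_params=3)
    return aic_null - aic_full

# Synthetic toy data (not benchmark results): amplitude falls as surprisal rises.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [-1.1, -1.9, -3.2, -3.8, -5.1, -6.2, -6.8, -8.1]
print(delta_aic(x, y) > 0)
```

A predictor that explains nothing still costs one extra parameter, so in this sketch its ΔAIC is -2 rather than 0.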
Figure 2 — Layer-wise surprisal profile
ΔAIC as a function of relative transformer depth (0 = input layer, 1 = last layer). Only the GPT-2 and Pythia families are included, since this analysis requires tuned-lens probes. Click legend entries to toggle models.
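Relative depth lets models with different numbers of layers share one axis. One plausible way to compute it, assuming layer index 0 denotes the input (embedding) layer and index `n_layers` the last block, is shown below; the helper name and indexing convention are assumptions.

```python
def relative_depth(layer_index, n_layers):
    """Map a layer index in [0, n_layers] to a relative depth in [0, 1].

    Assumed convention: 0 = input layer, n_layers = last transformer block.
    """
    if n_layers <= 0:
        raise ValueError("n_layers must be positive")
    if not 0 <= layer_index <= n_layers:
        raise ValueError("layer_index out of range")
    return layer_index / n_layers

# GPT-2 small has 12 blocks and GPT-2 XL has 48; their midpoints align.
print(relative_depth(6, 12), relative_depth(24, 48))
```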
Figure 3 — Effect of model scale
Last-layer surprisal predictivity (ΔAIC) as a function of parameter count (log scale). Hover for model name and exact values.
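Plotting against a log-scaled parameter axis amounts to asking how much ΔAIC changes per tenfold increase in model size. As a rough sketch of how such a trend could be summarized, the hypothetical helper below fits a straight line to ΔAIC against log10(parameter count). The input values in the example are synthetic placeholders, not benchmark results.

```python
import math

def log_scale_trend(param_counts, scores):
    """Least-squares line of scores vs. log10(param_counts).

    Returns (slope, intercept); slope is the change in score per decade
    of parameters. Hypothetical helper for illustration only.
    """
    if len(param_counts) != len(scores) or len(scores) < 2:
        raise ValueError("need at least two (params, score) pairs")
    xs = [math.log10(p) for p in param_counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("parameter counts must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic placeholder values chosen so the trend is exactly linear.
slope, intercept = log_scale_trend([1e6, 1e7, 1e8], [10.0, 20.0, 30.0])
print(slope, intercept)
```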