EEGym: a novel benchmark for comparing language model representations and human N400 responses

Sam Hutchinson  ·  Jenny Baek  ·  Ali Cy
MIT 6.8610 Grad NLP

Abstract

We present EEGym, a standardized benchmark aggregating N400 ERP data from three independent labs (~46k trials across 98 subjects). We compare representations from GPT-2, Pythia, Qwen3, and a Sentence Gestalt model on their ability to predict human neural responses to language.

Key findings

  • LLM surprisal strongly predicts N400 amplitude, especially surprisal computed from the final layer and from mid-to-late layers
  • Pre-trained base models outperform instruction-tuned models
  • Predictivity scales with model size
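
For readers unfamiliar with the predictor, surprisal is the negative log-probability a model assigns to a word given its context. The sketch below computes it from a vector of next-token logits. It is only an illustration: the function name `surprisal_bits`, the choice of bits as the unit, and the toy logits are assumptions, not details taken from the EEGym pipeline.

```python
import math

def surprisal_bits(logits, target_index):
    """Return -log2 p(target | context) given unnormalized next-token logits.

    Hypothetical helper for illustration; not part of EEGym.
    """
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_p = logits[target_index] - log_z  # natural-log probability of the target
    return -log_p / math.log(2)           # convert nats to bits

# Toy logits: a uniform distribution over 4 tokens, i.e. 2 bits per token.
print(surprisal_bits([0.0, 0.0, 0.0, 0.0], 2))
```

Words that span several tokens would need their token surprisals summed; how EEGym handles that case is not stated here.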

Figure 1 — Predictor comparison (ΔAIC vs. null model)

Higher ΔAIC = stronger prediction of N400 amplitude. Significance stars shown on hover. Use tabs to switch between predictors.
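
As a rough guide to how such a score can be computed, the sketch below uses the standard definition AIC = 2k − 2 ln L and takes ΔAIC as the null model's AIC minus the full model's AIC, so that higher is better, matching the caption. It assumes a Gaussian likelihood evaluated on model residuals; the regression family actually used in EEGym is not specified on this page, and the function names and toy residuals are hypothetical.

```python
import math

def gaussian_aic(residuals, n_params):
    """AIC = 2k - 2 ln L for a Gaussian model with maximum-likelihood variance.

    Assumes at least one nonzero residual (otherwise the variance is zero).
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    sigma2 = rss / n  # ML estimate of the residual variance
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return 2 * n_params - 2 * log_lik

def delta_aic(null_residuals, k_null, full_residuals, k_full):
    """AIC(null) - AIC(full); positive values favor the full model."""
    return gaussian_aic(null_residuals, k_null) - gaussian_aic(full_residuals, k_full)

# Toy residuals (not EEGym data): the full model halves every residual
# at the cost of one extra parameter.
null_res = [1.0, -1.0, 1.0, -1.0]
full_res = [0.5, -0.5, 0.5, -0.5]
print(delta_aic(null_res, 2, full_res, 3))
```

Because AIC penalizes each added parameter by 2, a predictor only earns a positive ΔAIC if it improves the log-likelihood by more than one unit per parameter.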

Figure 2 — Layer-wise surprisal profile

ΔAIC as a function of relative transformer depth (0 = input layer, 1 = last layer). Only the GPT-2 and Pythia families are included, because layer-wise surprisal requires tuned-lens probes. Click legend entries to toggle models.
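
Relative depth lets models with different layer counts share one axis. A minimal sketch of the normalization, assuming layer index 0 denotes the input embeddings and index `n_layers` the output of the last transformer block (this indexing convention and the function name are assumptions, not taken from the EEGym code):

```python
def relative_depth(layer_index, n_layers):
    """Map a layer index in [0, n_layers] to a relative depth in [0, 1]."""
    if n_layers < 1:
        raise ValueError("n_layers must be at least 1")
    if not 0 <= layer_index <= n_layers:
        raise ValueError("layer_index must lie in [0, n_layers]")
    return layer_index / n_layers

# Example: layer 6 of a 12-block model (such as GPT-2 small) sits at mid-depth.
print(relative_depth(6, 12))
```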

Figure 3 — Effect of model scale

Last-layer surprisal predictivity (ΔAIC) as a function of parameter count (log scale). Hover for model name and exact values.