EEGym — Sam Hutchinson

Abstract

We present EEGym, a standardized benchmark aggregating N400 ERP data from three independent labs (~46k trials across 98 subjects). We compare representations from GPT-2, Pythia, Qwen3, and a Sentence Gestalt model on their ability to predict human neural responses to language.

Key findings

LLM surprisal strongly predicts N400 amplitude, especially when computed from the final layer or from mid-to-late layers
Pre-trained base models outperform instruction-tuned models
Predictivity increases with model size (parameter count)
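
For reference, surprisal is the negative log probability of a word given its preceding context. The sketch below is a minimal illustration, assuming the next-token logits are already available as a plain list; the function name `surprisal_bits` and the toy logits are hypothetical and are not taken from the notebooks.

```python
import math

def surprisal_bits(logits, target_index):
    """Surprisal (in bits) of the token at target_index under a softmax over logits."""
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_prob = logits[target_index] - log_z  # natural-log probability of the target
    return -log_prob / math.log(2)           # convert nats to bits

# Toy example (made-up logits): four equally likely tokens, so p = 0.25 each.
uniform_logits = [0.0, 0.0, 0.0, 0.0]
print(surprisal_bits(uniform_logits, 2))
```

When a critical word is split into several subword tokens, a common convention is to sum the token surprisals; whether the pipeline does this is not stated on this page.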

Downloads

Data

combined_clean_n400.csv

N400 amplitudes · 98 subjects · ~46k trials

stims_for_modeling.csv

Sentences + critical words for modeling

Notebooks

pipeline.ipynb

Main analysis pipeline

lopopolo_pipeline.ipynb

Sentence Gestalt model pipeline

final_plots.ipynb

Figure reproduction

Figure 1 — Predictor comparison (ΔAIC vs. null model)

Higher ΔAIC = stronger prediction of N400 amplitude. Significance stars shown on hover. Use tabs to switch between predictors.
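
To make the ΔAIC metric concrete, the sketch below computes AIC(null) minus AIC(predictor model) for a single predictor. The regression models behind Figure 1 are not specified on this page; ordinary least squares with one predictor is used here only as a stand-in, and the function names and toy numbers are hypothetical.

```python
import math

def gaussian_aic(rss, n, k):
    """AIC of a Gaussian linear model from its residual sum of squares.

    k counts every estimated parameter, including the residual variance.
    """
    log_lik = -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)
    return 2 * k - 2 * log_lik

def delta_aic(x, y):
    """AIC(intercept-only null) minus AIC(intercept + slope on x)."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss_null = sum((yi - mean_y) ** 2 for yi in y)
    rss_full = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Null model: intercept + variance (k=2). Full model adds a slope (k=3).
    return gaussian_aic(rss_null, n, k=2) - gaussian_aic(rss_full, n, k=3)

# Toy data (made up for illustration): amplitude grows more negative with surprisal.
toy_surprisal = [1, 2, 3, 4, 5, 6]
toy_amplitude = [-0.9, -2.1, -2.8, -4.2, -4.9, -6.1]
print(delta_aic(toy_surprisal, toy_amplitude))
```

Under this sign convention a predictor that explains nothing scores ΔAIC = -2, because it pays the penalty for one extra parameter without improving the fit.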

Figure 2 — Layer-wise surprisal profile

ΔAIC as a function of relative transformer depth (0 = input layer, 1 = last layer). Only the GPT-2 and Pythia families are included, because intermediate-layer surprisal requires tuned-lens probes. Click legend entries to toggle models.
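
Relative depth lets models with different layer counts share one x-axis. A minimal sketch of one plausible normalization, assuming index 0 denotes the input (embedding) layer and index `n_layers` the last transformer block; the helper `relative_depth` is hypothetical and may differ from what the notebooks use.

```python
def relative_depth(layer_index, n_layers):
    """Map a layer index to [0, 1]: 0 is the input layer, 1 is the last layer."""
    if n_layers <= 0:
        raise ValueError("n_layers must be positive")
    if not 0 <= layer_index <= n_layers:
        raise ValueError("layer_index must be in [0, n_layers]")
    return layer_index / n_layers

# Layer 6 of a 12-block model and layer 12 of a 24-block model land at the same depth.
print(relative_depth(6, 12), relative_depth(12, 24))  # 0.5 0.5
```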

Figure 3 — Effect of model scale

Last-layer surprisal predictivity (ΔAIC) as a function of parameter count (log scale). Hover for model name and exact values.