Web guide for building & training GPT-2 124M (hibiki)

hibiki is a from-scratch GPT-2 124M training repo with a companion web UI (web/). It follows Andrej Karpathy’s walkthroughs “Let’s reproduce GPT-2 (124M)” and “Let’s build the GPT Tokenizer”, and pairs a Python training script (train_gpt2.py) with a Vite + React app for building commands, viewing diagrams, and reading study notes.

Repository

hibiki — GPT-2 124M from scratch

https://github.com/calicorone/hibiki

Python 3.10+ · train_gpt2.py · Vite+React web/ · KO/EN i18n

Home screen

The home screen uses a dark theme with gold accents. The navigation bar links to Overview, Build, Visualization, Learn, and Attention paper, and the header has a KO | EN language toggle.

[Screenshot: hibiki web UI home page (GPT-2 124M from scratch, Goals, Architecture), running locally via npm run dev on Vite's default port 5173]

Layout

Path	Role
Repo root	`train_gpt2.py`, model/data/training (Python 3.10+)
`web/`	Vite + React — command builder, pipeline/architecture views, study guide, Attention paper

Python environment

pip install -e .
# Optional: Hugging Face checkpoint checks, etc.
pip install -e ".[pretrained]"

Copy a generated command from the Build model page (/build), or consult train_gpt2.py --help for the available flags.

Web UI (`web/`)

cd web
npm install
npm run dev

Open http://localhost:5173. Production build: npm run build → web/dist/.

Main routes

Route	Content
`/`	Overview, architecture/optimizer summary, quick start
`/build`	Modes & hyperparameters → pasteable `train_gpt2.py` command
`/viz`	Pipeline, GPT-2 block diagram, four pillars (embed · attention · MLP · unembed), interactive tokenizer tab
`/learn`	Step-by-step implementation/training concepts
`/attention-is-all-you-need`	Attention Is All You Need notes, KaTeX + SVG, links to viz

The KO / EN toggle in the header persists the selected language in localStorage.

Feature recap

Build: overfit / pretrained / train modes, hyperparameters, one-line explanations per flag.
Learn: chapters 1–8 plus an appendix, covering sequences, Pre-LN, loss, backward, AdamW, …
Visualization: pipeline cards, 12 blocks, Tokenization tab (BPE, encode/decode diagrams, chapter links).
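
To make the BPE idea behind the Tokenization tab concrete, here is a minimal byte-level sketch in the spirit of Karpathy's tokenizer walkthrough. The function names (get_pair_counts, merge, train_bpe, decode) are illustrative and are not taken from hibiki's code; real GPT-2 tokenization also applies a regex pre-split, which this sketch leaves out.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token IDs occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Start from raw UTF-8 bytes and repeatedly merge the most frequent pair."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, in the order learned
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)
        new_id = 256 + step  # IDs 0..255 are reserved for single bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Rebuild the byte string for each ID, then decode back to text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():  # dicts keep insertion order
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(ids, decode(ids, merges))
```

Each merge shortens the sequence and grows the vocabulary by one entry, which is the trade-off the encode/decode diagrams illustrate.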

Stack

Front: React 18, TypeScript, Vite, React Router
Styling: CSS variables, dark global theme
i18n: custom ko/en + localStorage

The web app has no backend: training runs locally through train_gpt2.py.

Dev notes

After cloning, run cd web && npm install (node_modules is not committed).
.cursorrules captures model/coding guidance (Pre-LN, weight tying, Flash Attention, …).
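
Weight tying is also what brings the parameter count to roughly 124M. The sketch below works through that arithmetic, assuming the standard GPT-2 small configuration (12 layers, width 768, vocabulary 50257, context 1024); these values and the GPTConfig and count_params names are assumptions for illustration, not read from hibiki's train_gpt2.py.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    # Standard GPT-2 small values, assumed here rather than read from hibiki.
    n_layer: int = 12
    n_embd: int = 768
    vocab_size: int = 50257
    block_size: int = 1024


def count_params(cfg, tie_weights=True):
    d = cfg.n_embd
    wte = cfg.vocab_size * d           # token embedding table
    wpe = cfg.block_size * d           # learned positional embedding table
    layer_norm = 2 * d                 # scale and shift
    # Attention: fused QKV projection plus output projection, both with bias.
    attn = (d * 3 * d + 3 * d) + (d * d + d)
    # MLP: expand to 4*d, then project back to d, both with bias.
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
    block = 2 * layer_norm + attn + mlp
    total = wte + wpe + cfg.n_layer * block + layer_norm  # plus final LayerNorm
    if not tie_weights:
        total += cfg.vocab_size * d    # separate, bias-free output matrix
    return total


cfg = GPTConfig()
print(count_params(cfg, tie_weights=True))   # 124439808
print(count_params(cfg, tie_weights=False))  # 163037184
```

Under these assumptions, sharing the token embedding with the output layer saves about 38.6M parameters, nearly a third of the tied model.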

Audience

Anyone implementing GPT-2 124M from scratch with Karpathy-style guidance
Beginners tuning LR, batch size, gradient accumulation
Teams onboarding via Build → Learn → Visualization

How an LLM works (short)

An autoregressive LM predicts the next token given prior context. Conceptually:

Tokenizer → token embedding → positional/context encoding → Transformer blocks → logits → softmax → sampling → decode

A tokenizer maps text to subword IDs (often with BPE). Embeddings turn IDs into vectors; positional information encodes token order; self-attention mixes information across positions; a feed-forward network (FFN) refines each position independently; a final linear layer plus softmax yields next-token probabilities. Training minimizes the cross-entropy against the true next token; generation picks the next token greedily or samples it, typically with temperature scaling and top-k or top-p truncation.
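
The last stages of that pipeline (logits → softmax → loss or sampling) fit in a few lines. This is a dependency-free sketch over a plain list of logits, not hibiki's implementation; all function names are illustrative, and a real training script would use tensor operations instead.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """p_j = exp(z_j / T) / sum_k exp(z_k / T), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Training loss for one position: -log p(target)."""
    return -math.log(softmax(logits)[target])


def greedy(logits):
    """Always pick the highest-scoring token."""
    return max(range(len(logits)), key=lambda j: logits[j])


def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Keep the k highest logits, renormalize, then sample."""
    top = sorted(range(len(logits)), key=lambda j: logits[j], reverse=True)[:k]
    probs = softmax([logits[j] for j in top], temperature)
    return rng.choices(top, weights=probs, k=1)[0]


def sample_top_p(logits, p, temperature=1.0, rng=random):
    """Keep the smallest set of tokens whose probability mass reaches p."""
    probs = softmax(logits, temperature)
    order = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
    kept, mass = [], 0.0
    for j in order:
        kept.append(j)
        mass += probs[j]
        if mass >= p:
            break
    # random.choices normalizes the weights, so no explicit rescaling is needed.
    return rng.choices(kept, weights=[probs[j] for j in kept], k=1)[0]


logits = [2.0, 1.0, 0.1]
print(softmax(logits))
print(cross_entropy(logits, target=0))
print(greedy(logits), sample_top_k(logits, k=2), sample_top_p(logits, p=0.9))
```

A temperature below 1 sharpens the distribution toward the greedy choice, while a temperature above 1 flattens it; top-k and top-p then limit how far into the tail sampling can reach.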

For the full Korean walkthrough with tables, formulas ($p_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$, etc.), and section-by-section detail, see the Korean edition of this post.