Web guide for building & training GPT-2 124M (hibiki)
A web guide to building and training GPT-2 124M
hibiki is a from-scratch GPT-2 124M training repo with a companion web UI (web/). It follows Andrej Karpathy’s walkthroughs (Let’s reproduce GPT-2 (124M), Let’s build the GPT Tokenizer) and pairs Python training (train_gpt2.py) with a Vite + React app for commands, diagrams, and study notes.
Repository
Home screen
Dark UI, gold accents, Overview / Build / Visualization / Learn / Attention paper nav, plus a KO | EN toggle.

hibiki web UI — local npm run dev (Vite default port 5173)
Layout
| Path | Role |
|---|---|
| Repo root | train_gpt2.py, model/data/training (Python 3.10+) |
| web/ | Vite + React: command builder, pipeline/architecture views, study guide, Attention paper |
Python environment
pip install -e .
# Optional: Hugging Face checkpoint checks, etc.
pip install -e ".[pretrained]"
Copy generated commands from the Build model page or follow train_gpt2.py --help.
Web UI (web/)
cd web
npm install
npm run dev
Open http://localhost:5173. Production build: npm run build → web/dist/.
Main routes
| Route | Content |
|---|---|
| / | Overview, architecture/optim summary, quick start |
| /build | Modes & hyperparameters → pasteable train_gpt2.py command |
| /viz | Pipeline, GPT-2 block diagram, four pillars (embed · attention · MLP · unembed), interactive tokenizer tab |
| /learn | Step-by-step implementation/training concepts |
| /attention-is-all-you-need | Attention Is All You Need notes, KaTeX + SVG, links to viz |
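The four pillars named in the /viz row (embed, attention, MLP, unembed) account for essentially all of the model's parameters. As a rough sketch, the count below assumes the standard GPT-2 small shape (12 layers, width 768, vocabulary 50257, context 1024) and weight tying between the token embedding and the output projection. The `GPTConfig` and `count_params` names are illustrative and are not taken from the repo.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    # Assumed GPT-2 small hyperparameters (illustrative, not read from the repo).
    block_size: int = 1024   # maximum context length
    vocab_size: int = 50257  # GPT-2 BPE vocabulary size
    n_layer: int = 12        # number of Transformer blocks
    n_embd: int = 768        # model width


def count_params(cfg: GPTConfig) -> dict:
    """Count parameters per pillar, assuming tied input/output embeddings."""
    d = cfg.n_embd
    # Token embedding table plus learned positional embedding table.
    embed = cfg.vocab_size * d + cfg.block_size * d
    # Each LayerNorm has a scale and a bias vector.
    layer_norm = 2 * d
    # Attention: fused QKV projection (d -> 3d) and output projection (d -> d), with biases.
    attention = (d * 3 * d + 3 * d) + (d * d + d)
    # MLP: expand (d -> 4d) and contract (4d -> d), with biases.
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
    block = 2 * layer_norm + attention + mlp
    # With weight tying, the unembedding reuses the token embedding matrix.
    unembed = 0
    total = embed + cfg.n_layer * block + layer_norm + unembed
    return {
        "embed": embed,
        "attention": cfg.n_layer * attention,
        "mlp": cfg.n_layer * mlp,
        "layer_norms": cfg.n_layer * 2 * layer_norm + layer_norm,
        "unembed": unembed,
        "total": total,
    }


if __name__ == "__main__":
    for name, n in count_params(GPTConfig()).items():
        print(f"{name:12s} {n:>12,d}")
```

Under these assumptions the total comes to roughly 124M, which is where the model's name comes from. Without weight tying, the unembedding would add another vocab_size × n_embd parameters.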
Header KO / EN persists language in localStorage.
Feature recap
- Build: overfit / pretrained / train modes, hyperparameters, one-line explanations per flag.
- Learn: chapters 1–8 plus an appendix, covering sequences, Pre-LN, loss, backward, AdamW, …
- Visualization: pipeline cards, 12 blocks, Tokenization tab (BPE, encode/decode diagrams, chapter links).
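To make the BPE idea behind the Tokenization tab concrete, here is a minimal byte-level sketch in the spirit of Karpathy's tokenizer walkthrough. It is not the GPT-2 tokenizer itself (no regex pre-splitting, no pretrained merges), and the function names are illustrative rather than taken from the repo.

```python
from collections import Counter


def get_stats(ids):
    """Count how often each adjacent pair of token IDs occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges on top of the 256 raw byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, kept in merge order
    for step in range(num_merges):
        stats = get_stats(ids)
        if not stats:
            break  # nothing left to merge
        pair = max(stats, key=stats.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Expand token IDs back to bytes, then to text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():  # dicts preserve merge order
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    text = "aaabdaaabac"
    ids, merges = train_bpe(text, num_merges=3)
    print(ids, merges)
    print(decode(ids, merges) == text)
```

Each merge shortens the sequence while growing the vocabulary by one entry; decoding simply replays the merges in the order they were learned.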
Stack
- Front: React 18, TypeScript, Vite, React Router
- Styling: CSS variables, dark global theme
- i18n: custom ko/en + localStorage
Training stays local via train_gpt2.py (no backend in the web app).
Dev notes
- Run cd web && npm install after clone (node_modules is not committed).
- .cursorrules captures model/coding guidance (Pre-LN, weight tying, Flash Attention, …).
Audience
- Anyone implementing GPT-2 124M from scratch with Karpathy-style guidance
- Beginners tuning LR, batch size, gradient accumulation
- Teams onboarding via Build → Learn → Visualization
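For readers tuning LR, batch size, and gradient accumulation, the sketch below shows how those knobs typically relate. It assumes a token-denominated total batch size and a linear-warmup plus cosine-decay schedule, as in Karpathy-style training loops. The function names and the example numbers are illustrative assumptions, not flags or defaults of train_gpt2.py.

```python
import math


def grad_accum_steps(total_batch_tokens, micro_batch_size, seq_len, world_size=1):
    """Micro-steps needed per optimizer step to reach the target token batch."""
    tokens_per_micro_step = micro_batch_size * seq_len * world_size
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError("total_batch_tokens must be divisible by B * T * world_size")
    return total_batch_tokens // tokens_per_micro_step


def lr_at(step, max_lr, min_lr, warmup_steps, max_steps):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:
        return min_lr
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # goes from 1 down to 0
    return min_lr + coeff * (max_lr - min_lr)


if __name__ == "__main__":
    # Illustrative values only: about 0.5M tokens per step, micro-batch 16 x 1024.
    print(grad_accum_steps(524_288, micro_batch_size=16, seq_len=1024))
    for step in (0, 9, 10, 30, 50):
        print(step, lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=10, max_steps=50))
```

Halving the micro-batch size doubles the number of accumulation steps but leaves the effective batch unchanged, so the learning rate usually does not need retuning when the micro-batch is adjusted only to fit memory.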
How an LLM works (short)
An autoregressive LM predicts the next token given prior context. Conceptually:
Tokenizer → token embedding → positional/context encoding → Transformer blocks → logits → softmax → sampling → decode
The tokenizer maps text to subword IDs (often BPE). Embeddings turn IDs into vectors; positional information encodes token order; self-attention mixes token representations across positions; the FFN refines each position independently; a final linear layer + softmax yields next-token probabilities. Training minimizes cross-entropy against the true next token; generation either decodes greedily or samples, typically with temperature, top-k, or top-p.
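The last stages of that pipeline (logits → softmax → loss or sampling) can be sketched in plain Python. This is a toy illustration on a hand-written logit vector, not the repo's implementation; the function names are hypothetical, and a real training loop would use tensor operations instead.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Training loss for one position: negative log-probability of the true token."""
    return -math.log(softmax(logits)[target])


def greedy(logits):
    """Pick the single most likely token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Keep the k highest-scoring tokens, renormalize, and sample one of them."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top], temperature)
    return rng.choices(top, weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [1.0, 3.0, 2.0, -1.0]  # toy scores over a 4-token vocabulary
    print(softmax(logits))
    print(cross_entropy(logits, target=1))
    print(greedy(logits), sample_top_k(logits, k=2, temperature=0.8))
```

Top-p sampling follows the same pattern, except the kept set is the smallest group of tokens whose cumulative probability reaches p rather than a fixed count k.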
For the full Korean walkthrough with tables, formulas ($p_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$, etc.), and section-by-section detail, see the Korean edition of this post.
References
Let’s reproduce GPT-2 (124M) — YouTube
Let’s build the GPT Tokenizer — YouTube
If commands feel heavy, follow Build → Learn → Visualization in the web app; flip KO / EN when sharing with a mixed team.