hibiki is a from-scratch GPT-2 124M training repo with a companion web UI (web/). It follows Andrej Karpathy’s walkthroughs (Let’s reproduce GPT-2 (124M), Let’s build the GPT Tokenizer) and pairs a Python training script (train_gpt2.py) with a Vite + React app for building commands, viewing diagrams, and reading study notes.

Repository


Home screen

The home screen uses a dark UI with gold accents, a nav bar (Overview / Build / Visualization / Learn / Attention paper), and a KO | EN language toggle.

Figure: hibiki web UI home screen (GPT-2 124M from scratch, Goals, Architecture), running locally via npm run dev on Vite's default port 5173.


Layout

Path         Role
Repo root    train_gpt2.py, model/data/training (Python 3.10+)
web/         Vite + React: command builder, pipeline/architecture views, study guide, Attention paper

Python environment

pip install -e .
# Optional: Hugging Face checkpoint checks, etc.
pip install -e ".[pretrained]"

Copy generated commands from the Build model page or follow train_gpt2.py --help.


Web UI (web/)

cd web
npm install
npm run dev

Open http://localhost:5173. Production build: npm run build → web/dist/.

Main routes

Route                         Content
/                             Overview, architecture/optim summary, quick start
/build                        Modes & hyperparameters → pasteable train_gpt2.py command
/viz                          Pipeline, GPT-2 block diagram, four pillars (embed · attention · MLP · unembed), interactive tokenizer tab
/learn                        Step-by-step implementation/training concepts
/attention-is-all-you-need    Attention Is All You Need notes, KaTeX + SVG, links to viz

Header KO / EN persists language in localStorage.

Feature recap

  • Build: overfit / pretrained / train modes, hyperparameters, one-line explanations per flag.
  • Learn: chapters 1–8 + appendix—sequences, Pre-LN, loss, backward, AdamW, …
  • Visualization: pipeline cards, 12 blocks, Tokenization tab (BPE, encode/decode diagrams, chapter links).
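
The BPE idea behind the Tokenization tab can be sketched in a few lines of Python. This is a minimal byte-level illustration, not hibiki's tokenizer and not the GPT-2 vocabulary: the function names (pair_counts, merge, train_bpe, encode, decode) are hypothetical, and it skips the regex pre-splitting that the real GPT-2 tokenizer applies.

```python
from collections import Counter


def pair_counts(ids):
    """Count how often each adjacent pair of token IDs occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges on top of the 256 raw byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, in the order learned
    for step in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """Bytes -> token IDs by replaying merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Token IDs -> text by expanding each merged ID back into its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


text = "aaabdaaabac"
merges = train_bpe(text, num_merges=3)
ids = encode(text, merges)
```

Each merge shortens the sequence by folding the most frequent pair into one new ID, and decode(ids, merges) recovers the original string.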

Stack

  • Front: React 18, TypeScript, Vite, React Router
  • Styling: CSS variables, dark global theme
  • i18n: custom ko/en + localStorage

Training stays local via train_gpt2.py (no backend in the web app).


Dev notes

  • Run cd web && npm install after clone (node_modules not committed).
  • .cursorrules captures model/coding guidance (Pre-LN, weight tying, Flash Attention, …).
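
To make the weight-tying note concrete, the sketch below counts parameters for the standard GPT-2 small configuration (vocabulary 50257, context 1024, 12 layers, width 768). It assumes the usual GPT-2 layout (fused QKV projection, 4x MLP expansion, biases on linear layers and LayerNorm, no bias on the output head); the function name gpt2_param_count is hypothetical, and hibiki's own module layout may differ.

```python
def gpt2_param_count(vocab_size=50257, block_size=1024, n_layer=12,
                     n_embd=768, tie_weights=True):
    """Count parameters for a GPT-2 style decoder (assumed standard layout)."""
    wte = vocab_size * n_embd   # token embedding table
    wpe = block_size * n_embd   # learned positional embedding table
    layer_norm = 2 * n_embd     # gain + bias

    # Attention: fused QKV projection, then output projection (both with bias).
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back (both with bias).
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)

    block = 2 * layer_norm + attn + mlp
    total = wte + wpe + n_layer * block + layer_norm  # plus final LayerNorm

    if not tie_weights:
        # A separate output head duplicates the embedding matrix shape.
        total += vocab_size * n_embd
    return total


tied = gpt2_param_count()
untied = gpt2_param_count(tie_weights=False)
```

Under these assumptions the tied model has 124,439,808 parameters, which is where the "124M" name comes from. Untying the output head would add another 38,597,376.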

Audience

  • Anyone implementing GPT-2 124M from scratch with Karpathy-style guidance
  • Beginners tuning LR, batch size, gradient accumulation
  • Teams onboarding via Build → Learn → Visualization
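
For readers tuning batch size and gradient accumulation, the relationship is simple arithmetic: the number of micro-steps per optimizer update is the target batch size in tokens divided by the tokens processed per micro-step. The helper below is a hypothetical illustration, and the numbers are example values rather than hibiki's defaults.

```python
def grad_accum_steps(total_batch_tokens, micro_batch_size, seq_len, world_size=1):
    """Micro-steps to accumulate before one optimizer update.

    total_batch_tokens: desired tokens per optimizer step
    micro_batch_size:   sequences per forward/backward pass on one device
    seq_len:            tokens per sequence
    world_size:         number of devices running in data parallel
    """
    tokens_per_micro_step = micro_batch_size * seq_len * world_size
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError(
            "total_batch_tokens must be divisible by "
            "micro_batch_size * seq_len * world_size"
        )
    return total_batch_tokens // tokens_per_micro_step


# Example values only: ~0.5M tokens per step, 16 sequences of 1024 tokens.
single_gpu = grad_accum_steps(2**19, micro_batch_size=16, seq_len=1024)
eight_gpus = grad_accum_steps(2**19, micro_batch_size=64, seq_len=1024, world_size=8)
```

When accumulating, each micro-step's loss is typically divided by the number of accumulation steps so that the summed gradients match those of one large batch.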

How an LLM works (short)

An autoregressive LM predicts the next token given prior context. Conceptually:

Tokenizer → token embedding → positional/context encoding → Transformer blocks → logits → softmax → sampling → decode

The tokenizer maps text to subword IDs (often via BPE). Embeddings turn IDs into vectors, and positional information encodes token order. Self-attention mixes information across positions, and the feed-forward network (FFN) refines each position independently. A final linear layer plus softmax yields next-token probabilities. Training minimizes cross-entropy against the true next token; generation either picks the most likely token (greedy) or samples, typically with temperature scaling and top-k or top-p truncation.
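
The last stages of that pipeline (logits → softmax → loss or sampling) can be sketched with the standard library alone. This is an illustration on plain Python lists, not the repo's PyTorch code, and the function names are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens them."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Training loss for one position: -log p(target)."""
    return -math.log(softmax(logits)[target])


def greedy(logits):
    """Always pick the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Keep the k highest logits, renormalize, then sample one token ID."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top], temperature)
    return rng.choices(top, weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # toy scores over a 3-token vocabulary
probs = softmax(logits)
loss_if_target_is_0 = cross_entropy(logits, target=0)
next_token = sample_top_k(logits, k=2, rng=random.Random(0))
```

With k=1, top-k sampling reduces to greedy decoding. A uniform distribution over V tokens gives a loss of log V, a useful sanity check for an untrained model.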

For the full Korean walkthrough with tables, formulas ($p_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$, etc.), and section-by-section detail, see the Korean edition of this post.


References

Let’s reproduce GPT-2 (124M) — YouTube

Let’s build the GPT Tokenizer — YouTube

If commands feel heavy, follow Build → Learn → Visualization in the web app; flip KO / EN when sharing with a mixed team.