CH NEO-ZÜRICH EDITION
WEATHER · CLEAR 29°C
BLEND OF THE DAY · 07/ROGUE
EST. 2027
PAZ ACADEMY
THE AEC CYBER MORNING NEWS

PAZ Kaffi

DESIGN · DEMOLITION · CAFFEINE · DISPATCH
EDITION 0705 · 5 July 2026
BROADCAST 04:42 CET
2,400 BROADSHEETS PRINTED
READ TIME · 47 MIN
Build a Vintage LLM for the Price of Lunch: A 1900-Locked Model, From Scratch
ACADEMY
FRAME · 06:55
05-07-2026

How croqaz trained a 340M-param, 1900-cutoff LLM for ~$80 — a clone-and-run template for an atelier's own domain-locked house model. Data curation is the job.

A developer who goes by croqaz spent three months — every single day, sick days included — building a language model that has never heard of the First World War. It is called Vintage LLM: English-only, Llama architecture, 340 million parameters, and a deliberate knowledge cutoff of the year 1900. Total bill for the whole adventure: about $80 in rented GPU time. That is not a typo, and it is the most interesting number in this story.

The honest framing matters. “From scratch” here means croqaz wrote his own data-processing pipelines, base-training scripts, fine-tuning scripts, and custom datasets — not that he hand-rolled matrix multiply in Assembly. He used PyTorch and existing tooling like everyone else, then hand-validated every function he half-vibe-coded with an assistant. The lineage is open too: he credits Hayk Grigorian, whose model trained only on 1800s London texts (a 90GB corpus) famously surfaced a real 1834 street protest straight out of the training data. There is now a small genre of these — GPT-1900, Mr. Chatterbox, TypewriterLM — historians-by-accident, each bounded to an era to study its voice and its blind spots.

←TODAY: In 2026 a single architect can train a domain-locked LLM for the cost of a team lunch, on rented cloud GPUs. →3012: Every atelier keeps its own house model, trained on its own canon, openly inspectable, outliving the studio that made it. Fulcrum: The model is only worth keeping if a 25-year-old can still open the file when the vendor is gone — which is exactly why you build it on open weights and public-domain data now.

The Tool: The project is Vintage LLM by croqaz (github.com/croqaz/vintage-LLM), with the trained base model published openly on HuggingFace. It is worth a computational designer’s afternoon for one reason: it is the smallest honest end-to-end example of the full stack — curate data, train a tokenizer, base-train, fine-tune — that you can actually run and read in one sitting. Pair it with Andrej Karpathy’s nanoGPT and zero-to-hero materials and you have the de-facto curriculum for “how is this thing actually built.”

Setup:

git clone https://github.com/croqaz/vintage-LLM
cd vintage-LLM
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# pull the published 340M base model into the local Hugging Face cache
python -c "from huggingface_hub import snapshot_download; \
  snapshot_download('croqaz/vintage-LLM-340m-v1-base')"

First steps:

  1. Read the repo’s data-processing scripts first, not the training code. That is where the real work lives — croqaz says dataset processing and in-memory de-duplication ate the most time and RAM.
  2. Run his small pre-training config on the toy model (the pythia-14m-class experiment) on your own machine before renting anything. Watch the validation-loss curve.
  3. Change ONE thing — randomise your file chunks before tokenizing. croqaz’s early loss curves spiked precisely because clean books were tokenized in alphabetical order and the noisy OCR set arrived last. Order poisons training.
  4. Only then rent a GPU (RunPod, Vast.ai) for the larger run. Compute is the cheap commodity; your curated data is the asset.
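The de-duplication and shuffling in steps 1 and 3 can be sketched in a few lines of standard-library Python. This is a minimal illustration, not croqaz's actual pipeline: the function names `dedupe_exact` and `shuffle_chunks`, the seed value, and the choice of exact-match hashing (rather than fuzzy or near-duplicate detection) are all assumptions made for the example.

```python
import hashlib
import random


def dedupe_exact(chunks):
    """Keep the first copy of each chunk; compare on whitespace-normalised, lowercased text."""
    seen = set()
    unique = []
    for chunk in chunks:
        normalised = " ".join(chunk.split()).lower()
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


def shuffle_chunks(chunks, seed=1900):
    """Return a shuffled copy so clean books and noisy OCR text end up interleaved."""
    shuffled = list(chunks)  # copy, so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps runs reproducible
    return shuffled


# Hypothetical toy input: two near-identical lines plus one noisy OCR line.
sample = ["A  tale of two cities", "a tale of two cities", "N0isy 0CR page"]
ready_for_tokenizer = shuffle_chunks(dedupe_exact(sample))
```

Holding every hash in a Python set is the simplest in-memory approach, and it is also the kind of thing that eats RAM on a large corpus, which matches the complaint in step 1.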

Atelier: Forget Victorian chatbots for a second — swap the corpus. A Swiss studio could train the same 340M shell on its own archive: every competition entry, every Bauleitung note, every detail drawing’s text, every BEP. Not to replace anyone, but to have a model that speaks the house’s language and survives the next software migration. The method (curate → dedupe → train small → fine-tune) is identical to the IFC and BIM data hygiene PAZ teams already grind through. The lesson is the same one we preach on the site: data curation, not model code, is the job. Bad data in, confident nonsense out.
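As a rough sketch of the "curate" step for a studio archive, the snippet below walks a folder of plain-text exports and writes one JSON record per usable file. The folder layout, the `collect_corpus` name, the `.txt`-only filter, and the 200-character minimum are illustrative assumptions, not part of the Vintage LLM repo.

```python
import json
from pathlib import Path


def collect_corpus(archive_dir, out_path, min_chars=200):
    """Write one JSON line per usable .txt file under archive_dir; return how many were kept."""
    archive = Path(archive_dir)
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(archive.rglob("*.txt")):
            # Replace undecodable bytes instead of crashing on one bad export.
            text = path.read_text(encoding="utf-8", errors="replace")
            text = " ".join(text.split())  # collapse runs of whitespace
            if len(text) < min_chars:
                continue  # skip stubs and near-empty files
            record = {"source": str(path.relative_to(archive)), "text": text}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            kept += 1
    return kept
```

Keeping the relative source path in every record means a bad training sample can later be traced back to the file it came from.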

Hack: Load the time-locked model and hear the year 1900 answer back. It takes a three-line inference call against the published weights.

from transformers import pipeline
gen = pipeline("text-generation", model="croqaz/vintage-LLM-340m-v1-base")
print(gen("The new railway from London", max_new_tokens=40)[0]["generated_text"])

One intention: prove to yourself that a 0.3B model, trained for the price of lunch, holds a coherent worldview — a narrow, dated, occasionally wrong one, because croqaz ran zero alignment on purpose to keep the period accuracy intact. That trade-off is the whole point: a censored 1900 is no longer 1900.
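One way to test the "never heard of it" claim for yourself is to scan generated text for words that only came into use after the cutoff. The sketch below is an assumption-laden illustration: the watch-list is a tiny hand-picked sample, `find_anachronisms` and `probe` are hypothetical helper names, and `generate` stands for any callable that maps a prompt to text.

```python
import re

# Illustrative watch-list only: terms that came into use well after 1900.
POST_1900_TERMS = ("smartphone", "internet", "radar", "penicillin")


def find_anachronisms(text, terms=POST_1900_TERMS):
    """Return the watch-list terms that appear as whole words in text."""
    lowered = text.lower()
    return [t for t in terms if re.search(r"\b" + re.escape(t) + r"\b", lowered)]


def probe(generate, prompts, terms=POST_1900_TERMS):
    """Run each prompt through generate (prompt -> text) and report any anachronisms."""
    return {prompt: find_anachronisms(generate(prompt), terms) for prompt in prompts}
```

With the pipeline from the Hack, `generate` could be a small wrapper such as `lambda p: gen(p, max_new_tokens=40)[0]["generated_text"]`. An empty list for every prompt is encouraging, but it proves little on its own: a short watch-list cannot show that nothing leaked.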

Here is the part that should reach past the hobby. The buildings that aged worst in my time were never the ugly ones; they were the ones nobody could repair because the proprietary format went dark. A model is a file too. croqaz built his on open weights and public-domain text (work published before 1900 is, in almost all cases, out of copyright, which makes it a clean illustration of lawful training data under EU rules). That is not nostalgia; that is the only version of this that a 25-year-old can still open in 2051. If you train a house model this quarter, ask the one question that changed how my generation built: when the vendor disappears, can someone still load the file?

Learn-it:

  • The project repo: github.com/croqaz/vintage-LLM — croqaz’s full pipeline, data scripts first.
  • The write-up: Making a vintage LLM from scratch — the honest, mistake-by-mistake build log.
  • Build-it-yourself course: rasbt/LLMs-from-scratch — Sebastian Raschka’s step-by-step PyTorch implementation.
  • PAZ note: the same curate-and-dedupe discipline that wins here is the discipline that keeps your IFC round-trips clean — treat your atelier’s text archive as a trainable corpus, not a dead folder.

Clone the repo this week, read the data scripts before the model code, and run the three-line Hack to feel what an $80 worldview sounds like.

FILED FROM
CO-SIGNERS
PAZ Academy
CONFIDENCE
HIGH
REPRINTS
© PAZ - PARAMETRIC ACADEMY ZURICH · ALL RIGHTS RESERVED

PAZ Kaffi · multidisciplinary editorial, led by PAZ Academy

