Build a Vintage LLM for the Price of Lunch: A 1900-Locked Model, From Scratch
How croqaz trained a 340M-param, 1900-cutoff LLM for ~$80 — a clone-and-run template for an atelier's own domain-locked house model. Data curation is the job.
A developer who goes by croqaz spent three months — every single day, sick days included — building a language model that has never heard of the First World War. It is called Vintage LLM: English-only, Llama architecture, 340 million parameters, and a deliberate knowledge cutoff of the year 1900. Total bill for the whole adventure: about $80 in rented GPU time. That is not a typo, and it is the most interesting number in this story.
The honest framing matters. “From scratch” here means croqaz wrote his own data-processing pipelines, base-training scripts, fine-tuning scripts, and custom datasets — not that he hand-rolled matrix multiply in Assembly. He used PyTorch and existing tooling like everyone else, then hand-validated every function he half-vibe-coded with an assistant. The lineage is open too: he credits Hayk Grigorian, whose model trained only on 1800s London texts (a 90GB corpus) famously surfaced a real 1834 street protest straight out of the training data. There is now a small genre of these — GPT-1900, Mr. Chatterbox, TypewriterLM — historians-by-accident, each bounded to an era to study its voice and its blind spots.
←TODAY: In 2026 a single architect can train a domain-locked LLM for the cost of a team lunch, on rented cloud GPUs. →3012: Every atelier keeps its own house model, trained on its own canon, openly inspectable, outliving the studio that made it. Fulcrum: The model is only worth keeping if a 25-year-old can still open the file when the vendor is gone — which is exactly why you build it on open weights and public-domain data now.
The Tool: The project is Vintage LLM by croqaz (github.com/croqaz/vintage-LLM), with the trained base model published openly on HuggingFace. It is worth a computational designer’s afternoon for one reason: it is the smallest honest end-to-end example of the full stack — curate data, train a tokenizer, base-train, fine-tune — that you can actually run and read in one sitting. Pair it with Andrej Karpathy’s nanoGPT and zero-to-hero materials and you have the de-facto curriculum for “how is this thing actually built.”
Setup:
git clone https://github.com/croqaz/vintage-LLM
cd vintage-LLM
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# pull the published 340M base model and print where it landed, so you can inspect the files
python -c "from huggingface_hub import snapshot_download; \
print(snapshot_download('croqaz/vintage-LLM-340m-v1-base'))"
First steps:
- Read the repo’s data-processing scripts first, not the training code. That is where the real work lives — croqaz says dataset processing and in-memory de-duplication ate the most time and RAM.
- Run his small pre-training config on the toy model (the pythia-14m-class experiment) on your own machine before renting anything. Watch the validation-loss curve.
- Change ONE thing: randomise your file chunks before tokenizing. croqaz’s early loss curves spiked precisely because clean books were tokenized in alphabetical order and the noisy OCR set arrived last. Order poisons training.
- Only then rent a GPU (RunPod, Vast.ai) for the larger run. Compute is the cheap commodity; your curated data is the asset.
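The de-duplication and shuffling steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not croqaz's actual pipeline: the function name `dedupe_and_shuffle`, the seed value, and the choice of exact-match hashing after whitespace and case normalisation are all assumptions made for the example. A real corpus would likely also need near-duplicate detection.

```python
import hashlib
import random


def dedupe_and_shuffle(chunks, seed=1900):
    """Drop exact duplicate chunks, then shuffle the survivors reproducibly.

    Hypothetical helper for illustration only.
    """
    seen = set()
    unique = []
    for chunk in chunks:
        # Normalise case and whitespace so trivially different copies collide.
        normalised = " ".join(chunk.lower().split())
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    # A seeded Random instance gives the same order on every run.
    random.Random(seed).shuffle(unique)
    return unique


# Toy stand-in for the failure mode: clean books first, noisy OCR last.
chunks = [f"clean book page {i}" for i in range(5)]
chunks += ["Clean  book page 0"]  # duplicate that differs only in case and spacing
chunks += [f"n0isy ocr page {i}" for i in range(5)]

mixed = dedupe_and_shuffle(chunks)
print(len(chunks), "chunks in,", len(mixed), "chunks out")
```

Hashing keeps only a short digest per chunk in memory instead of the full text, which matters once the corpus outgrows RAM. The fixed seed means two runs tokenize the data in the same mixed order, so their loss curves stay comparable.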
Atelier: Forget Victorian chatbots for a second — swap the corpus. A Swiss studio could train the same 340M shell on its own archive: every competition entry, every Bauleitung note, every detail drawing’s text, every BEP. Not to replace anyone, but to have a model that speaks the house’s language and survives the next software migration. The method (curate → dedupe → train small → fine-tune) is identical to the IFC and BIM data hygiene PAZ teams already grind through. The lesson is the same one we preach on the site: data curation, not model code, is the job. Bad data in, confident nonsense out.
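As a rough sketch of the "curate" step for a studio archive, the snippet below gathers plain-text exports into a single JSONL corpus file. Everything specific here is an assumption made for illustration: the function name `build_corpus`, the `.txt`-only filter, the 200-character minimum, and the `source`/`text` field names. It does not reflect how the Vintage LLM repo stores its data.

```python
import json
from pathlib import Path


def build_corpus(archive_dir, out_path, min_chars=200):
    """Collect .txt files under archive_dir into one JSONL file.

    Hypothetical helper: returns the number of documents kept.
    """
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Sorted traversal keeps the output stable between runs.
        for path in sorted(Path(archive_dir).rglob("*.txt")):
            raw = path.read_text(encoding="utf-8", errors="replace")
            text = " ".join(raw.split())  # collapse runs of whitespace
            if len(text) < min_chars:
                continue  # skip stubs and near-empty exports
            record = {
                "source": path.relative_to(archive_dir).as_posix(),
                "text": text,
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            kept += 1
    return kept
```

Calling `build_corpus("archive", "corpus.jsonl")` on a folder of exported notes would produce one JSON object per line. Keeping the `source` path next to each text makes it possible to trace a strange model output back to the file that taught it.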
Hack: This Hack teaches you to load a time-locked model and hear the year 1900 answer back. The medium is runnable code; the domain is AI/ML — a three-line inference call against the published weights.
from transformers import pipeline
gen = pipeline("text-generation", model="croqaz/vintage-LLM-340m-v1-base")
print(gen("The new railway from London", max_new_tokens=40)[0]["generated_text"])
One intention: prove to yourself that a 0.3B model, trained for the price of lunch, holds a coherent worldview — a narrow, dated, occasionally wrong one, because croqaz ran zero alignment on purpose to keep the period accuracy intact. That trade-off is the whole point: a censored 1900 is no longer 1900.
Here is the part that should reach past the hobby. The buildings that aged worst in my time were never the ugly ones; they were the ones nobody could repair because the proprietary format went dark. A model is a file too. croqaz built his on open weights and public-domain text (work published before 1900 is, with rare exceptions, out of copyright, which makes it a clean illustration of lawful training data under EU rules). That is not nostalgia; that is the only version of this that a 25-year-old can still open in 2051. If you train a house model this quarter, ask the one question that changed how my generation built: when the vendor disappears, can someone still load the file?
Learn-it:
- The project repo: github.com/croqaz/vintage-LLM — croqaz’s full pipeline, data scripts first.
- The write-up: Making a vintage LLM from scratch — the honest, mistake-by-mistake build log.
- Build-it-yourself course: rasbt/LLMs-from-scratch — Sebastian Raschka’s step-by-step PyTorch implementation.
- PAZ note: the same curate-and-dedupe discipline that wins here is the discipline that keeps your IFC round-trips clean — treat your atelier’s text archive as a trainable corpus, not a dead folder.
Clone the repo this week, read the data scripts before the model code, and run the three-line Hack to feel what an $80 worldview sounds like.
PAZ Kaffi · multidisciplinary editorial, led by PAZ Academy