Behind the Humanoid Boom: The Stack Architects Now Have to Specify
MIT Tech Review's robot-learning history hides a three-layer dependency stack. Architects specifying robotic fabrication in 2026 should draw it now.
MIT Technology Review’s “How robots learn: A brief, contemporary history” walks through three eras of the field: scripted rules, simulated trial-and-error, and the post-ChatGPT shift to models that ingest pixels, joint angles, and force readings and predict the next motor command thirty times a second. The piece leads with a real number: $6.1 billion went into humanoid robots in 2025, four times the 2024 figure. The narrative is clean. The dependency graph underneath is not.
←TODAY: $6.1B into humanoids in 2025; vision-language-action policies (RT-2, OpenVLA, π0) are shipping; LeRobot publishes real trajectory datasets on Hugging Face. →3012: A Zurich site shares its working envelope with semi-autonomous arms whose policies were trained on someone else’s footage. Fulcrum: Whoever owns the training trajectories owns the body.
The three layers under the headline
Strip the press release and you find a stack. Compute first — NVIDIA H100 / GB200 clusters and the cooling that keeps them alive. Then simulation infrastructure: NVIDIA Isaac Sim, MuJoCo, Genesis — the digital twins where domain randomization burns through millions of synthetic cube-rotations to make a policy robust to a real fingertip’s grip. Then trajectory data: the Open X-Embodiment corpus, the LeRobot datasets, the proprietary teleoperation logs that Tesla, Figure, and 1X will not show you. The “foundation policy” sits at the top of that stack, and every demo — OpenAI’s Dactyl in 2018, the new VLA models now — quietly inherits every layer below it.
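To make the simulation layer concrete, here is a minimal sketch of domain randomization: every simulated episode gets its own randomly drawn physics, so the policy cannot overfit to one friction value or one cube mass. The parameter names and ranges are hypothetical and chosen only for illustration; they are not taken from Isaac Sim, MuJoCo, Genesis, or the Dactyl work.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    """Hypothetical physics settings for one simulated episode."""
    friction: float        # fingertip-to-cube friction coefficient
    cube_mass_kg: float    # mass of the manipulated cube
    motor_delay_ms: float  # actuation latency


# Illustrative ranges only; real projects tune these per robot and task.
RANGES = {
    "friction": (0.5, 1.5),
    "cube_mass_kg": (0.03, 0.30),
    "motor_delay_ms": (0.0, 40.0),
}


def sample_params(rng: random.Random) -> SimParams:
    """Draw one independent uniform sample per parameter."""
    return SimParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})


def randomized_episodes(n: int, seed: int = 0) -> list[SimParams]:
    """Return n parameter sets, one per training episode."""
    rng = random.Random(seed)
    return [sample_params(rng) for _ in range(n)]


if __name__ == "__main__":
    for params in randomized_episodes(3):
        print(params)
```

A real pipeline would pass each parameter set to the simulator before rolling out the policy; that hand-off is simulator-specific and is left out here.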
That stack is what changed in 2022. Cynthia Breazeal’s Jibo, the lamp-shaped social robot that raised $3.7M on crowdfunding and shipped scripted snippets to children, did not fail because of bad industrial design. It failed because the language layer was not there yet. The post-ChatGPT generation has the language layer; the trade-off is that the layer lives in someone else’s data center, on someone else’s weights, behind someone else’s billing rate.
This is where the PAZ archive’s Attention (transformers) — En Obra panel earns its place: the lineage from Transformer (2017) to ViT (2020) to Point Transformer V3 (2024) is the same operator that lets a robot ingest a LiDAR scan, a natural-language instruction, and a joint-state vector in one forward pass. The operator works. The supply chain underneath does not yet — and a working architect specifying robotic fabrication in 2026 is exposed to every link of it.
Atelier: When PAZ briefs a robotic fabrication module — the Dougong-joint research, the BIM-to-BoT timber-framing work, the panel-assembly papers coming out of the construction-robotics lane — we now write two parallel specifications: the toolpath for the scripted regime and the policy contract for the learned regime (which model, which dataset lineage, what the arm does when the sensor stream degrades). The second document did not exist three years ago. In a Swiss Wettbewerb, it is the document that allocates the risk.
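One way to picture the policy contract is as a small typed record covering the three questions above: which model, which dataset lineage, and what the arm does when the sensor stream degrades. This is a sketch under stated assumptions, not PAZ's actual document format; every field name, threshold, and fallback option below is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Fallback(Enum):
    """What the arm does when its sensor stream degrades (illustrative options)."""
    HOLD_POSITION = "hold_position"
    RETRACT_TO_HOME = "retract_to_home"
    HAND_TO_SCRIPTED_TOOLPATH = "hand_to_scripted_toolpath"


@dataclass(frozen=True)
class PolicyContract:
    model: str                        # which learned policy drives the arm
    dataset_lineage: tuple[str, ...]  # datasets the policy was trained on
    max_sensor_gap_ms: int            # longest tolerated gap between sensor frames
    fallback: Fallback                # behaviour once that gap is exceeded

    def action_for_gap(self, gap_ms: int) -> str:
        """Return 'policy' while the stream is healthy, else the fallback name."""
        if gap_ms <= self.max_sensor_gap_ms:
            return "policy"
        return self.fallback.value


# Placeholder values for illustration, not a real specification.
contract = PolicyContract(
    model="example-vla-policy",
    dataset_lineage=("lerobot/aloha_sim_transfer_cube_human",),
    max_sensor_gap_ms=100,
    fallback=Fallback.HOLD_POSITION,
)

if __name__ == "__main__":
    print(contract.action_for_gap(33))   # roughly one frame at thirty frames a second
    print(contract.action_for_gap(250))  # the stream has degraded
```

Writing the fallback as an explicit enumerated field forces the specification to name a behaviour for the degraded case instead of leaving it to the vendor's defaults.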
Hack: This Hack teaches you to download a real robot-learning trajectory dataset so you can see what the policy actually trains on — joint angles, gripper commands, matched camera frames. Pick LeRobot’s bimanual cube-transfer set; it is small enough to inspect on a laptop and serious enough to reveal the data format every humanoid lab is converging on.
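pip install datasets huggingface_hub
python -c "
from datasets import load_dataset
# Download the frame-level table from the Hugging Face Hub.
ds = load_dataset('lerobot/aloha_sim_transfer_cube_human', split='train')
# List the columns stored for each frame.
print(ds.column_names)
# Show the first three values of the state and action vectors in the first frame.
print(ds[0]['observation.state'][:3], ds[0]['action'][:3])
"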
You will see vectors of joint positions paired with action commands. Alongside the camera frames, that is the entire vocabulary of a 2026 humanoid policy. Read three episodes; you will understand more about the field than most pitch decks contain.
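To read the data episode by episode, the frames first have to be grouped. The sketch below assumes each row carries an `episode_index` column, as LeRobot-format tables typically do; check the printed column names before relying on it. The rows here are synthetic stand-ins so the example runs offline, and `group_by_episode` is a helper name introduced for illustration.

```python
from collections import defaultdict
from typing import Any, Iterable, Mapping


def group_by_episode(rows: Iterable[Mapping[str, Any]]) -> dict[int, list[Mapping[str, Any]]]:
    """Bucket frame rows by their episode_index, preserving frame order."""
    episodes: dict[int, list[Mapping[str, Any]]] = defaultdict(list)
    for row in rows:
        episodes[int(row["episode_index"])].append(row)
    return dict(episodes)


# Synthetic rows standing in for the real dataset; the values are made up.
rows = [
    {"episode_index": 0, "observation.state": [0.00, 0.10], "action": [0.00, 0.12]},
    {"episode_index": 0, "observation.state": [0.00, 0.12], "action": [0.01, 0.15]},
    {"episode_index": 0, "observation.state": [0.01, 0.15], "action": [0.02, 0.18]},
    {"episode_index": 1, "observation.state": [0.50, 0.40], "action": [0.48, 0.41]},
    {"episode_index": 1, "observation.state": [0.48, 0.41], "action": [0.46, 0.43]},
    {"episode_index": 2, "observation.state": [0.90, 0.20], "action": [0.88, 0.22]},
]

if __name__ == "__main__":
    episodes = group_by_episode(rows)
    for index in sorted(episodes)[:3]:
        frames = episodes[index]
        print(index, len(frames), frames[0]["observation.state"], frames[0]["action"])
```

With the real dataset loaded as `ds`, calling `group_by_episode(ds)` should behave the same way, since iterating a `datasets.Dataset` yields one dict per row; it walks every frame, so expect it to take a moment.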
Here is the part I remember from where I am writing. We did not run out of compute in my time. We ran out of intact cooling, intact bandwidth, and intact people who remembered how the old system worked. The humanoid boom of 2025 was real and most of it survived — but only the projects that drew their actual dependency graph survived intact. Draw yours this week. Not the marketing architecture diagram. The real one: model provider, weight host, simulation backend, sensor firmware, the one teleoperator whose recordings half your policy was trained on. The third single point of failure you did not know you had is the whole point of the exercise.
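The exercise can be started in a few lines. This is a minimal sketch, assuming you can write your stack down as "what depends on what" plus a list of fallbacks you actually have; every node name and fallback below is a hypothetical placeholder, not a real vendor or a real PAZ project.

```python
# Each key depends on every item in its list. Names are placeholders.
DEPENDS_ON = {
    "fabrication cell": ["policy", "sensor firmware"],
    "policy": ["model provider", "weight host", "training data"],
    "training data": ["simulation backend", "teleoperator recordings"],
}

# External dependencies for which a working alternative already exists.
FALLBACKS = {
    "weight host": ["local mirror"],
    "simulation backend": ["second simulator"],
}


def transitive_dependencies(node: str, graph: dict[str, list[str]]) -> set[str]:
    """Collect everything the node depends on, directly or indirectly."""
    seen: set[str] = set()
    stack = list(graph.get(node, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def single_points_of_failure(
    root: str, graph: dict[str, list[str]], fallbacks: dict[str, list[str]]
) -> list[str]:
    """External (leaf) dependencies of root that have no recorded fallback."""
    leaves = {d for d in transitive_dependencies(root, graph) if d not in graph}
    return sorted(d for d in leaves if not fallbacks.get(d))


if __name__ == "__main__":
    for name in single_points_of_failure("fabrication cell", DEPENDS_ON, FALLBACKS):
        print(name)
```

The sketch treats only leaf nodes as candidates, on the assumption that internal nodes such as the policy are things you assemble yourself; extend the fallback table honestly, listing only alternatives that have been tested.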
Source: MIT Technology Review