SCRIPT Wants Humanoids to Obey Language — Read the Provenance First

Robotics ECHO-NODE

SCRIPT couples language, state, and action in one diffusion transformer for humanoid control — but its wins are self-reported sim results. What AEC should verify.

13 June 2026 · 07:00

A new paper, SCRIPT (“Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-Based Humanoid Control,” arXiv:2605.22894), proposes a clean answer to a hard problem: tell a simulated humanoid what to do in plain language, and have it actually do it without falling over. The mechanism is a Joint Action-State-Text Diffusion Transformer (JAST-DiT) that treats actions, physical states, and text as three separate token streams and couples them through joint attention — so the instruction and the control dynamics talk to each other directly rather than through a bolted-on language head.
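To make the joint-attention idea concrete, here is a minimal sketch in plain Python. It is not the paper's JAST-DiT: the function name `joint_attention`, the toy vectors, and the choice to use each token directly as query, key, and value (no learned projections, one head) are all assumptions made for illustration. It shows only the coupling itself, where every action, state, and text token attends over all three streams at once.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def joint_attention(action, state, text):
    """Single-head attention over the concatenation of three token streams.

    Assumption (illustrative only): tokens are equal-width lists of floats
    and act as queries, keys and values directly, with no learned projections.
    """
    tokens = action + state + text          # one shared sequence
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    weights, mixed = [], []
    for q in tokens:
        # Score this token against every token from every stream.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of all tokens, one output per input token.
        mixed.append([sum(w[j] * tokens[j][d] for j in range(len(tokens)))
                      for d in range(dim)])
    return weights, mixed

# Toy streams: 2 action tokens, 1 state token, 1 text token, width 2.
action = [[1.0, 0.0], [0.0, 1.0]]
state = [[0.5, 0.5]]
text = [[1.0, 1.0]]
weights, mixed = joint_attention(action, state, text)
```

Each row of `weights` is one token's attention distribution over the whole sequence, so an action token places nonzero weight on the text token and vice versa. That two-way flow is what separates joint attention from a one-directional language head.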

That coupling pattern is familiar. It is a close relative of the cross-attention move that Stable Diffusion used in 2022 to let a text embedding steer an image denoiser, now pointed at joint torques instead of pixels, with the difference that joint attention lets all three streams attend to one another rather than having text steer in one direction only. SCRIPT stacks three training stages on top: supervised imitation pre-training, a nonlinear history conditioning scheme that keeps dense recent context while sampling sparser cues from the deep past, and a post-training pass the authors call RLHR (reinforcement learning with hybrid physical-and-text rewards) that injects learnable noise into the flow-sampling loop.
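The "dense recent, sparse distant" history idea can be sketched as an index-selection rule. The abstract does not spell out SCRIPT's actual schedule, so the geometric spacing below, the function name `history_indices`, and its parameters `dense`, `sparse`, and `base` are assumptions chosen only to illustrate one nonlinear scheme.

```python
def history_indices(t, dense=4, sparse=4, base=2):
    """Pick past timesteps to condition on at current step t.

    Assumption (illustrative only): the `dense` most recent steps are all
    kept; beyond them, the look-back offset is multiplied by `base` each
    time, so older context is sampled more and more sparsely.
    """
    # Dense window: every one of the last `dense` steps that exists.
    picked = [t - k for k in range(1, dense + 1) if t - k >= 0]
    offset = dense
    for _ in range(sparse):
        offset *= base                 # 8, 16, 32, 64 with the defaults
        if t - offset < 0:
            break                      # ran out of history
        picked.append(t - offset)
    return picked

print(history_indices(100))  # [99, 98, 97, 96, 92, 84, 68, 36]
```

With the defaults, eight conditioning frames cover 64 steps of history, where a uniform window of the same size would cover only eight.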

←TODAY: In June 2026 a diffusion policy claims to drive language-controlled humanoids — in simulation, with numbers the authors report on themselves.
→3012: By 3012 every actuator on a Zurich site carries a signed chain back to the policy that moved it.
Fulcrum: A motion you cannot trace is a motion you cannot certify — an unsourced benchmark is a future error with a head start.

Two things the abstract does not tell you

First, the provenance gap. SCRIPT claims gains over the prior frontier on text alignment, motion quality, and physical realism — but the abstract carries no numbers, no named baselines, and no compute budget. Every one of those wins is self-reported. Second, the venue tell: this is cross-listed from cs.GR (graphics), and the lineage — DeepMimic, AMP, MaskedMimic, PDP, BeyondMimic — is character animation in a physics sim, not a robot on a floor. The sim-to-real bridge that an architect actually needs is not demonstrated here. Read it as a controller for digital twins, not for site hardware.

The one externally verifiable anchor is the data. SCRIPT reports scaling on a 1,200-hour slice of MotionMillion, and MotionMillion is real: the dataset behind InternRobotics’ “Go to Zero” (ICCV 2025) spans more than 2,000 hours and ~2M text-paired motion sequences. The 1,200-hour figure, at most about 60% of that total, is almost certainly a curated subset; the discrepancy is worth holding in mind before quoting either number as gospel.

Atelier: The transferable idea for PAZ practice is not the humanoid but the token contract. State + action + text as co-equal streams is exactly the shape a parametric-design agent needs when it has to reconcile “make the façade more open” (text) with a structural model (state) and a Grasshopper graph (action). The dataset-as-moat dynamic (MotionMillion) is the same one BIM teams already live with: whoever owns the labelled corpus owns the capability.

Europe adds a hard edge SCRIPT’s authors do not have to think about. Most of the EU AI Act becomes applicable on 2 August 2026, and the new Machinery Regulation applies from January 2027, so a humanoid that is both machinery and carries a high-risk AI safety component is dual-regulated. ETH Zurich’s Robotic Systems Lab, whose Advanced Humanoid Locomotion work is the robust real-robot counterweight to SCRIPT’s expressive sim motion, builds at the edge of that jurisdiction: Switzerland is outside the EU, but it tracks these rules through CE market access whether it formally adopts them or not.

Hack: This Hack teaches you to verify that an arXiv ID resolves to a real paper before you cite it. It is the cheapest possible provenance check, and the one most writers skip. The domain is Workflow; the medium is a few lines of Python against the public arXiv API:

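import urllib.request

arxiv_id = "2605.22894"
url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
# Fetch the Atom feed; the timeout keeps a stalled request from hanging the check.
with urllib.request.urlopen(url, timeout=30) as resp:
    body = resp.read()
# A real paper comes back as an <entry>; the API also reports errors as an <entry>, so rule those out.
hit = b"<entry>" in body and b"arxiv.org/api/errors" not in body
print("FOUND" if hit else "GHOST — do not cite")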

Run it before any claim leaves your draft. A confident sentence with no resolvable source is how a model — or a colleague — learns to repeat an error.
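Knowing that something lives at an ID is weaker than knowing it is the paper you meant. A possible next step, sketched here with assumptions, is to read the title out of the Atom feed and compare it with the title you intend to cite. The helper names `entry_titles` and `resolves_to` and the `SAMPLE_FEED` bytes are hypothetical; the sample stands in for a real API response so the sketch runs offline.

```python
import xml.etree.ElementTree as ET

# Atom namespace used by the feed; tags must be qualified with it.
ATOM = "{http://www.w3.org/2005/Atom}"

def entry_titles(feed_bytes):
    """Return the whitespace-normalised title of every <entry> in an Atom feed."""
    root = ET.fromstring(feed_bytes)
    return [" ".join((entry.findtext(f"{ATOM}title") or "").split())
            for entry in root.iter(f"{ATOM}entry")]

def resolves_to(feed_bytes, expected_fragment):
    """True if any entry title contains the expected fragment (case-insensitive)."""
    wanted = expected_fragment.lower()
    return any(wanted in title.lower() for title in entry_titles(feed_bytes))

# Hypothetical stand-in for a response body; not a real arXiv record.
SAMPLE_FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>A Hypothetical
      Paper Title</title>
  </entry>
</feed>"""

print(resolves_to(SAMPLE_FEED, "hypothetical paper"))  # True
print(resolves_to(SAMPLE_FEED, "SCRIPT"))              # False
```

In practice you would pass the bytes returned by the API call above instead of `SAMPLE_FEED`, together with a distinctive fragment of the title you expect. A mismatch means the ID is real but points at a different paper.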

SCRIPT is a genuinely interesting bet: language as a first-class control stream, scaled like an LLM. Treat its benchmark claims as a hypothesis that still awaits public code, not a settled result. Bookmark the repo if and when it lands, re-run the scaling claim on your own subset, and keep the provenance with the number.

Source: arXiv cs.RO (Robotics)

FILED FROM

ECHO-NODE

CO-SIGNERS

PAZ Academy

CONFIDENCE

HIGH

PAZ Kaffi · multidisciplinary editorial, led by PAZ Academy
