Structured Label Smoothing
over Finite Scalar Quantization
for Discrete World Models

Florent Tariolle
INSA Rouen Normandy
Correspondence to florent.tariolle@insa-rouen.fr
Pre-freeze research. The model freeze is targeted for 2026-05-31, ahead of a NeurIPS 2026 workshop submission. This page describes the method and architecture; numerical results and ablation tables will land with the final model.

Abstract

Discrete World Models tokenize observations with a learned quantizer and predict next-frame tokens with a transformer, but standard cross-entropy treats every incorrect prediction as equally wrong. Finite Scalar Quantization (FSQ) makes a richer signal available by construction: each code sits on an integer coordinate lattice, so a token one step away in one dimension is a near-miss while a token at the opposite corner is a gross error.

We introduce Structured Label Smoothing (SLS), which replaces the one-hot training target with a kernel over codebook coordinates, so a near-miss prediction is treated as a near-miss rather than a gross error. An isotropic kernel with bandwidth fixed by a first-neighbour rule gives a zero-calibration hyperparameter that is robust to codebook drift.

We integrate SLS into a complete Vision-Model-Controller (V-M-C) pipeline for Geometry Dash. The FSQ-VAE tokenizes Sobel edge maps, a causal transformer predicts next-frame tokens, and a CNN actor-critic on the predicted token grid drives the agent. The controller is trained entirely in imagination on dreamed rollouts and deployed at 30 FPS on the real game via screen capture.

TL;DR. A training objective that exploits the FSQ codebook lattice so near-miss predictions are treated as near-misses rather than gross errors, integrated into a real-time V-M-C pipeline that runs at 30 FPS on Geometry Dash.


Live deployment

Controller trained entirely in imagination, deployed on the real game at 30 FPS via screen capture.

Pipeline overview

[Figure: Vision-Model-Controller pipeline overview]
Vision-Model-Controller pipeline. The FSQ-VAE (V) tokenizes Sobel edge maps into a grid of discrete codes, the causal transformer (M) predicts next-frame tokens with action tokens interleaved in the sequence, and the CNN controller (C) reads the predicted token grid alongside the transformer hidden state to select actions. The action feedback loop enables autoregressive dreaming.
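To make the action feedback loop concrete, below is a minimal Python sketch of one autoregressive dream rollout. All names (V, M, C, encode, predict_next, act, hidden_state) are hypothetical placeholders chosen for illustration, not the project's actual interfaces.

```python
# Hypothetical sketch of autoregressive dreaming in the V-M-C pipeline.
# Every identifier here is an illustrative placeholder, not the real API.
def dream_rollout(V, M, C, first_frame, horizon=64):
    """Roll the world model forward in token space, starting from one real frame."""
    tokens = V.encode(first_frame)          # FSQ-VAE: Sobel edge map -> grid of discrete token ids
    trajectory = []
    for _ in range(horizon):
        hidden = M.hidden_state()           # causal transformer state after the latest tokens
        action = C.act(tokens, hidden)      # CNN actor-critic reads token grid + hidden state
        tokens, reward, done = M.predict_next(tokens, action)  # next-frame tokens, action interleaved
        trajectory.append((tokens, action, reward))
        if done:                            # dreamed episode termination
            break
    return trajectory
```

The only coupling between M and C in this sketch is the predicted token grid and the transformer hidden state, matching the pipeline description above.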

Structured Label Smoothing

One-hot cross-entropy ignores codebook geometry entirely. SLS replaces the one-hot training target with a kernel over FSQ lattice coordinates, so the target and the tokens one quantization step away from it carry almost all of the supervision mass, while distant tokens receive negligible weight. Concretely, given the target token $i$ and an alternative $j$ with FSQ coordinates $\mathbf{c}(i), \mathbf{c}(j)$ on the integer lattice:

$$q_{\text{SLS}}(j \mid i) \;\propto\; k\!\left(\lVert \mathbf{c}(j) - \mathbf{c}(i) \rVert\right), \qquad k(d) = e^{-d^{2}/2\sigma^{2}}.$$

The bandwidth $\sigma$ is fixed by a first-neighbour rule: the kernel drops to half its maximum at the immediate lattice neighbour, i.e. $k(1) = \tfrac{1}{2}$, so $\sigma = 1/\sqrt{2\ln 2} \approx 0.85$. This gives a zero-calibration hyperparameter that is robust to codebook drift. Classical uniform label smoothing is the structureless limit of this construction.
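A minimal NumPy sketch of the target construction, assuming a 4-dimensional FSQ lattice with hypothetical levels [5, 5, 5, 5] and a standard index-to-coordinate enumeration (neither is specified above):

```python
import numpy as np

def fsq_coordinates(levels):
    """Enumerate the integer lattice coordinates of every codebook index."""
    grids = np.meshgrid(*[np.arange(l) for l in levels], indexing="ij")
    return np.stack([g.reshape(-1) for g in grids], axis=-1)   # shape (V, D)

def sls_targets(levels, target_idx):
    """Kernel-smoothed target distribution q_SLS(. | target_idx) over the codebook."""
    coords = fsq_coordinates(levels).astype(np.float64)        # (V, D) lattice coordinates
    sigma = 1.0 / np.sqrt(2.0 * np.log(2.0))                    # first-neighbour rule: k(1) = 1/2
    d2 = ((coords - coords[target_idx]) ** 2).sum(-1)           # squared lattice distance to target
    q = np.exp(-d2 / (2.0 * sigma ** 2))                        # Gaussian kernel over the lattice
    return q / q.sum()                                           # normalize to a distribution

levels = [5, 5, 5, 5]                    # hypothetical FSQ levels; V = 625 codes
q = sls_targets(levels, target_idx=0)
print(q.shape, q.max(), q.min())
```

The transformer's cross-entropy loss is then taken against this smoothed distribution instead of the one-hot target; a flat kernel would recover classical uniform label smoothing.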

[Figure: SLS kernel visualization: 2D slice and sorted distribution vs. uniform]
(a) 2D slice of SLS weights over two FSQ lattice dimensions with the remaining two fixed at the target's values; the red square marks the target token and weights decay with lattice distance. (b) All V−1 alternative tokens sorted by SLS weight (log scale) versus uniform label smoothing. SLS concentrates mass on near-miss tokens while suppressing distant ones by orders of magnitude.

Method highlights