Structured Label Smoothing
over Finite Scalar Quantization
for Discrete World Models

Florent Tariolle
INSA Rouen Normandy
Correspondence to florent.tariolle@insa-rouen.fr
Pre-freeze research. The model freeze is targeted for 2026-05-31, ahead of a NeurIPS 2026 workshop submission. This page describes the method and architecture; numerical results and ablation tables will land with the final model.

Abstract

Discrete World Models tokenize observations with a learned quantizer and predict next-frame tokens with a transformer, but standard cross-entropy treats every incorrect prediction as equally wrong. Finite Scalar Quantization (FSQ) makes a richer signal available by construction: each code sits on an integer coordinate lattice, so a token one step away in one dimension is a near-miss while a token at the opposite corner is a gross error.

We introduce Structured Label Smoothing (SLS), which replaces the one-hot training target with a kernel over codebook coordinates, so a near-miss prediction is treated as a near-miss rather than a gross error. An isotropic kernel with bandwidth fixed by a first-neighbour rule gives a zero-calibration hyperparameter that is robust to codebook drift.

We integrate SLS into a complete Vision-Model-Controller (V-M-C) pipeline for Geometry Dash. The FSQ-VAE tokenizes Sobel edge maps, a causal transformer predicts next-frame tokens, and a CNN actor-critic on the predicted token grid drives the agent. The controller is trained entirely in imagination on dreamed rollouts and deployed at 30 FPS on the real game via screen capture.

TL;DR. A training objective that exploits the FSQ codebook lattice so near-miss predictions are treated as near-misses rather than gross errors, integrated into a real-time V-M-C pipeline that runs at 30 FPS on Geometry Dash.


Live deployment

Controller trained entirely in imagination, deployed on the real game at 30 FPS via screen capture.

Pipeline overview

[Figure: Vision-Model-Controller pipeline overview]
Vision-Model-Controller pipeline. The FSQ-VAE (V) tokenizes Sobel edge maps into a grid of discrete codes, the causal transformer (M) predicts next-frame tokens with action tokens interleaved in the sequence, and the CNN controller (C) reads the predicted token grid alongside the transformer hidden state to select actions. The action feedback loop enables autoregressive dreaming.
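To make the action feedback loop concrete, below is a minimal Python sketch of one autoregressive dream rollout. All names (V, M, C, encode, predict_next, act, hidden_state) are hypothetical placeholders chosen for illustration, not the project's actual interfaces.

```python
# Hypothetical sketch of autoregressive dreaming in the V-M-C pipeline.
# Every identifier here is an illustrative placeholder, not the real API.
def dream_rollout(V, M, C, first_frame, horizon=64):
    """Roll the world model forward in token space, starting from one real frame."""
    tokens = V.encode(first_frame)          # FSQ-VAE: Sobel edge map -> grid of discrete token ids
    trajectory = []
    for _ in range(horizon):
        hidden = M.hidden_state()           # causal transformer state after the latest tokens
        action = C.act(tokens, hidden)      # CNN actor-critic reads token grid + hidden state
        tokens, reward, done = M.predict_next(tokens, action)  # next-frame tokens, action interleaved
        trajectory.append((tokens, action, reward))
        if done:                            # dreamed episode termination
            break
    return trajectory
```

The only coupling between M and C in this sketch is the predicted token grid and the transformer hidden state, matching the pipeline description above.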

Structured Label Smoothing

One-hot cross-entropy ignores codebook geometry entirely. SLS replaces the one-hot training target with a kernel over FSQ lattice coordinates, so the target and the tokens one quantization step away from it carry almost all of the supervision mass, while distant tokens receive negligible weight. Concretely, given the target token $i$ and an alternative $j$ with FSQ coordinates $\mathbf{c}(i), \mathbf{c}(j)$ on the integer lattice:

$$q_{\text{SLS}}(j \mid i) \;\propto\; k\!\left(\lVert \mathbf{c}(j) - \mathbf{c}(i) \rVert\right), \qquad k(d) = e^{-d^{2}/2\sigma^{2}}.$$

The bandwidth $\sigma$ is fixed by a first-neighbour rule: the kernel drops to half its maximum at the immediate lattice neighbour, i.e. $k(1) = \tfrac{1}{2}$, so $\sigma = 1/\sqrt{2\ln 2} \approx 0.85$. This gives a zero-calibration hyperparameter that is robust to codebook drift. Classical uniform label smoothing is the structureless limit of this construction.
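A minimal NumPy sketch of the target construction, assuming a 4-dimensional FSQ lattice with hypothetical levels [5, 5, 5, 5] and a standard index-to-coordinate enumeration (neither is specified above):

```python
import numpy as np

def fsq_coordinates(levels):
    """Enumerate the integer lattice coordinates of every codebook index."""
    grids = np.meshgrid(*[np.arange(l) for l in levels], indexing="ij")
    return np.stack([g.reshape(-1) for g in grids], axis=-1)   # shape (V, D)

def sls_targets(levels, target_idx):
    """Kernel-smoothed target distribution q_SLS(. | target_idx) over the codebook."""
    coords = fsq_coordinates(levels).astype(np.float64)        # (V, D) lattice coordinates
    sigma = 1.0 / np.sqrt(2.0 * np.log(2.0))                    # first-neighbour rule: k(1) = 1/2
    d2 = ((coords - coords[target_idx]) ** 2).sum(-1)           # squared lattice distance to target
    q = np.exp(-d2 / (2.0 * sigma ** 2))                        # Gaussian kernel over the lattice
    return q / q.sum()                                           # normalize to a distribution

levels = [5, 5, 5, 5]                    # hypothetical FSQ levels; V = 625 codes
q = sls_targets(levels, target_idx=0)
print(q.shape, q.max(), q.min())
```

The transformer's cross-entropy loss is then taken against this smoothed distribution instead of the one-hot target; a flat kernel would recover classical uniform label smoothing.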

[Figure: SLS kernel visualization: 2D slice and sorted distribution vs. uniform]
(a) 2D slice of SLS weights over two FSQ lattice dimensions with the remaining two fixed at the target's values; the red square marks the target token and weights decay with lattice distance. (b) All V−1 alternative tokens sorted by SLS weight (log scale) versus uniform label smoothing. SLS concentrates mass on near-miss tokens while suppressing distant ones by orders of magnitude.

Method highlights