Discrete Stochastic Localization
Abstract
Continuous diffusion is a natural framework for non-autoregressive generation but has generally lagged behind masked discrete diffusion models (MDMs) on discrete sequence generation. We argue that the bottleneck is not continuity itself, but a representation in which denoising depends on timestep-indexed noise regimes. We introduce Discrete Stochastic Localization (DSL), a continuous-state framework with unit-sphere token embeddings whose Bayes-optimal denoiser is invariant to the nominal signal-to-noise ratio (SNR) under the localization channel. One trained network then supports an entire family of per-token SNR paths, with endpoint masked-diffusion paths as a special case. Fine-tuning a pretrained MDLM checkpoint with DSL substantially improves distributional faithfulness (MAUVE) on OpenWebText across all step budgets from T=128 to T=1024, and the same checkpoint supports random-order autoregressive sampling, as well as a hybrid continuous-then-discrete sampler using as few as T=48 total steps—without distillation or retraining.
State Space Intuition
From [MASK] to soft token evidence to hard tokens. DSL makes token states continuous: zero information decodes as [MASK], finite SNR carries soft evidence, and high SNR localizes near hard token anchors.
Zero information
At zero SNR, the noisy token carries no token identity and is decoded as [MASK].
Soft evidence
At finite SNR, the vector points toward a region of the token embedding table rather than a single token.
Localization
As SNR increases, posterior mass concentrates near one token anchor and the state becomes nearly clean.
Per-token SNR
Different positions can occupy different confidence levels, so sampling becomes path design in SNR space.
Noisy tokens move from the [MASK] center through soft evidence before localizing near token anchors.
Unified Perspective
Sampling algorithms are different paths in per-token SNR space. Once DSL defines a per-token SNR state space, a sampling algorithm becomes a path through that space. Different algorithms simply choose different ways to move token positions from zero information to clean tokens.
Token-level generation order. Each row shows how information enters token positions. AR and masked diffusion jump to endpoint tokens; continuous diffusion gradually fills all positions; DSL allows different finite-SNR levels per token.
[Figure: token-level generation order and paths in $(\gamma_1,\gamma_2)$ for Autoregressive, Masked diffusion, Continuous dLLM, and DSL.]
| Method | State support | Information path | What DSL changes |
|---|---|---|---|
| AR | clean prefix + unknown future | one token moves at a time in sequential order | endpoint reveal becomes one query path |
| Random-order AR | clean subset + unknown rest | reveal coordinates in arbitrary order | supported without a sampler-specific model |
| Masked dLLM | [MASK] / clean endpoints | reveal subsets; ReMDM can reopen tokens | endpoint paths use the same posterior estimator |
| Continuous diffusion | finite-SNR noisy tokens everywhere | jointly increase SNR at every position of the sequence | finite-SNR states become part of the same token-posterior space |
| DSL | per-token SNR noisy tokens | arbitrary path through a shared posterior field | one time-invariant denoiser serves many paths |
DSL Method
Training workflow
Unit-norm embedding
Every token embedding lives on the unit hypersphere. This shared norm is the key structural constraint.
Stochastic localization
Sample noisy tokens with $z_t=t x_0+\sqrt{t}\epsilon$, where $t$ is the SNR (written $\gamma_i$ per token in the technical details). Higher SNR means more token information.
Converter
Convert each noisy token into a soft combination of token embeddings using similarities to the embedding table.
Denoiser
Train a Diffusion Transformer (DiT) denoiser $p_\theta(s_i\mid z)$.
4.1 Three-step overview
Unit-norm token table
Map every token $v$ to an embedding $x_v$ on the unit sphere:
$$\|x_v\|_2=1.$$
This is not only a visualization choice. It is the geometric constraint that makes the posterior simplify under the localization channel.
When all clean token embeddings have the same norm, the nominal-SNR penalty term becomes constant across token candidates and cancels out of the posterior.
Stochastic localization
Given a clean token embedding $x_i$, DSL corrupts it through the localization channel
$$z_i=\gamma_i x_i+\sqrt{\gamma_i}\epsilon_i,\qquad \epsilon_i\sim\mathcal N(0,I).$$
The SNR is $\gamma_i$. At $\gamma_i=0$, the token carries no information and behaves like [MASK]. At finite SNR, it carries soft token evidence. At high SNR, it localizes near a clean token anchor.
This gives continuous token states a discrete posterior interpretation.
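The channel above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the three-token 2D table, the helper names `normalize`, `localize`, and `cosine`, and the chosen SNR values are all assumptions made for the example.

```python
import math
import random

def normalize(v):
    """Project a vector onto the unit sphere."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def localize(x, gamma, rng):
    """Sample z = gamma * x + sqrt(gamma) * eps with eps ~ N(0, I)."""
    s = math.sqrt(gamma)
    return [gamma * c + s * rng.gauss(0.0, 1.0) for c in x]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb)

rng = random.Random(0)
# Hypothetical 3-token table in 2D; a real table is learned and far larger.
table = [normalize(v) for v in ([1.0, 0.2], [-0.5, 1.0], [-0.7, -0.9])]
x = table[0]

z_mask = localize(x, 0.0, rng)    # zero SNR: the origin, no token identity
z_soft = localize(x, 1.0, rng)    # finite SNR: noisy, soft evidence
z_hard = localize(x, 400.0, rng)  # high SNR: direction is close to x

print(z_mask, z_soft, cosine(z_hard, x))
```

At $\gamma_i=0$ both terms vanish, so every token maps to the same point, which is why that state can play the role of [MASK].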
Time-invariant denoising
DSL trains a denoiser to estimate the clean-token posterior from the noisy state $z$.
Because the noisy token is represented in posterior coordinates, the Bayes-optimal denoiser depends on the noisy token state, not on the nominal SNR that produced it.
This is the structural reason one model can support masked reveal paths, random-order AR paths, continuous denoising paths, and hybrid paths.
4.2 Why the noisy token is enough
The cancellation that makes DSL work
Under the localization channel, the clean-token posterior has the form
$$p(x\mid z,\gamma)\propto p(x)\exp\left(z\cdot x-\frac{\gamma}{2}\|x\|^2\right).$$
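One way to see where this form comes from is to expand the Gaussian likelihood of the localization channel (a short derivation sketch from the channel definition, not quoted from the paper). Conditioned on $x$, $z\sim\mathcal N(\gamma x,\gamma I)$, so

$$\log p(z\mid x,\gamma)=-\frac{\|z-\gamma x\|^2}{2\gamma}+\text{const}=-\frac{\|z\|^2}{2\gamma}+z\cdot x-\frac{\gamma}{2}\|x\|^2+\text{const}.$$

The term $-\|z\|^2/(2\gamma)$ does not depend on the candidate $x$, so it is absorbed by normalization. Multiplying by the prior $p(x)$ then gives the posterior form stated here.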
The only term that explicitly contains the nominal SNR $\gamma$ is
$$-\frac{\gamma}{2}\|x\|^2.$$
If all token embeddings lie on the unit sphere, then $\|x\|^2$ is constant across candidate tokens. After posterior normalization, this term cancels:
$$p(x\mid z)\propto p(x)\exp(z\cdot x).$$
So the Bayes-optimal posterior mean becomes
$$\hat{x}(z)=\frac{\mathbb E_{P(x)}[x\exp(x\cdot z)]}{\mathbb E_{P(x)}[\exp(x\cdot z)]}.$$
The denoiser no longer needs to know which nominal SNR generated $z$. It only needs the noisy token itself.
DSL needs both ingredients: stochastic localization makes $z$ a natural coordinate for the posterior, and unit-norm geometry makes the SNR-dependent norm term constant. Neither ingredient alone is enough.
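The cancellation can be checked numerically. The sketch below evaluates the posterior with the $-\frac{\gamma}{2}\|x\|^2$ term included, at two different values of $\gamma$, for a unit-norm table and for a table with unequal norms. The tables, prior, and test point are made-up values for illustration.

```python
import math

def posterior(z, table, prior, gamma):
    """p(x_v | z, gamma) proportional to prior[v] * exp(z.x_v - gamma/2 * |x_v|^2)."""
    logits = [
        math.log(p)
        + sum(a * b for a, b in zip(z, x))
        - 0.5 * gamma * sum(b * b for b in x)
        for p, x in zip(prior, table)
    ]
    top = max(logits)  # subtract the max for numerical stability
    w = [math.exp(l - top) for l in logits]
    total = sum(w)
    return [v / total for v in w]

def max_gap(p, q):
    """Largest absolute difference between two probability vectors."""
    return max(abs(a - b) for a, b in zip(p, q))

prior = [0.5, 0.3, 0.2]
z = [0.8, -0.3]

unit_table = [[1.0, 0.0], [0.0, 1.0], [-0.6, 0.8]]   # all norms equal 1
mixed_table = [[1.0, 0.0], [0.0, 2.0], [-0.6, 0.8]]  # second norm is 2

unit_gap = max_gap(posterior(z, unit_table, prior, 0.1),
                   posterior(z, unit_table, prior, 5.0))
mixed_gap = max_gap(posterior(z, mixed_table, prior, 0.1),
                    posterior(z, mixed_table, prior, 5.0))

print(unit_gap, mixed_gap)
```

With equal norms the two posteriors agree up to floating-point error, so the nominal SNR carries no extra information. With unequal norms the same $z$ yields different posteriors at different $\gamma$, and a denoiser would need to be told the SNR.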
4.3 Arbitrary per-token SNR paths
A global diffusion clock is no longer required
Once the denoiser is time-invariant, different token positions do not need to share the same SNR.
DSL can assign every position its own SNR $\gamma_i$:
$$z_i=\gamma_i x_i+\sqrt{\gamma_i}\epsilon_i.$$
Each $\gamma_i$ can follow its own path. Some positions may remain masked, some may become clean, and some may sit in finite-SNR soft-evidence states.
This turns generation into path design.
Path interpretation
Driving one position from $\gamma=0$ to $\gamma=\infty$ while holding others fixed gives a random-order autoregressive reveal step.
Sending a visible token back to $\gamma=0$ before denoising gives masked remasking.
Increasing all $\gamma_i$ jointly gives continuous denoising.
Combining finite-SNR motion with endpoint decoding gives hybrid sampling.
The paper states this path interpretation directly: random-order AR generation, masked refinement, and continuous diffusion sampling are different paths through one shared per-token SNR configuration space.
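The path view can be made concrete by writing each sampler as a sequence of per-token SNR vectors. The following is an illustrative sketch only: the function names and the use of `math.inf` to stand for a fully revealed token are assumptions, and no denoiser is called.

```python
import math

INF = math.inf  # stands for a fully revealed (clean) token

def roar_path(order):
    """Reveal one position per step: each position jumps from 0 to INF."""
    gammas = [0.0] * len(order)
    path = [list(gammas)]
    for i in order:
        gammas[i] = INF
        path.append(list(gammas))
    return path

def remask(gammas, i):
    """Send a visible position back to zero information."""
    out = list(gammas)
    out[i] = 0.0
    return out

def continuous_path(length, schedule):
    """Raise every position jointly along a shared SNR schedule."""
    return [[g] * length for g in schedule]

def hybrid_path(length, schedule, order):
    """Finite-SNR motion first, then endpoint reveals in the given order."""
    path = continuous_path(length, schedule)
    gammas = list(path[-1])
    for i in order:
        gammas[i] = INF
        path.append(list(gammas))
    return path

print(roar_path([2, 0, 1]))
print(hybrid_path(2, [0.5, 2.0], [1, 0]))
```

Each function returns a list of SNR configurations, so comparing samplers reduces to comparing the paths they trace through the same space.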
4.4 Exact NLL views motivate the training support
The training states are not chosen as a sampler heuristic
DSL’s training support comes from two exact likelihood views of the same per-token SNR formalism.
Continuous path-integral NLL
Finite-SNR denoising states arise from an exact path-integral identity:
$$-\log P(x)=\frac{1}{2}\int_C E(x,\gamma)\cdot d\gamma,$$
where
$$E_i(x,\gamma)=\mathbb E_{p_\gamma(z\mid x)}\left[\|x_i-\hat{x}_i(z)\|_2^2\right].$$
Under the Bayes-optimal denoiser, the error field is conservative, so the integral is path-independent. Intermediate continuous-SNR states are therefore not just heuristic interpolants; they are integration variables in an exact likelihood view.
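A one-token special case makes the identity easy to check numerically. Assume a single position with two equiprobable unit-norm "embeddings" $x\in\{+1,-1\}$ in one dimension (a toy chosen for this example, not from the paper). The posterior $p(x\mid z)\propto\exp(zx)$ then gives the Bayes denoiser $\hat{x}(z)=\tanh(z)$, and the path integral should come out close to $-\log P(x)=\log 2$. The grid sizes and the cutoff `gamma_max` are arbitrary numerical choices.

```python
import math

# Quadrature grid for eps ~ N(0, 1); weights are normalized by their sum.
EPS_GRID = [-8.0 + 0.04 * k for k in range(401)]
EPS_W = [math.exp(-0.5 * e * e) for e in EPS_GRID]
EPS_Z = sum(EPS_W)

def error(gamma, x=1.0):
    """E(x, gamma) = E_eps[(x - tanh(gamma*x + sqrt(gamma)*eps))^2]."""
    s = math.sqrt(gamma)
    total = 0.0
    for e, w in zip(EPS_GRID, EPS_W):
        total += w * (x - math.tanh(gamma * x + s * e)) ** 2
    return total / EPS_Z

def nll_by_path_integral(gamma_max=30.0, step=0.02):
    """0.5 * integral of E(x, gamma) over [0, gamma_max], trapezoid rule."""
    n = int(round(gamma_max / step))
    vals = [error(k * step) for k in range(n + 1)]
    return 0.5 * step * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

nll = nll_by_path_integral()
print(nll, math.log(2))
```

The integral is truncated at a finite SNR because the error decays quickly once the token has localized. In the multi-token case the same quantity is integrated along a path $C$ in per-token SNR space.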
Endpoint ROAR NLL
Endpoint mask/reveal states give a second exact view. When some positions are fully revealed and the rest are fully masked, the likelihood reduces to a random-order autoregressive estimator:
$$-\frac{1}{L}\log P(s)=-\mathbb E_{k,A,i}\log P(s_i\mid s_A).$$
This recovers the subset-conditioned form used by masked-diffusion language models, but here it arises from the same DSL path formalism.
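The endpoint identity is the chain rule averaged over reveal orders, and it can be verified exactly on a small joint distribution. In the sketch below the weight table over length-3 binary sequences is hypothetical, and the expectation over $(k, A, i)$ is computed by enumerating every reveal order and every step within it.

```python
import itertools
import math

# Hypothetical joint distribution over length-3 binary sequences.
seqs = list(itertools.product([0, 1], repeat=3))
weights = [5, 1, 2, 4, 3, 1, 2, 6]
P = {s: w / sum(weights) for s, w in zip(seqs, weights)}

def marginal(assignment):
    """P(s_A): total mass of sequences agreeing with {position: value}."""
    return sum(p for s, p in P.items()
               if all(s[i] == v for i, v in assignment.items()))

def cond_logprob(s, i, revealed):
    """log P(s_i | s_A) for a revealed index set A that excludes i."""
    a = {j: s[j] for j in revealed}
    return math.log(marginal({**a, i: s[i]})) - math.log(marginal(a))

def roar_nll_per_token(s):
    """Average of -log P(s_i | s_A) over all reveal orders and steps."""
    total, count = 0.0, 0
    for order in itertools.permutations(range(len(s))):
        for k, i in enumerate(order):
            total -= cond_logprob(s, i, order[:k])
            count += 1
    return total / count

s = (1, 0, 1)
print(roar_nll_per_token(s), -math.log(P[s]) / len(s))
```

Every reveal order sums to the same $-\log P(s)$, so the average over orders matches it as well. A trained model replaces the exact conditionals with $p_\theta(s_i\mid s_A)$.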
Bridge. The continuous path-integral view contributes finite-SNR states. The endpoint ROAR view contributes mask/reveal states. Mixed-SNR training simply trains one posterior estimator on both supports.
4.5 Training recipe
Mixed-SNR support, one cross-entropy objective
DSL trains on a mixture of two state families:
Endpoint states, where tokens are near [MASK] or near clean anchors.
Finite-SNR states, where tokens carry soft posterior evidence.
Both branches use one token-level cross-entropy objective:
$$\mathcal L_{\mathrm{CE}}=-\mathbb E_{(s,z)\sim q_{\mathrm{train}}}\sum_{i=1}^{L}\log p_\theta(s_i\mid z).$$
For the endpoint branch, this is the natural categorical likelihood. For the continuous branch, cross-entropy acts as a posterior-matching surrogate: if the model learns the true token posterior, its posterior mean recovers the Bayes denoiser in embedding space.
Implementation note. In practice, DSL samples a training example from one of two branches. The ROAR branch samples a revealed subset and assigns endpoint SNRs. The continuous branch samples per-token SNRs from a continuous distribution. Both are passed through the same converter and backbone.
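The two-branch sampling described above might look roughly like the following. This is a hedged sketch: the branch probability `p_roar`, the uniform choice of how many positions to reveal, and the log-uniform SNR range are all assumptions for illustration, since the page does not specify them.

```python
import math
import random

INF = math.inf  # endpoint SNR for a fully revealed token

def sample_token_snrs(length, rng, p_roar=0.5, log_snr_range=(-4.0, 4.0)):
    """Sample one per-token SNR vector from the ROAR or continuous branch."""
    if rng.random() < p_roar:
        # ROAR branch: reveal a random subset, mask the rest.
        k = rng.randrange(length)  # number of revealed positions (assumed uniform)
        revealed = set(rng.sample(range(length), k))
        return [INF if i in revealed else 0.0 for i in range(length)]
    # Continuous branch: independent finite SNR per token (assumed log-uniform).
    lo, hi = log_snr_range
    return [math.exp(rng.uniform(lo, hi)) for _ in range(length)]

def cross_entropy(log_probs, targets):
    """Token-level CE: -sum_i log p_theta(s_i | z), from per-position log-probs."""
    return -sum(lp[t] for lp, t in zip(log_probs, targets))

rng = random.Random(0)
draws = [sample_token_snrs(6, rng) for _ in range(200)]

# Example loss for two positions with made-up model log-probabilities.
log_probs = [[math.log(0.5), math.log(0.5)], [math.log(0.25), math.log(0.75)]]
loss = cross_entropy(log_probs, [0, 1])
print(draws[0], loss)
```

Both branches produce the same kind of object, a per-token SNR vector, which is what lets one converter, one backbone, and one loss cover both supports.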
4.6 Architecture
Posterior-view converter
The backbone does not receive raw noisy embeddings directly. Each noisy token is converted into a soft mixture over the token table:
$$q_i^{\mathrm{conv}}(v\mid z_i)=\mathrm{softmax}\left(\beta\langle z_i,x_v\rangle+b_v\right),\qquad m_i^{\mathrm{conv}}=\sum_v q_i^{\mathrm{conv}}(v\mid z_i)x_v.$$
The converter makes every noisy token look like a posterior object:
Near zero SNR, the mixture is broad.
At finite SNR, it carries soft evidence.
At high SNR, it concentrates near a clean token anchor.
The converter scale $\beta$ and bias $b_v$ are important parameters: they control how similarity to the unit-norm token table is turned into a soft token distribution. This also gives the backbone a bounded, token-like input representation.
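The converter formula translates directly into code. The sketch below is a plain-Python reading of the equation above for a single position; the three-token table, $\beta=1$, and zero bias are placeholder values, not the trained parameters.

```python
import math

def converter(z, table, beta, bias):
    """Return (q, m): q = softmax(beta * <z, x_v> + b_v), m = sum_v q[v] * x_v."""
    logits = [beta * sum(a * b for a, b in zip(z, x)) + bv
              for x, bv in zip(table, bias)]
    top = max(logits)  # subtract the max for numerical stability
    w = [math.exp(l - top) for l in logits]
    total = sum(w)
    q = [v / total for v in w]
    m = [sum(qv * x[d] for qv, x in zip(q, table)) for d in range(len(z))]
    return q, m

table = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # hypothetical unit-norm table
bias = [0.0, 0.0, 0.0]
beta = 1.0

q0, m0 = converter([0.0, 0.0], table, beta, bias)   # zero SNR: uniform mixture
q1, m1 = converter([1.0, 0.3], table, beta, bias)   # finite SNR: soft evidence
q2, m2 = converter([30.0, 1.0], table, beta, bias)  # high SNR: near token 0

print(q0, q1, q2)
```

Because $m_i^{\mathrm{conv}}$ is a convex combination of unit-norm embeddings, its norm never exceeds 1, which is the bounded input property mentioned above.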
4.7 Backbone and timestep handling
A standard Transformer/DiT backbone, queried without informative time labels
The sequence of converter outputs is fed to a standard Transformer encoder or DiT-style denoiser. DSL is designed to remain compatible with masked-diffusion backbones, while adding finite-SNR token evidence through the converter and training support.
For compatibility with pretrained masked-diffusion DiTs, the timestep embedding module stays in place, but DSL feeds it the constant $t=0$ rather than an informative timestep. This matches the theory: $z_i$ and $m_i^{\mathrm{conv}}$ already encode the corruption state.
4.8 Inference
Choose a path, not a new model
At inference time, DSL keeps the posterior estimator fixed and changes only the path through SNR space.
Masked refinement
Moves positions between mask and clean endpoint states, with optional remasking for error correction.
ROAR
Reveals one position at a time in a random order.
Hybrid
Starts with continuous soft evidence and then switches to endpoint MDM-style decoding.
All of these paths query the same clean-token posterior estimator.
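As one concrete instance, a ROAR sampler only needs a function that returns the clean-token posterior for a chosen position. In the sketch below, `posterior_fn` is a hypothetical interface and `toy_posterior` is an exact stand-in for a trained denoiser, using the two-bit distribution supported on $(0,0)$ and $(1,1)$ from the Analysis section.

```python
import random

def roar_sample(posterior_fn, length, rng):
    """Reveal positions one at a time in random order.

    posterior_fn(tokens, i) returns a list of probabilities for position i,
    where tokens[j] is None for positions that are still masked.
    """
    tokens = [None] * length
    order = list(range(length))
    rng.shuffle(order)
    for i in order:
        probs = posterior_fn(tokens, i)
        tokens[i] = rng.choices(range(len(probs)), weights=probs)[0]
    return tokens

def toy_posterior(tokens, i):
    """Exact posterior for a two-bit distribution on (0,0) and (1,1)."""
    other = tokens[1 - i]
    if other is None:
        return [0.5, 0.5]  # nothing revealed yet: both bits equally likely
    return [1.0, 0.0] if other == 0 else [0.0, 1.0]

rng = random.Random(0)
samples = [tuple(roar_sample(toy_posterior, 2, rng)) for _ in range(200)]
print(sorted(set(samples)))
```

Swapping the sampling loop for a masked-refinement or hybrid schedule changes only how positions are visited, while `posterior_fn` stays the same.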
Key Findings
We answer four questions. All DSL rows below use the same DSL-finetuned checkpoint. The point is not one isolated sampler, but whether multiple paths can exploit the same denoiser.
Does DSL improve the denoiser before adding strong samplers?
Using simple non-refinement MDM decoders isolates the effect of DSL training itself. Replacing the MDLM checkpoint with a DSL-finetuned checkpoint under the same vanilla MDLM sampler raises MAUVE from 0.042 to 0.495 at T=1024; naive random-order AR reaches 0.551.
Can ReMDM-family samplers exploit the same posterior?
Yes. ReMDM-family remasking improves the same checkpoint further, reaching 0.707 MAUVE at T=512 with confidence-based remasking and 0.722 at T=1024 with ReMDM-loop.
Can continuous and endpoint stages compose?
The hybrid continuous-then-MDM sampler reaches 0.662 MAUVE with 48 total steps and 0.724 with 80 total steps, showing that one checkpoint can compose soft finite-SNR evidence and endpoint decoding.
Does DSL remain competitive on likelihood?
On Text8, DSL reports ≤1.45 BPC, improving over the continuous Plaid baseline (≤1.48) and approaching the masked-discrete-diffusion range (≤1.37 to ≤1.40). The continuous-diffusion figures are ELBO-based upper bounds on BPC, so this comparison should be read cautiously.
Results
OWT unconditional generation · MAUVE ↑
| Method | T=128 | T=256 | T=512 | T=1024 |
|---|---|---|---|---|
| MDLM | 0.015 | 0.023 | 0.031 | 0.042 |
| ReMDM | 0.057 | 0.216 | 0.350 | 0.403 |
| ADLM | 0.140 | 0.349 | 0.573 | 0.699 |
| DSL-FT + ROAR (naive) | — | — | — | 0.551 |
| DSL-FT + MDLM | 0.402 | 0.481 | 0.506 | 0.495 |
| DSL-FT + ReMDM (confidence) | 0.615 | 0.622 | 0.707 | 0.610 |
| DSL-FT + ReMDM-loop | 0.639 | 0.661 | 0.651 | 0.722 |
DSL-FT improves OWT MAUVE across step budgets, with ReMDM-loop strongest at T=1024.
Hybrid continuous-then-MDM sampler · MAUVE ↑
| Tcont | TMDM | NFE | MAUVE |
|---|---|---|---|
| 16 | 16 | 32 | 0.501 |
| 16 | 32 | 48 | 0.662 |
| 16 | 48 | 64 | 0.702 |
| 32 | 16 | 48 | 0.587 |
| 32 | 32 | 64 | 0.589 |
| 32 | 48 | 80 | 0.724 |
The hybrid sampler composes soft finite-SNR evidence with endpoint commitment in few steps.
Text8 likelihood · BPC ↓
| Category | Method | BPC |
|---|---|---|
| Continuous diffusion | Plaid | ≤1.48 |
| Continuous diffusion | DSL | ≤1.45 |
| Discrete diffusion | MDLM | ≤1.40 |
| Discrete diffusion | MD4 | ≤1.37 |
| Autoregressive | Transformer AR | 1.23 |
DSL improves on Plaid, the other continuous-diffusion entry (≤1.45 vs. ≤1.48 BPC), while remaining behind the discrete-diffusion and autoregressive baselines.
Analysis
The analysis focuses on the geometry induced by unit-norm tokens and on why the resulting denoiser is robust: the 2D toy shows that even when a noisy draft is locally inconsistent, the posterior field can still pull it back toward the correct clean direction.
Direction = token identity; scale controls concentration
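For a fixed token position, DSL induces the posterior $p(x\mid z)\propto p(x)\exp(z\cdot x)=p(x)\exp(\|z\|\cos\theta_x)$ over unit-norm token embeddings, where $\theta_x$ is the angle between $z$ and $x$.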
The direction of the noisy token selects a token identity region. The radial scale increases posterior concentration; the converter temperature $\beta$ and bias $b_v$ also modulate the final concentration seen by the backbone.
Robust posterior correction toy
In a two-bit toy distribution supported only on $(0,0)$ and $(1,1)$, the posterior decision boundary depends on the joint statistic $z_1+z_2$. A locally inconsistent noisy-token pair can still be pulled toward the valid solution if it stays on the correct side of that posterior boundary.
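The toy can be reproduced in a few lines. The sketch assumes 1D unit-norm embeddings, bit 0 as $-1$ and bit 1 as $+1$, and a uniform prior over the two valid sequences; these choices are illustrative and may differ from the setup behind the figure.

```python
import math

# Assumed 1D unit-norm embeddings: bit 0 -> -1.0, bit 1 -> +1.0.
SUPPORT = {(0, 0): (-1.0, -1.0), (1, 1): (1.0, 1.0)}

def joint_posterior(z):
    """p(s | z) proportional to exp(z . x(s)), uniform prior on the support."""
    logits = {s: z[0] * x[0] + z[1] * x[1] for s, x in SUPPORT.items()}
    top = max(logits.values())
    w = {s: math.exp(l - top) for s, l in logits.items()}
    total = sum(w.values())
    return {s: v / total for s, v in w.items()}

# Locally inconsistent draft: position 1 leans to bit 1, position 2 to bit 0.
z = (1.5, -0.4)
post = joint_posterior(z)

# Same sum z_1 + z_2, very different individual coordinates.
post_shifted = joint_posterior((0.1, 1.0))

print(post[(1, 1)], post_shifted[(1, 1)])
```

Here the posterior for $(1,1)$ reduces to a sigmoid of $2(z_1+z_2)$, so the second coordinate's wrong sign is outvoted as long as the sum stays positive.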
Takeaways
Five things to remember.
- Generation is path design. Random-order AR, masked diffusion, and continuous diffusion differ in how information enters token positions.
- DSL defines a continuous state space for discrete tokens. [MASK], soft token evidence, and hard token anchors live in one geometry.
- The math trick is posterior-coordinate localization. Stochastic localization plus unit-norm token geometry makes the noisy token sufficient for the clean-token posterior.
- One denoiser can serve many paths. The same checkpoint supports masked diffusion, ReMDM-family remasking, random-order AR, and hybrid sampling.
- Geometry makes uncertainty representable. Direction carries token identity; scale and converter parameters control posterior concentration.
BibTeX
@article{wu2026dsl,
  title={Discrete Stochastic Localization for Non-autoregressive Generation},
  author={Wu, Yunshu and Cheng, Jiayi and Yu, Longxuan and Thakuria, Partha and Brekelmans, Rob and Papalexakis, Evangelos E. and Ver Steeg, Greg},
  journal={arXiv preprint arXiv:2602.16169},
  year={2026}
}