Discrete Stochastic Localization

Abstract

Continuous diffusion is a natural framework for non-autoregressive generation but has generally lagged behind masked discrete diffusion models (MDMs) on discrete sequence generation. We argue that the bottleneck is not continuity itself, but a representation in which denoising depends on timestep-indexed noise regimes. We introduce Discrete Stochastic Localization (DSL), a continuous-state framework with unit-sphere token embeddings whose Bayes-optimal denoiser is invariant to the nominal signal-to-noise ratio (SNR) under the localization channel. One trained network then supports an entire family of per-token SNR paths, with endpoint masked-diffusion paths as a special case. Fine-tuning a pretrained MDLM checkpoint with DSL substantially improves distributional faithfulness (MAUVE) on OpenWebText across all step budgets from T=128 to T=1024, and the same checkpoint supports random-order autoregressive sampling, as well as a hybrid continuous-then-discrete sampler using as few as T=48 total steps—without distillation or retraining.

Noisy tokens localize from the mask center toward token anchors.

State Space Intuition

From [MASK] to soft token evidence to hard tokens. DSL makes token states continuous: zero information decodes as [MASK], finite SNR carries soft evidence, and high SNR localizes near hard token anchors.

Zero information

At zero SNR, the noisy token carries no token identity and is decoded as [MASK].

Soft evidence

At finite SNR, the vector points toward a region of the token embedding table rather than a single token.

Localization

As SNR increases, posterior mass concentrates near one token anchor and the state becomes nearly clean.

Per-token SNR

Different positions can occupy different confidence levels, so sampling becomes path design in SNR space.

DSL state space illustration with token anchors A B C D E

Noisy tokens move from the [MASK] center through soft evidence before localizing near token anchors.

Unified Perspective

Sampling algorithms are different paths in per-token SNR space. Once DSL defines a per-token SNR state space, a sampling algorithm becomes a path through that space. Different algorithms simply choose different ways to move token positions from zero information to clean tokens.

Token-level generation order:

Each row shows how information enters token positions. AR and masked diffusion jump to endpoint tokens; continuous diffusion gradually fills all positions; DSL allows different finite-SNR levels per token.

Rows, each generating the example sentence "I am a cat": Autoregressive; Masked diffusion; Continuous dLLM; DSL.

Paths in $(\gamma_1,\gamma_2)$

Method	State support	Information path	What DSL changes
AR	clean prefix + unknown future	one token moves at a time in sequential order	endpoint reveal becomes one query path
Random-order AR	clean subset + unknown rest	reveal coordinates in arbitrary order	supported without a sampler-specific model
Masked dLLM	[MASK] / clean endpoints	reveal subsets; ReMDM can reopen tokens	endpoint paths use the same posterior estimator
Continuous diffusion	finite-SNR noisy tokens everywhere	increase SNR jointly across all positions in the sentence	finite-SNR states become part of the same token-posterior space
DSL	per-token SNR noisy tokens	arbitrary path through a shared posterior field	one time-invariant denoiser serves many paths

DSL Method

Training workflow

Stage 0

Unit-norm embedding

Every token embedding lives on the unit hypersphere. This equal-norm constraint is the key structural ingredient.

Stage 1

Stochastic localization

Sample noisy tokens with $z_t=t x_0+\sqrt{t}\epsilon$, where $t$ is the SNR (written $\gamma$ in the technical details). Higher SNR means more token information.

Stage 2

Converter

Convert each noisy token into a soft combination of token embeddings using similarities to the embedding table.

Stage 3

Denoiser

Train a Diffusion Transformer (DiT) denoiser $p_\theta(s_i\mid z)$.

4.1 Three-step overview

Unit-norm token table

Map every token $v$ to an embedding $x_v$ on the unit sphere:

$$\|x_v\|_2=1.$$

This is not only a visualization choice. It is the geometric constraint that makes the posterior simplify under the localization channel.

When all clean token embeddings have the same norm, the nominal-SNR penalty term becomes constant across token candidates and cancels out of the posterior.

Stochastic localization

Given a clean token embedding $x_i$, DSL corrupts it through the localization channel

$$z_i=\gamma_i x_i+\sqrt{\gamma_i}\epsilon_i,\qquad \epsilon_i\sim\mathcal N(0,I).$$

The SNR is $\gamma_i$. At $\gamma_i=0$, the token carries no information and behaves like [MASK]. At finite SNR, it carries soft token evidence. At high SNR, it localizes near a clean token anchor.

This gives continuous token states a discrete posterior interpretation.
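The three regimes of the localization channel can be reproduced numerically. The following is a minimal sketch using only the Python standard library; the 4-dimensional embedding `x`, the helper names `localize` and `cosine`, and the specific SNR values are illustrative assumptions, not settings from the paper.

```python
import math
import random

def localize(x, gamma, rng):
    """Sample z = gamma * x + sqrt(gamma) * eps, with eps ~ N(0, I)."""
    scale = math.sqrt(gamma)
    return [gamma * xi + scale * rng.gauss(0.0, 1.0) for xi in x]

def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

rng = random.Random(0)
x = [1.0, 0.0, 0.0, 0.0]           # hypothetical unit-norm token embedding
z_mask = localize(x, 0.0, rng)     # gamma = 0: the zero vector, no token identity
z_soft = localize(x, 1.0, rng)     # finite SNR: signal and noise are comparable
z_sharp = localize(x, 1e4, rng)    # high SNR: direction is almost exactly x

print(z_mask, cosine(z_sharp, x))
```

At $\gamma_i=0$ the sample is exactly the zero vector, which is why that state can play the role of [MASK]; at very large $\gamma_i$ the signal term grows like $\gamma_i$ while the noise grows only like $\sqrt{\gamma_i}$, so the direction of $z_i$ aligns with the anchor.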

Time-invariant denoising

DSL trains a denoiser to estimate the clean-token posterior from the noisy state $z$.

Because the noisy token is represented in posterior coordinates, the Bayes-optimal denoiser depends on the noisy token state, not on the nominal SNR that produced it.

This is the structural reason one model can support masked reveal paths, random-order AR paths, continuous denoising paths, and hybrid paths.

4.2 Why the noisy token is enough

The cancellation that makes DSL work

Under the localization channel, the clean-token posterior has the form

$$p(x\mid z,\gamma)\propto p(x)\exp\left(z\cdot x-\frac{\gamma}{2}\|x\|^2\right).$$

The only term that explicitly contains the nominal SNR $\gamma$ is

$$-\frac{\gamma}{2}\|x\|^2.$$

If all token embeddings lie on the unit sphere, then $\|x\|^2$ is constant across candidate tokens. After posterior normalization, this term cancels:

$$p(x\mid z)\propto p(x)\exp(z\cdot x).$$

So the Bayes-optimal posterior mean becomes

$$\hat{x}(z)=\frac{\mathbb E_{P(x)}[x\exp(x\cdot z)]}{\mathbb E_{P(x)}[\exp(x\cdot z)]}.$$

The denoiser no longer needs to know which nominal SNR generated $z$. It only needs the noisy token itself.

DSL needs both ingredients: stochastic localization makes $z$ a natural coordinate of the posterior, and unit-norm geometry makes the SNR-dependent norm term constant. Neither ingredient alone is enough.
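The cancellation can be checked on a toy vocabulary. The sketch below evaluates the posterior with and without the $-\frac{\gamma}{2}\|x\|^2$ term, once for a unit-norm table and once for a table with unequal norms. The tables, the prior, and the value $\gamma=5$ are made-up illustrations, and `posterior` and `posterior_mean` are hypothetical helper names.

```python
import math

def posterior(z, table, prior, gamma=None):
    """p(v|z) proportional to prior_v * exp(<z, x_v> - gamma/2 * ||x_v||^2).
    Passing gamma=None drops the norm term entirely."""
    logits = []
    for x_v, p_v in zip(table, prior):
        dot = sum(zi * xi for zi, xi in zip(z, x_v))
        norm_sq = sum(xi * xi for xi in x_v)
        penalty = 0.0 if gamma is None else 0.5 * gamma * norm_sq
        logits.append(math.log(p_v) + dot - penalty)
    top = max(logits)                       # subtract max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def posterior_mean(z, table, prior):
    """Bayes-optimal denoiser x_hat(z) for a unit-norm table (no gamma needed)."""
    probs = posterior(z, table, prior)
    dim = len(table[0])
    return [sum(p * x_v[d] for p, x_v in zip(probs, table)) for d in range(dim)]

def max_gap(p, q):
    return max(abs(a - b) for a, b in zip(p, q))

unit_table = [[math.cos(a), math.sin(a)] for a in (0.0, 2.0, 4.0)]  # all norms 1
skewed_table = [[1.0, 0.0], [0.0, 2.0], [-0.5, 0.0]]                # unequal norms
prior = [0.5, 0.3, 0.2]
z = [0.7, 0.2]

gap_unit = max_gap(posterior(z, unit_table, prior, gamma=5.0),
                   posterior(z, unit_table, prior))
gap_skewed = max_gap(posterior(z, skewed_table, prior, gamma=5.0),
                     posterior(z, skewed_table, prior))
print(gap_unit, gap_skewed, posterior_mean(z, unit_table, prior))
```

For the unit-norm table the two posteriors agree up to floating-point error, so the nominal SNR is irrelevant once $z$ is known. For the skewed table the norm term changes the answer, which is the case the unit-sphere constraint rules out.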

4.3 Arbitrary per-token SNR paths

A global diffusion clock is no longer required

Once the denoiser is time-invariant, different token positions do not need to share the same SNR.

DSL can assign every position its own SNR $\gamma_i$:

$$z_i=\gamma_i x_i+\sqrt{\gamma_i}\epsilon_i.$$

Each $\gamma_i$ can follow its own path. Some positions may remain masked, some may become clean, and some may sit in finite-SNR soft-evidence states.

This turns generation into path design.

Path interpretation

Driving one position from $\gamma=0$ to $\gamma=\infty$ while holding others fixed gives a random-order autoregressive reveal step.

Sending a visible token back to $\gamma=0$ before denoising gives masked remasking.

Increasing all $\gamma_i$ jointly gives continuous denoising.

Combining finite-SNR motion with endpoint decoding gives hybrid sampling.

The paper states this path interpretation directly: random-order AR generation, masked refinement, and continuous diffusion sampling are different paths through one shared per-token SNR configuration space.
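One way to make the path view concrete is to represent a sampler state as a list of per-token SNRs and a sampler as a sequence of such lists. The sketch below is an illustration only: the function names are hypothetical, `math.inf` stands in for a fully revealed token, and no denoiser is called.

```python
import math
import random

def roar_path(length, rng):
    """Random-order AR: one position per step jumps from gamma=0 to gamma=inf."""
    gammas = [0.0] * length
    order = list(range(length))
    rng.shuffle(order)
    path = [list(gammas)]
    for i in order:
        gammas[i] = math.inf
        path.append(list(gammas))
    return path

def continuous_path(length, levels):
    """Continuous denoising: all positions share each SNR level in turn."""
    return [[g] * length for g in levels]

def remask(gammas, positions):
    """Masked remasking: send the chosen positions back to gamma=0."""
    return [0.0 if i in positions else g for i, g in enumerate(gammas)]

rng = random.Random(1)
for state in roar_path(4, rng):
    print(state)
print(continuous_path(4, [0.1, 1.0, 10.0]))
print(remask([math.inf, math.inf, 0.0, math.inf], {1}))
```

Under this representation the sampler families differ only in which sequences of SNR vectors they visit; the posterior estimator that is queried at each state stays the same.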

4.4 Exact NLL views motivate the training support

The training states are not chosen as a sampler heuristic

DSL’s training support comes from two exact likelihood views of the same per-token SNR formalism.

Continuous path-integral NLL

Finite-SNR denoising states arise from an exact path-integral identity:

$$-\log P(x)=\frac{1}{2}\int_C E(x,\gamma)\cdot d\gamma,$$

where

$$E_i(x,\gamma)=\mathbb E_{p_\gamma(z\mid x)}\left[\|x_i-\hat{x}_i(z)\|_2^2\right].$$

Under the Bayes-optimal denoiser, the error field is conservative, so the integral is path-independent. Intermediate continuous-SNR states are therefore not just heuristic interpolants; they are integration variables in an exact likelihood view.
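For a single token the path $C$ is just the $\gamma$ axis, and the identity can be checked by numerical quadrature. The sketch below assumes a one-dimensional toy with two tokens embedded at $x\in\{-1,+1\}$ (both unit norm) and a made-up prior $P(+1)=0.8$; in that setting the Bayes denoiser is $\hat{x}(z)=\tanh(z+a)$ with $a=\tfrac12\log\frac{P(+1)}{P(-1)}$. Grid sizes and the truncation at $\gamma=40$ are arbitrary choices.

```python
import math

def expected_sq_error(x, gamma, p_plus, n_eps=321, eps_max=8.0):
    """E_eps[(x - x_hat(z))^2] for z = gamma*x + sqrt(gamma)*eps (trapezoid rule)."""
    a = 0.5 * math.log(p_plus / (1.0 - p_plus))   # half the prior log-odds
    h = 2.0 * eps_max / (n_eps - 1)
    root = math.sqrt(gamma)
    total = 0.0
    for k in range(n_eps):
        eps = -eps_max + k * h
        x_hat = math.tanh(gamma * x + root * eps + a)   # Bayes posterior mean
        density = math.exp(-0.5 * eps * eps) / math.sqrt(2.0 * math.pi)
        weight = 0.5 if k in (0, n_eps - 1) else 1.0
        total += weight * density * (x - x_hat) ** 2 * h
    return total

def path_integral_nll(x, p_plus, gamma_max=40.0, n_gamma=801):
    """(1/2) * integral over gamma of the expected squared denoising error."""
    h = gamma_max / (n_gamma - 1)
    total = 0.0
    for k in range(n_gamma):
        weight = 0.5 if k in (0, n_gamma - 1) else 1.0
        total += weight * expected_sq_error(x, k * h, p_plus) * h
    return 0.5 * total

p_plus = 0.8
nll_plus = path_integral_nll(+1.0, p_plus)
nll_minus = path_integral_nll(-1.0, p_plus)
print(nll_plus, -math.log(p_plus))          # both should be close to 0.223
print(nll_minus, -math.log(1.0 - p_plus))   # both should be close to 1.609
```

The rarer token accumulates more denoising error along the path, and the integrated error matches its larger negative log-probability.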

Endpoint ROAR NLL

Endpoint mask/reveal states give a second exact view. When some positions are fully revealed and the rest are fully masked, the likelihood reduces to a random-order autoregressive estimator:

$$-\frac{1}{L}\log P(s)=-\mathbb E_{k,A,i}\log P(s_i\mid s_A).$$

This recovers the subset-conditioned form used by masked-diffusion language models, but here it arises from the same DSL path formalism.
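The endpoint identity follows from the chain rule applied along every reveal order, and it can be verified by exact enumeration on a tiny example. The joint distribution over three binary tokens below is invented for illustration, and `marginal` and `roar_nll` are hypothetical helper names.

```python
import itertools
import math

def marginal(joint, assignment):
    """P(s_A): total mass of sequences that agree with `assignment` (pos -> value)."""
    return sum(p for s, p in joint.items()
               if all(s[i] == v for i, v in assignment.items()))

def roar_nll(joint, s):
    """-E_{k,A,i} log P(s_i | s_A), enumerated exactly over all reveal orders.

    A uniformly random order with a uniformly random step k induces a uniform
    revealed subset A of size k and a uniform next position i outside A."""
    length = len(s)
    orders = list(itertools.permutations(range(length)))
    total = 0.0
    for order in orders:
        revealed = {}
        for i in order:
            before = marginal(joint, revealed)
            revealed[i] = s[i]
            after = marginal(joint, revealed)
            total += -math.log(after / before)    # -log P(s_i | s_A)
    return total / (len(orders) * length)

# Hypothetical joint over 3 binary tokens with weights 1..8 (normalized by 36).
sequences = list(itertools.product([0, 1], repeat=3))
joint = {s: (rank + 1) / 36.0 for rank, s in enumerate(sequences)}

s = (1, 0, 1)
lhs = -math.log(joint[s]) / len(s)
rhs = roar_nll(joint, s)
print(lhs, rhs)
```

Each individual order already telescopes to $-\log P(s)$, so averaging over orders and dividing by $L$ gives the per-token form exactly.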

Bridge. The continuous path-integral view contributes finite-SNR states. The endpoint ROAR view contributes mask/reveal states. Mixed-SNR training simply trains one posterior estimator on both supports.

4.5 Training recipe

Mixed-SNR support, one cross-entropy objective

DSL trains on a mixture of two state families:

Endpoint states, where tokens are near [MASK] or near clean anchors.

Finite-SNR states, where tokens carry soft posterior evidence.

Both branches use one token-level cross-entropy objective:

$$\mathcal L_{\mathrm{CE}}=-\mathbb E_{(s,z)\sim q_{\mathrm{train}}}\sum_{i=1}^{L}\log p_\theta(s_i\mid z).$$

For the endpoint branch, this is the natural categorical likelihood. For the continuous branch, cross-entropy acts as a posterior-matching surrogate: if the model learns the true token posterior, its posterior mean recovers the Bayes denoiser in embedding space.

Implementation note. In practice, DSL samples a training example from one of two branches. The ROAR branch samples a revealed subset and assigns endpoint SNRs. The continuous branch samples per-token SNRs from a continuous distribution. Both are passed through the same converter and backbone.
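The two-branch sampling described in the implementation note might look roughly like the sketch below. The branch probability `p_roar`, the log-uniform SNR range, and the use of `math.inf` to mark a revealed token are all assumptions made for illustration; the page does not specify these details.

```python
import math
import random

def sample_training_snrs(length, rng, p_roar=0.5, log_snr_range=(-4.0, 4.0)):
    """Return (branch, per-token SNRs) for one training example.

    ROAR branch: a random revealed subset gets gamma=inf, the rest gamma=0.
    Continuous branch: every position draws its own finite SNR (log-uniform here)."""
    if rng.random() < p_roar:
        k = rng.randrange(length)                      # 0 .. length-1 revealed tokens
        revealed = set(rng.sample(range(length), k))
        return "roar", [math.inf if i in revealed else 0.0 for i in range(length)]
    low, high = log_snr_range
    return "continuous", [math.exp(rng.uniform(low, high)) for _ in range(length)]

def token_cross_entropy(probs, targets):
    """Sum over positions of -log p_theta(s_i | z), shared by both branches."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets))

rng = random.Random(0)
for _ in range(3):
    print(sample_training_snrs(6, rng))

# Placeholder model outputs for a 2-token sequence over a 2-symbol vocabulary.
print(token_cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1]))
```

Both branches produce the same kind of object, a vector of per-token SNRs, so the same converter, backbone, and loss apply without any branch-specific head.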

4.6 Architecture

Posterior-view converter

The backbone does not receive raw noisy embeddings directly. Each noisy token is converted into a soft mixture over the token table:

$$q_i^{\mathrm{conv}}(v\mid z_i)=\mathrm{softmax}\left(\beta\langle z_i,x_v\rangle+b_v\right),\qquad m_i^{\mathrm{conv}}=\sum_v q_i^{\mathrm{conv}}(v\mid z_i)x_v.$$

The converter makes every noisy token look like a posterior object:

Near zero SNR, the mixture is broad.

At finite SNR, it carries soft evidence.

At high SNR, it concentrates near a clean token anchor.

The converter scale $\beta$ and bias $b_v$ are important parameters: they control how similarity to the unit-norm token table is turned into a soft token distribution. This also gives the backbone a bounded, token-like input representation.
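The converter formula can be written out directly. This is a minimal sketch with a hypothetical four-token table on the unit circle, $\beta=1$, and zero biases; the actual table size and the values of $\beta$ and $b_v$ are not given here.

```python
import math

def converter(z, table, beta=1.0, bias=None):
    """q(v|z) = softmax(beta * <z, x_v> + b_v) and the mixture m = sum_v q(v|z) x_v."""
    if bias is None:
        bias = [0.0] * len(table)
    logits = [beta * sum(zi * xi for zi, xi in zip(z, x_v)) + b
              for x_v, b in zip(table, bias)]
    top = max(logits)                                  # stabilize the softmax
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    q = [w / total for w in weights]
    dim = len(table[0])
    mix = [sum(q_v * x_v[d] for q_v, x_v in zip(q, table)) for d in range(dim)]
    return q, mix

table = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # unit-norm anchors

q_mask, m_mask = converter([0.0, 0.0], table)      # zero SNR: broad mixture
q_soft, m_soft = converter([0.3, 1.2], table)      # finite SNR: soft evidence
q_sharp, m_sharp = converter([0.0, 20.0], table)   # high SNR: near one anchor
print(q_mask, q_soft, q_sharp)
```

Because the mixture is a convex combination of unit vectors, its norm never exceeds 1 regardless of how large $\|z_i\|$ is, which is the bounded, token-like input property noted above.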

4.7 Backbone and timestep handling

A standard Transformer/DiT backbone, queried without informative time labels

The sequence of converter outputs is fed to a standard Transformer encoder or DiT-style denoiser. DSL is designed to remain compatible with masked-diffusion backbones, while adding finite-SNR token evidence through the converter and training support.

For compatibility with pretrained masked-diffusion DiTs, the timestep embedding module can remain in place, but DSL feeds a constant value rather than an informative timestep. This matches the theory: the noisy token state already encodes the relevant corruption regime.

The paper makes this precise: DSL keeps the timestep module for compatibility but feeds $t=0$, since $z_i$ and $m_i^{\mathrm{conv}}$ already encode the corruption state.

4.8 Inference

Choose a path, not a new model

At inference time, DSL keeps the posterior estimator fixed and changes only the path through SNR space.

Masked refinement

I am a cat

Moves positions between mask and clean endpoint states, with optional remasking for error correction.

ROAR

I am a cat

Reveals one position at a time in a random order.

Hybrid

I am a cat

Starts with continuous soft evidence and then switches to endpoint MDM-style decoding.

All of these paths query the same clean-token posterior estimator.

Key Findings

We answer four questions. All DSL rows below use the same DSL-finetuned checkpoint. The point is not one isolated sampler, but whether multiple paths can exploit the same denoiser.

RQ1 · Training

Does DSL improve the denoiser before adding strong samplers?

Using simple non-refinement MDM decoders isolates the effect of DSL training itself. Replacing the MDLM checkpoint with a DSL-finetuned checkpoint under the same vanilla MDLM sampler raises MAUVE from 0.042 to 0.495 at T=1024; naive random-order AR reaches 0.551.

RQ2 · Remasking

Can ReMDM-family samplers exploit the same posterior?

Yes. ReMDM-family remasking improves the same checkpoint further, reaching 0.707 MAUVE at T=512 with confidence-based remasking and 0.722 at T=1024 with ReMDM-loop.

RQ3 · Hybrid

Can continuous and endpoint stages compose?

The hybrid continuous-then-MDM sampler reaches 0.662 MAUVE with 48 total steps and 0.724 with 80 total steps, showing that one checkpoint can compose soft finite-SNR evidence and endpoint decoding.

RQ4 · NLL

Does DSL remain competitive on likelihood?

On Text8, DSL reports ≤1.45 BPC, improving over the continuous Plaid baseline and approaching the masked-discrete-diffusion range. Continuous estimates are ELBO upper bounds, so this comparison should be read cautiously.

Results

OWT unconditional generation · MAUVE ↑

Method	T=128	T=256	T=512	T=1024
MDLM	0.015	0.023	0.031	0.042
ReMDM	0.057	0.216	0.350	0.403
ADLM	0.140	0.349	0.573	0.699
DSL-FT + ROAR (naive)	—	—	—	0.551
DSL-FT + MDLM	0.402	0.481	0.506	0.495
DSL-FT + ReMDM (confidence)	0.615	0.622	0.707	0.610
DSL-FT + ReMDM-loop	0.639	0.661	0.651	0.722

DSL-FT improves OWT MAUVE across step budgets, with ReMDM-loop strongest at T=1024.

Hybrid continuous-then-MDM sampler · MAUVE ↑

$T_{\mathrm{cont}}$	$T_{\mathrm{MDM}}$	NFE	MAUVE
16	16	32	0.501
16	32	48	0.662
16	48	64	0.702
32	16	48	0.587
32	32	64	0.589
32	48	80	0.724

The hybrid sampler composes soft finite-SNR evidence with endpoint commitment in few steps.

Text8 likelihood · BPC ↓

Category	Method	BPC
Continuous diffusion	Plaid	≤1.48
Continuous diffusion	DSL	≤1.45
Discrete diffusion	MDLM	≤1.40
Discrete diffusion	MD4	≤1.37
Autoregressive	Transformer AR	1.23

Among continuous-diffusion methods, DSL improves Text8 BPC over Plaid from ≤1.48 to ≤1.45.

Analysis

The analysis focuses on the geometry induced by unit-norm tokens and on why that geometry makes the denoiser robust: the 2D toy shows that even when a noisy draft is locally inconsistent, the posterior field can still pull it back toward the correct clean direction.

Direction = token identity; scale controls concentration

For a fixed token position, DSL induces

$$p(v\mid z)\propto \pi_v\exp(\langle z,x_v\rangle)=\pi_v\exp(\|z\|\cos\theta_v).$$

The direction of the noisy token selects a token identity region. The radial scale increases posterior concentration; the converter temperature $\beta$ and bias $b_v$ also modulate the final concentration seen by the backbone.
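The split between direction and scale can be seen by holding the direction of $z$ fixed and growing its norm. The sketch below uses a hypothetical four-token table on the unit circle with a uniform prior $\pi_v$, and takes the converter parameters out of the picture ($\beta=1$, $b_v=0$).

```python
import math

def token_posterior(z, table, prior):
    """p(v|z) proportional to pi_v * exp(<z, x_v>)."""
    logits = [math.log(p) + sum(zi * xi for zi, xi in zip(z, x_v))
              for x_v, p in zip(table, prior)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy in nats; lower means more concentrated."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

angles = (0.0, 1.0, 2.5, 4.0)
table = [[math.cos(a), math.sin(a)] for a in angles]   # unit-norm anchors
prior = [0.25] * 4
direction = [math.cos(0.9), math.sin(0.9)]             # closest to the anchor at 1.0

entropies = []
for scale in (0.0, 1.0, 4.0, 16.0):
    z = [scale * d for d in direction]
    probs = token_posterior(z, table, prior)
    entropies.append(entropy(probs))
    print(scale, [round(p, 3) for p in probs])

# <z, x_v> equals ||z|| * cos(theta_v), since both x_v and `direction` are unit norm.
z = [4.0 * d for d in direction]
inner = sum(zi * xi for zi, xi in zip(z, table[1]))
via_angle = 4.0 * math.cos(1.0 - 0.9)
```

The most likely token is the same at every nonzero scale, because it is set by the direction alone, while the entropy falls as the norm grows.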

Interactive posterior concentration

Move the slider: larger effective $\beta\|z\|$ shrinks probability mass toward the token anchor pointed to by $z$.

Robust posterior correction toy

In a two-bit toy distribution supported only on $(0,0)$ and $(1,1)$, the posterior decision boundary depends on the joint statistic $z_1+z_2$. A locally inconsistent noisy-token pair can still be pulled toward the valid solution if it stays on the correct side of that posterior boundary.
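A small numerical version of this toy, assuming each bit is embedded in one dimension as $0\mapsto-1$ and $1\mapsto+1$ (both unit norm) and the two valid sequences are equally likely. Under those assumptions the log-odds between $(1,1)$ and $(0,0)$ equal $2(z_1+z_2)$. The noisy pair below is an invented example.

```python
import math

EMBED = {0: -1.0, 1: 1.0}          # assumed 1-D unit-norm embeddings for the two bits
SUPPORT = [(0, 0), (1, 1)]         # the only valid two-bit sequences

def joint_posterior(z, prior=(0.5, 0.5)):
    """Posterior over SUPPORT given per-position noisy states z = (z_1, z_2)."""
    logits = [math.log(p) + sum(zi * EMBED[b] for zi, b in zip(z, seq))
              for seq, p in zip(SUPPORT, prior)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def local_guess(z_i):
    """Per-position decision that ignores the other position."""
    return 1 if z_i > 0 else 0

z = (-0.5, 2.0)                                  # locally inconsistent noisy pair
local = tuple(local_guess(z_i) for z_i in z)     # (0, 1): not in SUPPORT
p00, p11 = joint_posterior(z)
log_odds = math.log(p11 / p00)

print(local, p00, p11, log_odds, 2 * (z[0] + z[1]))
```

Read position by position, the draft decodes to $(0,1)$, which has zero probability. The joint posterior only looks at $z_1+z_2=1.5>0$ and assigns about 95% of its mass to $(1,1)$.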

More analysis in the upcoming blog.

Takeaways

Five things to remember.

Generation is path design. Random-order AR, masked diffusion, and continuous diffusion differ in how information enters token positions.
DSL defines a continuous state space for discrete tokens. [MASK], soft token evidence, and hard token anchors live in one geometry.
The math trick is posterior-coordinate localization. Stochastic localization plus unit-norm token geometry makes the noisy token sufficient for the clean-token posterior.
One denoiser can serve many paths. The same checkpoint supports masked diffusion, ReMDM-family remasking, random-order AR, and hybrid sampling.
Geometry makes uncertainty representable. Direction carries token identity; scale and converter parameters control posterior concentration.

BibTeX

@article{wu2026dsl,
  title={Discrete Stochastic Localization for Non-autoregressive Generation},
  author={Wu, Yunshu and Cheng, Jiayi and Yu, Longxuan and Thakuria, Partha and Brekelmans, Rob and Papalexakis, Evangelos E. and Ver Steeg, Greg},
  journal={arXiv preprint arXiv:2602.16169},
  year={2026}
}