RefDecoder: Conditional Video Decoding

RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

Sophie Weber

May 14, 2026

Reporting by Xiang Fan, SwissFinanceAI editorial team

Video generation powers a vast array of downstream applications. However, while latent diffusion models, the de facto standard, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry leads to significant loss of detail and inconsistency relative to the input image. To address this, we argue that the decoder requires equal conditioning to preserve structural integrity.

We introduce RefDecoder, a reference-conditioned video VAE decoder that injects a high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into detail-rich, high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage.

We demonstrate consistent improvements across several distinct decoder backbones (e.g., Wan 2.1 and VideoVAE+), achieving up to +2.1 dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks. Notably, RefDecoder can be directly swapped into existing video generation systems without additional fine-tuning, and we report across-the-board improvements in subject consistency, background consistency, and overall quality scores on the VBench I2V (image-to-video) benchmark. Beyond I2V, RefDecoder generalizes well to a wide range of visual generation tasks such as style transfer and video editing refinement.
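
As a rough illustration of the reference attention idea described above, the following minimal Python sketch lets each video latent token attend over the video tokens together with the reference tokens. It is a toy under stated assumptions, not the RefDecoder implementation: the abstract does not specify the attention layout, so joint attention over concatenated tokens is assumed here, and the function name `reference_attention`, the single head, and the identity query/key/value projections are all hypothetical simplifications.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def reference_attention(video_tokens, ref_tokens):
    """Each video token attends over video tokens plus reference tokens.

    Assumption (not from the paper): single head, no learned projections,
    keys and values are the plain concatenation of both token sets.
    """
    dim = len(video_tokens[0])
    context = video_tokens + ref_tokens  # keys/values: video + reference
    scale = 1.0 / math.sqrt(dim)         # standard scaled dot-product factor
    outputs = []
    for query in video_tokens:
        weights = softmax([dot(query, key) * scale for key in context])
        # Weighted sum of the context vectors, one feature at a time.
        mixed = [
            sum(w * value[d] for w, value in zip(weights, context))
            for d in range(dim)
        ]
        outputs.append(mixed)
    return outputs


# Toy data: two "denoised video latent" tokens and one reference token.
video = [[1.0, 0.0], [0.0, 1.0]]
reference = [[0.5, 0.5]]

out = reference_attention(video, reference)
out_no_ref = reference_attention(video, [])  # plain self-attention
```

In an actual decoder, a block like this would presumably sit at each up-sampling stage with learned projections and multiple heads. Passing an empty `ref_tokens` list reduces the sketch to plain self-attention over the video tokens, which mirrors the unconditional case the article contrasts against.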

Source

Original Article: RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

Published: May 14, 2026

Author: Xiang Fan

References

[1] ArXiv AI Papers. "RefDecoder: Enhancing Visual Generation with Conditional Video Decoding." May 14, 2026. https://arxiv.org/abs/2605.15196v1
