Poster
Making AutoEncoders Diffusable Again
Ivan Skorokhodov · Sharath Girish · Benran Hu · Willi Menapace · Yanyu Li · Rameen Abdal · Sergey Tulyakov · Aliaksandr Siarohin
Thu 17 Jul 11 a.m. PDT — 1:30 p.m. PDT
Abstract:
Latent Diffusion Models (LDMs) have emerged as the leading approach for generating high-quality images and videos, utilizing compressed latent representations to reduce the computational cost of diffusion processes. While recent advancements have primarily focused on scaling diffusion backbones and improving autoencoder reconstruction quality, the interaction between these components has received comparatively less attention. In this work, we conduct a spectral analysis of latent spaces in widely used autoencoders and identify a prominent high-frequency component that deviates from natural RGB signals, an effect that becomes even more pronounced in recent autoencoders with a large number of channels. We hypothesize that this high-frequency component reduces the efficiency of the diffusion process and increases generation complexity. To address this issue, we propose a simple yet effective spectral regularization strategy that aligns latent space representations with RGB signals across frequencies by enforcing scale equivariance in the decoder. Our method requires minimal modifications yet significantly improves generation quality, reducing FID by 19% for image generation on ImageNet-1K $256^2$ and FVD by at least 44% for video generation on Kinetics-700 $17 \times 256^2$ after up to $20,000$ autoencoder fine-tuning steps.
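The abstract does not spell out the exact form of the scale-equivariance regularizer, but one plausible reading is a penalty that makes decoding commute with spatial rescaling: decoding a downsampled latent should match downsampling the decoded output. Below is a minimal numpy sketch of such a penalty; `toy_decoder`, `avg_pool2x`, and `scale_equivariance_loss` are illustrative stand-ins, not the authors' implementation.

```python
import numpy as np

def avg_pool2x(x):
    # x: (C, H, W) -> (C, H/2, W/2) via 2x2 average pooling
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).mean(axis=(2, 4))

def toy_decoder(z):
    # Hypothetical stand-in for a learned decoder: a fixed 2x
    # nearest-neighbor upsampling (a real decoder would be a network).
    return z.repeat(2, axis=1).repeat(2, axis=2)

def scale_equivariance_loss(z, decoder):
    # Penalize || decode(down(z)) - down(decode(z)) ||^2, averaged
    # over elements; this is one possible form of a scale-equivariance
    # regularizer on the decoder.
    a = decoder(avg_pool2x(z))
    b = avg_pool2x(decoder(z))
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8, 8))   # toy latent: 4 channels, 8x8
loss = scale_equivariance_loss(z, toy_decoder)
```

During fine-tuning, a term like this would be added to the usual autoencoder reconstruction objective with a small weight, nudging the decoder (and hence the latent space it induces) toward spectra that behave like natural RGB signals under rescaling.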