Diffusion models have demonstrated state-of-the-art performance in generating high-quality images and videos. However, due to computational and optimization challenges, learning diffusion models in high-dimensional spaces remains a formidable task. Existing methods often resort to training cascaded models, where a low-resolution model is linked with one or several upscaling modules. In this paper, we introduce Matryoshka Diffusion Models (MDM), an end-to-end framework for high-resolution image and video synthesis. Instead of training separate models, we propose a multi-scale joint diffusion process, where smaller-scale models are nested within larger scales. This nesting structure not only facilitates feature sharing across scales but also enables progressive growing of the learned architecture, leading to significant improvements in optimization for high-resolution generation. We demonstrate the effectiveness of our approach on various benchmarks, including standard datasets like ImageNet, as well as high-resolution text-to-image and text-to-video applications. For instance, we achieve xx FID on ImageNet and xx FID on COCO. Notably, we can train a single pixel-space model at resolutions of up to 1024x1024 pixels with three nested scales.
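To make the nesting idea concrete, here is a minimal, hypothetical PyTorch sketch of a small-scale denoiser wrapped inside a larger-scale one and trained with a joint multi-scale loss. All names (`TinyDenoiser`, `NestedDenoiser`, `joint_diffusion_loss`), the toy noise schedule, and the architecture are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: a nested multi-resolution denoiser in the spirit of
# MDM's nesting idea. Names and architecture are hypothetical, not MDM's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyDenoiser(nn.Module):
    """Inner, low-resolution denoiser (the 'small doll')."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, x_low):
        return self.net(x_low)


class NestedDenoiser(nn.Module):
    """Outer, high-resolution denoiser that wraps the low-resolution one.

    The low-resolution prediction is upsampled and concatenated into the outer
    branch, so features learned at the small scale are shared with the large scale.
    """
    def __init__(self, inner: TinyDenoiser, channels=64):
        super().__init__()
        self.inner = inner
        self.net = nn.Sequential(
            nn.Conv2d(6, channels, 3, padding=1), nn.SiLU(),  # noisy high-res + upsampled low-res prediction
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, x_high, x_low):
        eps_low = self.inner(x_low)                              # denoise the small scale
        guide = F.interpolate(eps_low, size=x_high.shape[-2:],   # share it with the large scale
                              mode="bilinear", align_corners=False)
        eps_high = self.net(torch.cat([x_high, guide], dim=1))
        return eps_high, eps_low


def joint_diffusion_loss(model, x0, low_weight=1.0):
    """One step of a joint multi-scale epsilon-prediction loss (toy schedule)."""
    x0_low = F.avg_pool2d(x0, 4)                                 # nested low-resolution view of the same image
    noise_hi, noise_lo = torch.randn_like(x0), torch.randn_like(x0_low)
    alpha = torch.rand(x0.size(0), 1, 1, 1, device=x0.device)    # stand-in for a real noise schedule
    xt_hi = alpha.sqrt() * x0 + (1 - alpha).sqrt() * noise_hi
    xt_lo = alpha.sqrt() * x0_low + (1 - alpha).sqrt() * noise_lo
    pred_hi, pred_lo = model(xt_hi, xt_lo)
    return F.mse_loss(pred_hi, noise_hi) + low_weight * F.mse_loss(pred_lo, noise_lo)


if __name__ == "__main__":
    model = NestedDenoiser(TinyDenoiser())
    images = torch.randn(2, 3, 64, 64)                           # stand-in batch
    loss = joint_diffusion_loss(model, images)
    loss.backward()
    print(float(loss))
```

In this reading, the joint loss supervises both resolutions at every step, and the outer scale can be attached after the inner one has started converging, which is one way to interpret the progressive-growing benefit described above.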

Related readings and updates.

Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. However, current DLMs have been studied at a smaller scale compared to their AR counterparts and lack fair comparison on language modeling benchmarks. Additionally, training diffusion models from scratch at scale remains challenging. Given the prevalence of open-source AR…
This paper was accepted at the Foundation Models in the Wild workshop at ICML 2024. Diffusion models have become the de-facto approach for generating visual data, which are trained to match the distribution of the training dataset. In addition, we want to control generation to fulfill desired properties such as alignment to a text description, which can be specified with a black-box reward function. Prior works fine-tune pretrained diffusion…