Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation
Authors: Yuhui Zhang, Brandon McKinzie, Vaishaal Shankar, Zhe Gan, Alexander Toshev
This paper was accepted at the workshop I Can’t Believe It’s Not Better! (ICBINB) at NeurIPS 2023.
Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap and find that pre-trained language models offer limited help in auto-regressive text-to-image generation. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective at modeling them than randomly initialized ones. Second, the text tokens in image-text datasets are too simple compared to typical language model pre-training data, allowing even small randomly initialized language models to achieve the same perplexity as larger pre-trained ones, which causes catastrophic degradation of the pre-trained models' language capabilities.
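For concreteness, below is a minimal sketch of the auto-regressive text-to-image setup described above: caption tokens and discrete VQ image codes are placed in one shared vocabulary and a decoder-only transformer is trained with next-token prediction. All names, vocabulary sizes, sequence lengths, and architecture details are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch only: a decoder-only transformer over a shared text+image vocabulary.
# TEXT_VOCAB, IMAGE_VOCAB, SEQ_LEN, and D_MODEL are assumed values for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB = 32_000   # text tokenizer vocabulary size (assumed)
IMAGE_VOCAB = 8_192   # VQ codebook size (assumed)
SEQ_LEN = 64 + 256    # e.g., 64 caption tokens followed by a 16x16 grid of image codes
D_MODEL = 512

class TextToImageLM(nn.Module):
    """Causal transformer that models text tokens and image codes as one sequence."""
    def __init__(self):
        super().__init__()
        vocab = TEXT_VOCAB + IMAGE_VOCAB                 # shared vocabulary
        self.tok_emb = nn.Embedding(vocab, D_MODEL)
        self.pos_emb = nn.Embedding(SEQ_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(D_MODEL, vocab)

    def forward(self, tokens):                           # tokens: (B, T)
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)                              # (B, T, vocab)

# Training step: next-token prediction over the caption followed by image codes,
# with image codes offset into the shared vocabulary.
model = TextToImageLM()
text = torch.randint(0, TEXT_VOCAB, (2, 64))                        # caption tokens
image_codes = torch.randint(0, IMAGE_VOCAB, (2, 256)) + TEXT_VOCAB  # VQ codes, offset
seq = torch.cat([text, image_codes], dim=1)
logits = model(seq[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1))
loss.backward()
```

Under this framing, initializing the transformer from a pre-trained language model is the natural hope the paper tests; the analysis above explains why that initialization provides little benefit.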