This week, the Beijing Academy of Artificial Intelligence unveiled Emu3, a self-developed multimodal world model that achieves unified understanding and generation of video, images and text.
Emu3 validates that next-token prediction can serve as a powerful paradigm for multimodal models, scaling beyond language models and delivering state-of-the-art performance across multimodal tasks. In simple terms, it shows that predicting the next element in a sequence works not only for text, but also for models that handle images and video.
Emu3 relies solely on predicting the next token in a sequence, removing the need for complex methods such as diffusion or compositional approaches. It tokenizes images, text and video into a common discrete format, and trains a single transformer model from scratch on a mixture of multimodal sequences.
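The idea can be pictured with a short sketch: convert every modality into token ids drawn from one shared vocabulary, concatenate them into a single sequence, and train a causal transformer to predict each next token. The snippet below is a minimal illustration under assumed toy sizes; the vocabulary split, the TinyMultimodalLM class and the random "image tokens" are invented here for clarity and are not Emu3's actual tokenizer, architecture or code.

```python
# Illustrative sketch only: a tiny decoder-only transformer trained with
# next-token prediction on interleaved text and image tokens.
import torch
import torch.nn as nn

TEXT_VOCAB = 1000      # assumed text token ids: 0..999
IMAGE_VOCAB = 512      # assumed visual codebook ids, offset after the text ids
VOCAB_SIZE = TEXT_VOCAB + IMAGE_VOCAB
SEQ_LEN = 64

class TinyMultimodalLM(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos = nn.Embedding(SEQ_LEN, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):
        # Causal mask so each position only attends to earlier tokens.
        T = tokens.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        x = self.blocks(x, mask=mask)
        return self.head(x)

# A toy "multimodal" sequence: text tokens followed by image tokens,
# all living in one shared id space (image ids are offset by TEXT_VOCAB).
text_part = torch.randint(0, TEXT_VOCAB, (2, 16))
image_part = torch.randint(TEXT_VOCAB, VOCAB_SIZE, (2, 48))
sequence = torch.cat([text_part, image_part], dim=1)  # shape (batch, SEQ_LEN)

model = TinyMultimodalLM()
logits = model(sequence[:, :-1])                      # predict token t+1 from tokens <= t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), sequence[:, 1:].reshape(-1))
loss.backward()                                       # gradients for one training step
print(f"next-token loss: {loss.item():.3f}")
```

Because text and visual tokens share one vocabulary and one loss, the same model can, in principle, continue a sequence with either words or image tokens, which is what makes a single next-token objective usable for both understanding and generation.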
According to the academy, it has open-sourced Emu3's key technologies and models to the international tech community. Industry experts say that for researchers, Emu3 offers a new opportunity to explore multimodality through a unified architecture, without the need to combine complex diffusion models with large language models.
Wang Zhongyuan, director of the academy, said Emu3 has demonstrated strong performance in multimodal tasks through next-token prediction, paving the way for the development of multimodal artificial general intelligence.
"Emu3 has the potential to converge infrastructure development onto a single technical path, laying the foundation for large-scale multimodal training and inference," he said. "This simple architectural design will facilitate industrialization. In the future, multimodal world models will drive applications in scenarios such as robotic cognition, autonomous driving, multimodal conversations and reasoning."