Alibaba has announced the launch of Wan2.2, the industry’s first open-source large-scale video generation models built on the Mixture-of-Experts (MoE) architecture. This latest release marks a significant leap forward in AI-driven video creation, enabling developers and content creators to generate cinematic-quality videos with just a single click.
The Wan2.2 suite includes three models: Wan2.2-T2V-A14B, a text-to-video model; Wan2.2-I2V-A14B, designed for image-to-video generation; and Wan2.2-TI2V-5B, a hybrid model that integrates both text and image input capabilities within a unified framework.
Leveraging the MoE architecture and trained on high-quality, aesthetically curated data, Wan2.2-T2V-A14B and Wan2.2-I2V-A14B are capable of producing visually stunning outputs. These models allow users to fine-tune details such as lighting, time of day, color tone, camera angles, frame size, composition, and focal length, providing creators with granular control over visual storytelling.
Both MoE models demonstrate substantial improvements in motion representation, effectively capturing nuanced facial expressions, fluid hand gestures, and complex sports actions. They also exhibit stronger instruction-following capabilities and produce outputs that adhere more closely to physical realism.
To address the high computational cost of video generation, which stems from the long token sequences involved, the models adopt a two-expert design in the denoising phase of the diffusion process: a high-noise expert that establishes the overall scene layout and a low-noise expert that refines textures and finer details. Although each model contains 27 billion parameters in total, only 14 billion are activated per generation step, reducing computational load by up to 50%.
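For readers curious how such a noise-conditioned split might look in practice, the sketch below routes each diffusion step to one of two experts based on the current timestep. The class names, the boundary timestep, and the expert interfaces are illustrative assumptions, not Wan2.2's actual implementation.

```python
import torch
import torch.nn as nn

class TwoExpertDenoiser(nn.Module):
    """Illustrative two-expert MoE denoiser: each diffusion step is routed to
    one of two experts based on the noise level, so only one expert's
    parameters are active per step."""

    def __init__(self, high_noise_expert: nn.Module, low_noise_expert: nn.Module,
                 boundary_timestep: int = 500):  # switch-over point is an assumption
        super().__init__()
        self.high_noise_expert = high_noise_expert  # establishes overall scene layout
        self.low_noise_expert = low_noise_expert    # refines textures and fine detail
        self.boundary_timestep = boundary_timestep

    def forward(self, latents: torch.Tensor, timestep: int, cond: torch.Tensor) -> torch.Tensor:
        # Early (high-noise) steps go to the layout expert, late (low-noise)
        # steps to the detail expert; only the chosen expert runs.
        expert = (self.high_noise_expert if timestep >= self.boundary_timestep
                  else self.low_noise_expert)
        return expert(latents, timestep, cond)

# Usage with stand-in experts (in the real model each expert is a large diffusion backbone):
class DummyExpert(nn.Module):
    def forward(self, latents, timestep, cond):
        return latents  # placeholder noise prediction

denoiser = TwoExpertDenoiser(DummyExpert(), DummyExpert())
out = denoiser(torch.randn(1, 4, 8, 8), timestep=700, cond=torch.zeros(1, 16))
```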
Wan2.2 introduces fine-grained aesthetic customization through a cinematic prompt system that segments creative dimensions like lighting, composition, and color tone. This feature enables the models to precisely interpret and execute a user’s artistic vision during the generation process.
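To make this concrete, the snippet below assembles a prompt from separate aesthetic dimensions. The dimensions mirror those named in this article, but the helper function and keyword values are hypothetical and not Wan2.2's documented prompt vocabulary.

```python
# Hypothetical helper that composes a text-to-video prompt from separate
# creative dimensions (lighting, composition, color tone, camera work).
def build_cinematic_prompt(subject: str, **aesthetics: str) -> str:
    parts = [subject, *aesthetics.values()]
    return ", ".join(parts)

prompt = build_cinematic_prompt(
    "a lone sailboat crossing a stormy sea",
    lighting="golden hour backlight",
    composition="wide establishing shot, rule of thirds",
    color_tone="muted teal and orange",
    camera="slow dolly-in, 35mm focal length",
)
print(prompt)
```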
The training dataset for Wan2.2 reflects a substantial upgrade over its predecessor, Wan2.1, with a 65.6% increase in image data and an 83.2% increase in video data. These enhancements allow Wan2.2 to generate more complex scenes and motions while significantly expanding its capacity for artistic expression and visual creativity.
Efficiency and Scalability with a Compact Hybrid Model
A standout addition to the Wan2.2 lineup is Wan2.2-TI2V-5B, a dense hybrid model optimized for efficiency. It incorporates a high-compression 3D VAE architecture that achieves a 4x16x16 compression ratio across the temporal and spatial dimensions, for an overall compression rate of 64. This enables the generation of 5-second 720P videos within minutes on a single consumer-grade GPU, making it well suited to scalable and resource-efficient deployments.
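As a rough illustration of what the stated 4x (temporal) by 16x16 (spatial) downsampling means for a 5-second 720P clip, the sketch below computes the resulting latent dimensions. The frame rate and the rounding behaviour are assumptions made for the example, not figures from the announcement.

```python
import math

def latent_shape(frames: int, height: int, width: int,
                 t_factor: int = 4, s_factor: int = 16) -> tuple:
    """Map video dimensions to latent dimensions under the stated
    4x temporal and 16x16 spatial downsampling (rounding is assumed)."""
    return (math.ceil(frames / t_factor),
            math.ceil(height / s_factor),
            math.ceil(width / s_factor))

# A 5-second 720P clip at an assumed 24 fps:
print(latent_shape(frames=5 * 24, height=720, width=1280))  # -> (30, 45, 80)
```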
Developers and researchers can access the Wan2.2 models via Hugging Face, GitHub, and Alibaba Cloud’s open-source platform, ModelScope. Alibaba has been a key contributor to the global open-source community, previously open-sourcing four Wan2.1 models in February 2025 and Wan2.1-VACE (Video All-in-one Creation and Editing) in May 2025. Collectively, these models have already surpassed 5.4 million downloads across Hugging Face and ModelScope.
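For those who want to try the weights locally, a minimal way to fetch a checkpoint from Hugging Face is via the huggingface_hub library's snapshot_download function; the repository ID below is inferred from the model name in the announcement and should be checked against the actual listing.

```python
# Minimal sketch: download the model files from Hugging Face.
from huggingface_hub import snapshot_download

# Repo ID is an assumption based on the announced model name.
local_dir = snapshot_download(repo_id="Wan-AI/Wan2.2-TI2V-5B")
print(f"Model files downloaded to: {local_dir}")
```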