NeurIPS 2025 Accepted Research

Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance

A simple and scalable framework that generates high-quality 5-second videos with precise point-level motion control. Create videos where every element moves exactly as you specify through dense point trajectories.

Understanding Wan-Move

Wan-Move represents a significant step forward in motion-controllable video generation. This framework was developed through collaboration between researchers at Tongyi Lab at Alibaba Group, Tsinghua University, the University of Hong Kong, and the Chinese University of Hong Kong. The research paper was accepted at NeurIPS 2025, demonstrating its contribution to the field of computer vision and artificial intelligence.

The fundamental challenge that Wan-Move addresses is motion control in generated videos. While text-to-video models can create impressive results from text descriptions, controlling how specific objects move within those videos has remained difficult. Wan-Move solves this through a method called latent trajectory guidance, which allows users to define precise motion paths for elements in the scene.

At its core, Wan-Move takes an initial image and a set of trajectories as input. The trajectories are defined as dense point paths that specify where parts of the image should move over time. The model then generates a 5-second video at a resolution of 832×480 pixels in which the motion follows these specified paths. This approach provides fine-grained control over motion at the point level, allowing for detailed choreography of scene elements.
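
As a rough illustration of what these inputs might look like, the sketch below defines placeholder arrays; the frame count, point count, and array layout are assumptions for illustration, not the official interface:

```python
import numpy as np

# Assumed, illustrative shapes -- not the official Wan-Move interface.
num_frames = 81           # assumed frame count for a 5-second clip
num_points = 1024         # assumed number of tracked points
height, width = 480, 832  # output resolution stated above

first_frame = np.zeros((height, width, 3), dtype=np.uint8)              # initial RGB image
trajectories = np.zeros((num_frames, num_points, 2), dtype=np.float32)  # (x, y) per point per frame
visibility = np.ones((num_frames, num_points), dtype=bool)              # is each point visible in each frame?
```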

The framework is built on top of Wan-I2V-14B, a 14-billion parameter image-to-video foundation model. What makes Wan-Move particularly practical is that it implements motion control as a minimal extension to the existing architecture. This means no specialized motion modules or major architectural changes are required. For researchers already familiar with Wan2.1, most of the existing setup can be reused with minimal migration effort.

Latent Trajectory Guidance Explained

The core innovation in Wan-Move is the latent trajectory guidance technique. This method addresses a fundamental question: how do we convey motion information to a video generation model in a way that provides precise control while integrating naturally with existing architectures?

The solution works by taking features from the first frame of the video and propagating them along user-defined trajectories. Think of it as marking specific points on objects in the initial image, then telling the model exactly where those points should appear in each subsequent frame. The model learns to generate video content that respects these trajectory constraints while maintaining video quality, temporal consistency, and natural motion.
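
A conceptual sketch of this idea is shown below. It is not the authors' implementation; it simply illustrates one way first-frame latent features could be carried along trajectories to form a conditioning signal, with all tensor shapes assumed for illustration:

```python
import torch

def propagate_first_frame_features(feat0, trajs, visibility):
    """Conceptual illustration of latent trajectory guidance (not the official code).

    feat0:      (C, H, W) latent features of the first frame
    trajs:      (T, N, 2) (x, y) positions of N tracked points over T frames,
                assumed to be expressed in the latent grid's coordinate system
    visibility: (T, N) boolean mask; occluded points contribute nothing
    Returns a (T, C, H, W) volume that carries each point's first-frame
    feature to that point's position in every later frame.
    """
    C, H, W = feat0.shape
    T, N, _ = trajs.shape

    # Snap trajectory coordinates to valid grid indices.
    trajs = trajs.float().round().long().clamp_min(0)
    trajs[..., 0] = trajs[..., 0].clamp(max=W - 1)
    trajs[..., 1] = trajs[..., 1].clamp(max=H - 1)

    cond = torch.zeros(T, C, H, W, dtype=feat0.dtype)

    # Features sampled at each point's location in the first frame.
    x0, y0 = trajs[0, :, 0], trajs[0, :, 1]
    point_feats = feat0[:, y0, x0]                   # (C, N)

    for t in range(T):
        vis = visibility[t]
        xt, yt = trajs[t, vis, 0], trajs[t, vis, 1]
        # Place visible points' first-frame features at their frame-t positions.
        cond[t, :, yt, xt] = point_feats[:, vis]
    return cond
```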

This approach offers several important advantages. First, it requires no modifications to the underlying video generation architecture. The trajectory information is provided as an additional conditioning signal that guides the generation process. Second, it allows for precise control at the point level, meaning individual parts of objects can be controlled independently. Third, it scales naturally to handle multiple objects, each with its own set of trajectory paths.

During training, the model learns from video data paired with trajectory annotations. It learns to associate trajectory patterns with corresponding motion in the video. Once trained, the system can generate new videos where motion follows user-specified trajectories. The generated videos maintain high video quality while accurately following the intended movement patterns.

Technical Specifications

Model Name: Wan-Move-14B-480P
Research Category: Motion-Controllable Video Generation
Model Parameters: 14 Billion
Video Duration: 5 seconds
Video Resolution: 832×480 pixels
Foundation Model: Wan-I2V-14B
Control Method: Dense Point Trajectories
Conference: NeurIPS 2025
License: Apache 2.0

Core Capabilities

Wan-Move provides powerful features for precise motion control in video generation

High-Quality 5-Second Videos

Through scaled training, Wan-Move generates 5-second videos at 480p resolution with state-of-the-art motion controllability. User studies demonstrate performance comparable to commercial systems.

Latent Trajectory Guidance

The core technique represents motion by propagating first-frame features along trajectories. This integrates into existing image-to-video models without architecture changes.

Point-Level Control

Define motion using dense point trajectories for precise control over how each element moves. This enables detailed choreography at the region level.

MoveBench Benchmark

A dedicated evaluation benchmark with diverse content categories, high-quality trajectory annotations, and standardized test cases for fair comparison.

Minimal Extension Design

Built on Wan-I2V-14B with minimal modifications. Users familiar with Wan2.1 can reuse existing setups with low migration cost.

Multi-GPU Support

Supports FSDP and xDiT USP acceleration for faster inference. Includes memory optimization options for reduced GPU requirements.

MoveBench: Evaluation Benchmark

Alongside the Wan-Move framework, the research team introduced MoveBench, a carefully curated benchmark for evaluating motion-controllable video generation systems. This benchmark addresses the lack of standardized evaluation methods in the field by providing a comprehensive set of test cases with high-quality annotations.

MoveBench includes samples across diverse content categories, featuring both single-object and multi-object scenarios. Each sample contains a reference image, trajectory annotations in NumPy array format, visibility masks indicating when points are occluded, and corresponding text descriptions. The benchmark supports both English and Chinese language options, making it accessible to researchers worldwide.
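
The snippet below sketches how one such sample might be loaded; the file names and array shapes are assumptions for illustration rather than the benchmark's actual layout:

```python
import numpy as np
from PIL import Image

# Hypothetical file layout for one MoveBench sample; the real naming may differ.
sample_dir = "movebench/sample_0001"
image = Image.open(f"{sample_dir}/reference.png")          # first-frame image
tracks = np.load(f"{sample_dir}/tracks.npy")               # e.g. (num_frames, num_points, 2)
visibility = np.load(f"{sample_dir}/visibility.npy")       # e.g. (num_frames, num_points) booleans
with open(f"{sample_dir}/caption_en.txt", encoding="utf-8") as f:
    caption = f.read().strip()                             # English caption (a Chinese one is also provided)

print(image.size, tracks.shape, visibility.shape, caption)
```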

The construction pipeline for MoveBench involved several steps: careful curation of video content representing various motion types, extraction of trajectory data using motion tracking methods, annotation of visibility information for handling occlusions, and quality verification to ensure accuracy. The result is a reliable benchmark that can evaluate whether generated videos match intended motion patterns.

Researchers can use MoveBench to compare different motion control approaches quantitatively. The benchmark provides metrics for motion accuracy, temporal consistency, and video quality. This standardization helps track progress in the field and enables fair comparison between different methods, both academic and commercial.
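
As a concrete, simplified example of a motion-accuracy measure, one could re-track the generated video and compare the recovered point positions with the target trajectories; the helper below computes a mean per-point error in pixels over visible points. This is an illustration, not necessarily the metric MoveBench uses:

```python
import numpy as np

def mean_trajectory_error(pred_tracks, gt_tracks, visibility):
    """Mean Euclidean distance (in pixels) between re-tracked points in the
    generated video and the target trajectories, averaged over all frames
    and points marked visible. Illustrative metric only."""
    dist = np.linalg.norm(pred_tracks - gt_tracks, axis=-1)  # (T, N)
    return float(dist[visibility].mean())
```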

Application Scenarios

Wan-Move enables diverse motion control applications for various use cases

Single-Object Motion

Control the movement of individual objects within a scene. Define the precise path an object should follow, and Wan-Move generates video where that object moves along the specified trajectory while maintaining natural appearance and proper interaction with the environment and lighting.

Multi-Object Choreography

Choreograph multiple objects simultaneously, each following independent trajectories. This enables complex scenes where different elements move in coordinated or independent patterns, creating dynamic and interesting compositions with multiple moving parts.

Camera Movement Control

Simulate professional camera movements without physical equipment. Support for various camera operations including panning, dollying in and out, linear displacement, and other cinematic movements that create professional-looking results.
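
One simple way to approximate a pan, for instance, is to move every point of a regular grid by the same per-frame offset; the sketch below builds such trajectories, with all values chosen purely for illustration:

```python
import numpy as np

def make_pan_trajectories(width=832, height=480, num_frames=81,
                          grid=32, shift_px=120.0):
    """Sketch of dense trajectories emulating a rightward camera pan:
    every point on a regular grid drifts left by the same amount,
    which is how a rightward pan appears from the scene's point of view.
    Frame count, grid size, and shift are illustrative assumptions."""
    xs = np.linspace(0, width - 1, grid)
    ys = np.linspace(0, height - 1, grid)
    gx, gy = np.meshgrid(xs, ys)
    points = np.stack([gx.ravel(), gy.ravel()], axis=-1)    # (N, 2)

    offsets = np.linspace(0.0, -shift_px, num_frames)       # per-frame x shift
    tracks = np.repeat(points[None], num_frames, axis=0)    # (T, N, 2), independent copy
    tracks[:, :, 0] += offsets[:, None]

    visibility = tracks[:, :, 0] >= 0                       # points pushed off-frame become occluded
    tracks[:, :, 0] = np.clip(tracks[:, :, 0], 0, width - 1)
    return tracks.astype(np.float32), visibility
```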

Motion Transfer

Extract motion patterns from one video and apply them to different content. This allows for reusing successful motion templates across different scenes and subjects, enabling consistent motion styles and efficient content creation workflows.

3D Rotation

Generate videos showing objects rotating in three-dimensional space. Particularly useful for product demonstrations, architectural visualization, and any application requiring 360-degree views of objects or scenes from different angles.

Content Creation

Create animated content for marketing, education, entertainment, and social media. The precise motion control enables professional-quality animations without traditional animation software, extensive expertise, or lengthy production processes.

Implementation and Setup

Wan-Move is implemented as a minimal extension on top of the Wan2.1 codebase, offering practical benefits for researchers and developers. The model requires Python with PyTorch 2.4.0 or later and can be installed using standard Python package management tools. The installation process is straightforward, with dependencies listed in a requirements file.

Model weights are available through both Hugging Face and ModelScope platforms. The Wan-Move-14B-480P checkpoint contains the trained parameters for generating 5-second videos at 480p resolution. Download tools are provided through command-line interfaces on both platforms, making it easy to obtain the necessary files. The model size is approximately 30GB, so adequate disk space is required.
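
For example, the weights could be fetched with the Hugging Face hub client as sketched below; the repository id is an assumption based on the checkpoint name and may differ from the actual one:

```python
from huggingface_hub import snapshot_download

# Assumed repository id -- check the project page for the actual name.
snapshot_download(
    repo_id="Wan-AI/Wan-Move-14B-480P",
    local_dir="./Wan-Move-14B-480P",
)
```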

For inference, the framework supports both single-GPU and multi-GPU configurations. Single-GPU inference works well for generating individual videos or small batches. For larger-scale evaluation or production use, multi-GPU inference with FSDP provides significant speedup. The system includes options to reduce memory usage through model offloading and CPU execution of certain components.

Trajectory data is provided in NumPy array format, with separate files for trajectory coordinates and visibility masks. The trajectory file contains x,y coordinates for each tracked point across all frames. The visibility mask indicates when points are occluded or leave the frame. This straightforward format makes it easy to create custom trajectory data or integrate with existing motion tracking tools.
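
A minimal sketch of writing such custom trajectory data is shown below, assuming a (frames, points, 2) coordinate layout and file names chosen purely for illustration:

```python
import numpy as np

# Assumed layout: tracks are (num_frames, num_points, 2) with (x, y) pixel
# coordinates; visibility is (num_frames, num_points) booleans.
num_frames, num_points = 81, 16

# A small cluster of points moving diagonally across an 832x480 frame.
start, end = np.array([200.0, 150.0]), np.array([600.0, 350.0])
t = np.linspace(0.0, 1.0, num_frames)[:, None, None]           # (T, 1, 1)
spread = np.random.uniform(-10, 10, size=(1, num_points, 2))   # spatial spread of the cluster
tracks = start + t * (end - start) + spread                    # (T, N, 2)

visibility = np.ones((num_frames, num_points), dtype=bool)     # no occlusions in this toy case

np.save("my_tracks.npy", tracks.astype(np.float32))            # hypothetical file names
np.save("my_visibility.npy", visibility)
```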

Performance and Evaluation

User studies comparing Wan-Move with both academic methods and commercial solutions demonstrate competitive motion controllability. The evaluation involved showing generated videos to users and asking them to assess motion accuracy, video quality, and overall realism. Results indicate that Wan-Move achieves performance on par with commercial systems such as Kling 1.5 Pro's Motion Brush feature.

Compared to other academic approaches in motion-controllable video generation, Wan-Move offers several advantages. The latent trajectory guidance technique requires no specialized architecture components, making it simpler to implement and adapt. The point-level control provides more precision than methods relying on region-based or text-based motion descriptions. The integration with existing models means researchers can build on established foundations rather than starting from scratch.

Qualitative comparisons show that Wan-Move produces videos with accurate motion that follows specified trajectories while maintaining high video quality and temporal consistency. The generated videos exhibit natural motion, proper lighting and shadows, and coherent scene dynamics. Objects move smoothly along their trajectories without artifacts or unnatural jumps.

The release of MoveBench provides a standardized way to measure progress in motion-controllable video generation. By evaluating on this benchmark, researchers can objectively compare different approaches and track improvements over time. This standardization benefits the entire research community by enabling fair and reproducible comparisons.

Research and Development

Wan-Move was developed through collaboration between multiple leading institutions. The research team includes contributors from Tongyi Lab at Alibaba Group, Tsinghua University, the University of Hong Kong, and the Chinese University of Hong Kong. The principal researchers are Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Bin Xia, Dingdong Wang, Hongwei Yi, Xihui Liu, Hengshuang Zhao, Yu Liu, Yingya Zhang, and Yujiu Yang.

The paper describing Wan-Move was accepted at NeurIPS 2025, the Conference on Neural Information Processing Systems. NeurIPS is one of the premier conferences in machine learning and artificial intelligence, known for rigorous peer review and high-quality research. Acceptance at this conference reflects the significance of the work and its contribution to the field.

The comprehensive release includes not just the model weights but also the complete source code, evaluation benchmark, and detailed documentation. This open approach enables other researchers to reproduce the results, understand the implementation details, build upon the foundation, and advance the field of motion-controllable video generation. The code is available on GitHub under the ali-vilab organization.

Get Started with Wan-Move

Explore the capabilities of motion-controllable video generation through latent trajectory guidance