No More "Silent Clip First, Audio Later"
Traditional tools often output silent visuals first, forcing separate sound design and manual synchronization. V4 generates audiovisual clips together, eliminating this extra post-production step.
SkyReels-v4 jointly generates, inpaints, and edits video with synchronized audio in one model. It combines dual-stream MMDiT with shared MMLM text understanding and supports text, image, video, mask, and audio references.
A channel-concatenation formulation unifies image-to-video, video extension, video editing, and reference-driven inpainting under one workflow. Complex visual conditioning is handled consistently through multimodal prompts.
Video and audio are produced by two temporally aligned MMDiT branches that share a powerful MMLM-based text encoder. The video branch uses in-context learning for fine-grained visual control, while the audio branch uses audio references for guided sound generation.
SkyReels-v4 supports up to 1080p, 32 FPS, and 15-second duration for multi-shot cinematic output with synchronized audio. Efficiency comes from joint low-resolution full-sequence generation plus high-resolution keyframes, followed by dedicated super-resolution and frame interpolation models.
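The efficiency strategy above can be sketched as a small pipeline: generate a low-resolution full sequence, render sparse keyframes at high resolution, then interpolate back to the full frame rate. All function names below are illustrative stand-ins, not the actual SkyReels models or API.

```python
# Hypothetical sketch of the two-stage pipeline: low-res full sequence
# plus high-res keyframes, finished by super-resolution and frame
# interpolation. Function names are assumptions for illustration only.

def generate_low_res(num_frames):
    """Stand-in for joint low-resolution full-sequence generation."""
    return [("lowres", i) for i in range(num_frames)]

def pick_keyframes(frames, stride):
    """Select sparse frames to render at high resolution."""
    return frames[::stride]

def super_resolve(frame):
    """Stand-in for the dedicated super-resolution model."""
    return ("highres", frame[1])

def interpolate(keyframes, target_len):
    """Stand-in for frame interpolation back to the full frame rate."""
    out = []
    for i in range(target_len):
        # Nearest upscaled keyframe; a real interpolator blends neighbors.
        out.append(keyframes[min(i * len(keyframes) // target_len,
                                 len(keyframes) - 1)])
    return out

clip_len = 15 * 32  # 15 seconds at 32 FPS = 480 frames
draft = generate_low_res(clip_len)
keys = [super_resolve(f) for f in pick_keyframes(draft, stride=16)]
final = interpolate(keys, clip_len)
print(len(final))  # 480
```

The point of the split is that only the sparse keyframes pay the high-resolution cost; the interpolation model fills in the remaining frames.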
Based on creator feedback and technical notes, SkyReels-v4 reduces post-production friction by generating synchronized video and audio in one pass, while still supporting advanced multimodal control.
A video branch and audio branch run in parallel with one shared multimodal text encoder, so motion and sound remain temporally aligned from the same semantic intent.
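The shared-encoder arrangement can be shown in miniature: one encoding of the prompt conditions both branches, so the two streams cover identical timesteps. This is an assumed structural sketch, not the real model code.

```python
# Minimal sketch (assumed structure, not SkyReels internals) of two
# parallel branches driven by one shared text encoding, so video and
# audio derive from the same semantic intent over the same timesteps.

def shared_text_encoder(prompt):
    """Stand-in for the shared MMLM-based text encoder."""
    return {"prompt": prompt, "tokens": prompt.lower().split()}

def video_branch(cond, num_steps):
    """Stand-in for the video MMDiT branch."""
    return [{"t": t, "frame": cond["prompt"]} for t in range(num_steps)]

def audio_branch(cond, num_steps):
    """Stand-in for the audio MMDiT branch."""
    return [{"t": t, "sample": cond["prompt"]} for t in range(num_steps)]

cond = shared_text_encoder("A door slams shut")
video = video_branch(cond, num_steps=8)
audio = audio_branch(cond, num_steps=8)

# Temporal alignment: both streams index the same timesteps.
assert [v["t"] for v in video] == [a["t"] for a in audio]
```

Because both branches condition on the same encoding and step schedule, motion and sound cannot drift apart in timing or intent.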
SkyReels-v4 supports text prompts, reference images, video clips, masks, and audio references for generation, inpainting, and targeted edits within a consistent workflow.
The current target output is up to 1080p at 32 FPS for 15-second clips with synced audio. Access is rolling out in stages, with broader API availability expected after preview.
SkyReels-v4 supports reference-driven generation, continuation of existing clips, stylistic transformation, and precise text-based edits with consistent visual identity.
Use one or more references plus text prompts to preserve character, object, and background identity while keeping narrative continuity across shots.
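A reference-driven request of this kind might combine a text prompt, several references, and an optional mask. The field names below are assumptions for illustration, not the documented SkyReels API.

```python
# Illustrative request payload (field names are hypothetical, not the
# documented SkyReels API) combining references with a text prompt to
# preserve identity across shots, with an optional inpainting mask.

def build_request(prompt, references, mask=None):
    """Assemble a multimodal generation request."""
    if not references:
        raise ValueError("at least one reference is required")
    req = {
        "prompt": prompt,
        "references": [{"type": kind, "uri": uri} for kind, uri in references],
    }
    if mask is not None:
        req["mask"] = mask  # restricts edits to a region for inpainting
    return req

req = build_request(
    "The knight walks through the rain-soaked courtyard",
    references=[("image", "knight_front.png"), ("image", "courtyard.png")],
)
print(len(req["references"]))  # 2
```

Adding more references of the same subject constrains identity more tightly; the mask field is only needed for targeted inpainting edits.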
Guide composition and motion with textual instructions while maintaining the key visual identity learned from references.
Extend existing videos while preserving subject appearance, scene coherence, and temporal continuity from the original sequence.
Reimagine videos with bold new visual styles, keeping core motion and scene layout while changing the final artistic identity.
Refine details with natural-language instructions to produce cleaner composition, stronger detail, and more consistent style behavior.
Prompt our models to create breathtaking videos with rich details. Discover the art of the possible.
A vibrant cyberpunk city street at night, neon lights reflecting off puddles on the ground. People walking with umbrellas. Cinematic lighting, photorealistic, 4k.
A majestic eagle soaring over snow-capped mountains, dramatic golden hour lighting, cinematic drone shot circling the subject, high details, 8k resolution.
Detailed macro shot of a coffee being poured into a ceramic cup, slow motion, coffee splashes forming perfect droplets, warm cozy cafe atmosphere in the background.
An astronaut floating in a highly detailed futuristic space station looking through a large window at Earth. Cinematic lighting, volumetric scattering, hyper-realistic.
Everything you need to know about SkyReels-v4's multimodal video and audio workflow.
Build video + audio generation, inpainting, and editing with one unified multimodal model.