SkyReels-v4

Unified Multimodal Video + Audio Foundation Model

SkyReels-v4 generates, inpaints, and edits video with synchronized audio in a single model. It combines a dual-stream MMDiT with a shared MMLM-based text encoder and accepts text, image, video, mask, and audio references.

Starting from the 13th

What makes SkyReels-V4 fundamentally different

Unified Multimodal Core

One Interface for Generation, Inpainting, and Editing

A channel-concatenation formulation unifies image-to-video, video extension, video editing, and reference-driven inpainting under one workflow. Complex visual conditioning is handled consistently through multimodal prompts.
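To make the channel-concatenation idea concrete, here is a minimal sketch in plain Python. It is not the SkyReels-v4 implementation. The mask convention (0 = frame is given, 1 = frame must be generated), the helper names `task_mask` and `build_model_input`, and the omission of spatial dimensions are all assumptions made for illustration. The point is that different tasks differ only in the mask and conditioning, while the model input keeps one layout.

```python
from typing import List

Frame = List[float]  # channel values for one frame; spatial dims omitted for brevity


def task_mask(num_frames: int, known: List[int]) -> List[float]:
    """Per-frame mask (assumed convention): 0.0 = given, 1.0 = to be generated."""
    known_set = set(known)
    return [0.0 if t in known_set else 1.0 for t in range(num_frames)]


def build_model_input(noisy: List[Frame], cond: List[Frame], mask: List[float]) -> List[Frame]:
    """Concatenate [noisy latent | visible conditioning | mask] along the channel axis."""
    if not (len(noisy) == len(cond) == len(mask)):
        raise ValueError("noisy, cond, and mask must cover the same frames")
    out = []
    for n, c, m in zip(noisy, cond, mask):
        # Hide the conditioning wherever the model has to generate content.
        visible = [v * (1.0 - m) for v in c]
        out.append(n + visible + [m])
    return out


T, C = 6, 4  # hypothetical frame count and latent channel count

image_to_video = task_mask(T, known=[0])         # only the first frame is given
video_extension = task_mask(T, known=[0, 1, 2])  # the leading frames are given
text_to_video = task_mask(T, known=[])           # nothing is given

noisy = [[0.5] * C for _ in range(T)]
cond = [[1.0] * C for _ in range(T)]
x = build_model_input(noisy, cond, image_to_video)
print(len(x), len(x[0]))  # 6 frames, 2 * C + 1 = 9 channels each
```

With a per-pixel mask instead of a per-frame one, the same layout would also cover inpainting and localized editing, which is presumably why one formulation can serve all of these tasks.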

Dual-Stream Architecture

Joint Video-Audio Synthesis with Shared Language Understanding

Video and audio are produced by two temporally aligned MMDiT branches that share one MMLM-based text encoder. The video branch uses in-context learning for fine-grained visual control, while the audio branch uses audio references to guide sound generation.

Cinematic Scale + Efficiency

1080p, 32 FPS, 15s with Practical Compute

SkyReels-v4 supports up to 1080p, 32 FPS, and 15-second duration for multi-shot cinematic output with synchronized audio. Efficiency comes from jointly generating the full sequence at low resolution together with high-resolution keyframes, then applying dedicated super-resolution and frame-interpolation models.
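A rough back-of-the-envelope calculation shows why the staged approach can save compute. Only the 1080p, 32 FPS, and 15-second target comes from the text above. The 1920x1080 frame size, the base-stage resolution and frame rate, and the one-keyframe-per-second budget are hypothetical values chosen for illustration, and raw pixel count is only a crude proxy for generation cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipSpec:
    width: int
    height: int
    fps: int
    seconds: int

    @property
    def frames(self) -> int:
        return self.fps * self.seconds

    @property
    def pixels(self) -> int:
        # Total pixels across all frames; a crude proxy for generation cost.
        return self.width * self.height * self.frames


# Target output, assuming 1080p means 1920x1080.
target = ClipSpec(width=1920, height=1080, fps=32, seconds=15)

# Hypothetical base stage: 4x smaller per side and half the frame rate.
base = ClipSpec(width=480, height=270, fps=16, seconds=15)

# Hypothetical keyframe budget: one full-resolution keyframe per second.
keyframe_pixels = target.width * target.height * target.seconds

staged = base.pixels + keyframe_pixels
print(target.frames)            # 480
print(target.pixels // staged)  # 16
```

Under these assumed numbers, the generative stage touches about 16 times fewer pixels than generating all 480 frames at full resolution directly, leaving the remaining detail and in-between frames to the super-resolution and interpolation models.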

In Practice

How SkyReels-v4 improves real production workflows

Based on creator feedback and technical notes, SkyReels-v4 reduces post-production friction by generating synchronized video and audio in one pass, while still supporting advanced multimodal control.

Production Bottleneck

No More "Silent Clip First, Audio Later"

Traditional tools often output silent visuals first, forcing separate sound design and manual synchronization. SkyReels-v4 generates video and audio together, removing this extra post-production step.

Unified Model

Dual-Stream MMDiT with Shared Text Understanding

A video branch and audio branch run in parallel with one shared multimodal text encoder, so motion and sound remain temporally aligned from the same semantic intent.

Input Flexibility

Five Input Modalities in One Interface

SkyReels-v4 supports text prompts, reference images, video clips, masks, and audio references for generation, inpainting, and targeted edits within a consistent workflow.
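As an illustration of how the five modalities might travel together in one request, here is a small sketch. `GenerationRequest`, its field names, and its validation rules are hypothetical; no public SkyReels-v4 API schema is described here, so treat this only as one way a client could organize its inputs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GenerationRequest:
    """Hypothetical request container; names are illustrative, not a real API."""
    prompt: str
    image_ref: Optional[str] = None  # path to a reference image
    video_ref: Optional[str] = None  # path to a source clip
    mask: Optional[str] = None       # path to a mask for inpainting or edits
    audio_ref: Optional[str] = None  # path to an audio reference

    def modalities(self) -> List[str]:
        """List the modalities this request actually uses."""
        names = ["text"]
        if self.image_ref:
            names.append("image")
        if self.video_ref:
            names.append("video")
        if self.mask:
            names.append("mask")
        if self.audio_ref:
            names.append("audio")
        return names

    def validate(self) -> None:
        # Assumed rules: a prompt is always present, and a mask is only
        # meaningful when there is an image or video to apply it to.
        if not self.prompt.strip():
            raise ValueError("a text prompt is required")
        if self.mask and not (self.image_ref or self.video_ref):
            raise ValueError("a mask needs an image or video to apply to")


request = GenerationRequest(
    prompt="Replace the umbrella with a red one",
    video_ref="street.mp4",
    mask="umbrella_mask.png",
)
request.validate()
print(request.modalities())  # ['text', 'video', 'mask']
```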

Current Rollout

1080p / 32 FPS / 15s Today, Expansion Next

The current target output is up to 1080p at 32 FPS for 15-second clips with synced audio. Access is rolling out in stages, with broader API availability expected after preview.

Creation Modes

Reference, extension, restyling, and text editing in one flow

SkyReels-v4 supports reference-driven generation, continuation of existing clips, stylistic transformation, and precise text-based edits with consistent visual identity.

Reference Consistency

Keep subject identity stable under new directions

Guide composition and motion with textual instructions while maintaining the key visual identity learned from references.

Video Extension

Continue clips without breaking motion logic

Extend existing videos while preserving subject appearance, scene coherence, and temporal continuity from the original sequence.

Video Restyling

Transform artistic look while preserving structure

Reimagine videos with bold new visual styles, keeping core motion and scene layout while changing the final artistic identity.

Text Editing

Apply direct textual edits to visual outcomes

Refine details with natural-language instructions to produce cleaner composition, sharper detail, and more consistent style.

Gallery

See what SkyReels-V4 can create

Prompt our models to create breathtaking videos with rich details. Discover the art of the possible.

A vibrant cyberpunk city street at night, neon lights reflecting off puddles on the ground. People walking with umbrellas. Cinematic lighting, photorealistic, 4k.

A majestic eagle soaring over snow-capped mountains, dramatic golden hour lighting, cinematic drone shot circling the subject, high details, 8k resolution.

Detailed macro shot of a coffee being poured into a ceramic cup, slow motion, coffee splashes forming perfect droplets, warm cozy cafe atmosphere in the background.

An astronaut floating in a highly detailed futuristic space station looking through a large window at Earth. Cinematic lighting, volumetric scattering, hyper-realistic.

Frequently Asked Questions

Everything you need to know about SkyReels-v4's multimodal video and audio workflow.

Ready to build with SkyReels-v4?

Build video + audio generation, inpainting, and editing with one unified multimodal model.
