
How to Animate a Single Image: A Nano Banana → SkyReels Workflow
Animate an image with AI: generate a still in Nano Banana, then feed it to SkyReels V4 as a first frame to produce 1080p motion clips up to 15 seconds.
If you have ever generated a near-perfect image in Nano Banana and then wished it would move (pan across the subject, push in for a portrait, rotate the product), you have hit the core gap of 2026 generative media. Image models keep getting sharper. Video models keep getting longer. But the two pipelines still live in different places, and the bridge between them is whatever workflow you build by hand.
This guide walks through one such bridge: generating a still in Google's Nano Banana family (Nano Banana, Nano Banana Pro, or Nano Banana 2) and handing that frame to Skywork AI's SkyReels V4 for image-to-video animation. It is not a benchmark. The framing is: if you wanted to animate an image with AI, here is how the pieces fit together based on each maker's documentation.
TL;DR
- Nano Banana family (Google DeepMind) does still images: text-to-image, multi-image blending, character consistency. It does not generate video. (Google blog)
- SkyReels V4 (Skywork AI) is a unified video-audio model that accepts a still image as the first frame, optionally an end frame, plus text describing the motion. Up to 1080p, 32 FPS, 15 seconds. (arXiv 2602.21818)
- The pipeline is: generate base image → feed to SkyReels V4 as the first frame → describe the motion in text → optionally provide an end frame for a controlled transition.
- Match aspect ratios across both stages and stay within SkyReels V4's 15-second duration cap to avoid wasted credits.
Why the gap exists in the first place
Nano Banana 2 (officially gemini-3.1-flash-image-preview) is, per Google's developer docs, an image generation and editing model. It accepts text prompts and reference images, supports up to four character reference images for identity consistency and up to 10 object references, and outputs across fourteen aspect ratios from square to ultrawide. (Google AI for Developers) Nano Banana Pro (gemini-3-pro-image-preview) extends this with native 4K output and explicit support for up to five characters and 14 object inputs per workflow. (Google DeepMind)
Neither produces video. That is by design: the Gemini Image branch focuses on still quality and editing fidelity, while motion lives in a separate stack.
SkyReels V4, released by Skywork AI in February 2026, sits on the other side of that line. According to its arXiv paper, it is "a unified multi-modal video foundation model for joint video-audio generation, inpainting, and editing." It uses a dual-stream Multimodal Diffusion Transformer and explicitly accepts "text, images, video clips, masks, and audio references" as conditioning inputs. (arXiv 2602.21818)
That last detail is the bridge: SkyReels V4 takes images as conditioning input, so a Nano Banana still can become the seed of a SkyReels V4 motion clip.
The workflow at a glance
| Stage | Model | Input | Output |
|---|---|---|---|
| 1. Base image | Nano Banana / Nano Banana Pro / Nano Banana 2 | Text prompt, optional reference images | Still image (1K–4K, fourteen aspect ratios) |
| 2. First frame conditioning | SkyReels V4 | The still as first_frame_image | (handed to next step) |
| 3. Motion description | SkyReels V4 | Text describing camera and subject motion | (handed to next step) |
| 4. Optional end frame | SkyReels V4 | A second still as end_frame_image | Final video clip up to 15s, 1080p, 32 FPS |
The whole thing is four conceptual steps. The hard part is the prompt craft on each side.
Step 1: Generate the base image in Nano Banana
Pick the Nano Banana variant that matches your output need:
- Nano Banana 2 (Gemini 3.1 Flash Image): fastest, supports 14 aspect ratios including 1:1, 9:16, 16:9, 21:9, and ratios as extreme as 8:1 or 1:8 for banner work. Up to 4 character reference images and 10 object references in a single prompt. (Google AI for Developers docs)
- Nano Banana Pro (Gemini 3 Pro Image): native 4K output, up to 5 characters and 14 object inputs per workflow. Use this when the first frame needs to look gallery-sharp on a 4K screen. (Google DeepMind)
- Nano Banana (the original Gemini 2.5 Flash Image): still available, lower per-image cost for iterative drafting.
Two practical tips when generating a still you intend to animate:
- Match the aspect ratio to your final video target. If you plan to animate at 16:9, generate the still at 16:9. SkyReels V4 uses the first frame as a literal frame of the video, so ratio mismatch becomes letterboxing or crop loss.
- Describe the scene rather than list keywords. Google's own image-generation docs note that "the model's core strength is its deep language understanding." (Google AI for Developers) Scene-level prose gives the next stage more anchor points to describe motion against.
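Since a ratio mismatch at this stage costs video credits later, it can be worth a quick programmatic check before moving to step 2. Below is a minimal sketch of such a check; the helper names are our own, not part of either API.

```python
# Verify a Nano Banana still matches the intended video aspect ratio
# before spending SkyReels V4 generation credits.
from fractions import Fraction

def parse_ratio(ratio: str) -> Fraction:
    """Turn a 'W:H' string like '16:9' into an exact fraction."""
    w, h = ratio.split(":")
    return Fraction(int(w), int(h))

def still_matches_target(width: int, height: int, target: str,
                         tolerance: float = 0.01) -> bool:
    """True if the still's pixel ratio is within `tolerance` of the target."""
    actual = width / height
    wanted = float(parse_ratio(target))
    return abs(actual - wanted) / wanted <= tolerance

# A 1920x1080 still is fine for a 16:9 clip; a square one is not.
assert still_matches_target(1920, 1080, "16:9")
assert not still_matches_target(1024, 1024, "16:9")
```

The tolerance exists because some resolutions (1080p at 16:9, for example) round cleanly, while others are off by a pixel or two.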
Step 2: Feed the still into SkyReels V4 as the first frame
SkyReels V4's image-to-video task is, per the paper, a special case of the same channel-concatenation formulation it uses for text-to-video, video extension, and editing. The first frame is conditioned, and the rest of the frames are generated from text and (optionally) audio. (arXiv 2602.21818)
Practically, that means the call shape looks something like:
```json
{
  "model": "skyreels-v4",
  "prompt": "<motion description>",
  "first_frame_image": "<URL or base64 of the Nano Banana still>",
  "duration_seconds": 5,
  "resolution": "1080p",
  "fps": 32
}
```

Parameter names will vary by API host, but the underlying mechanism is the same. The first frame locks the opening composition; from there, the model interprets the motion from the text prompt.
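The call shape above can be wrapped in a small builder that enforces the documented caps before a request ever goes out. This is a sketch only: the field names follow the example JSON, and any real host may name or validate them differently.

```python
# Build a first-frame-conditioned SkyReels V4 request body, rejecting
# values outside the documented caps (15 s, 32 FPS) before submission.
import json

MAX_DURATION_S = 15   # per-generation cap from the SkyReels V4 paper
MAX_FPS = 32

def build_i2v_request(prompt: str, first_frame: str,
                      duration_s: int = 5, fps: int = 32,
                      resolution: str = "1080p") -> str:
    """Return a JSON request body for image-to-video generation."""
    if not 1 <= duration_s <= MAX_DURATION_S:
        raise ValueError(f"duration must be 1-{MAX_DURATION_S} seconds")
    if fps > MAX_FPS:
        raise ValueError(f"fps is capped at {MAX_FPS}")
    return json.dumps({
        "model": "skyreels-v4",
        "prompt": prompt,
        "first_frame_image": first_frame,   # URL or base64 still
        "duration_seconds": duration_s,
        "resolution": resolution,
        "fps": fps,
    })

req = build_i2v_request("slow push-in toward the subject's face",
                        "https://example.com/still.png")
assert json.loads(req)["duration_seconds"] == 5
```

Failing fast on the duration cap locally is cheaper than letting the host reject (or silently truncate) an over-long request.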
Step 3: Describe the motion, not the scene
This is the step that catches most people. With image prompts you describe what is in the frame. With image-to-video prompts you describe what changes between frames.
Useful patterns:
- Camera motion: "slow push-in toward the subject's face," "lateral truck left," "dolly out to reveal the room."
- Subject motion: "the woman turns her head and smiles," "petals begin to fall around her," "steam rises from the cup."
- Atmospheric motion: "snow falling in soft gusts," "neon signs flicker," "headlights pass through the background."
- Pacing modifiers: "subtle," "gentle," "rapid," "in slow motion." SkyReels V4 supports up to 32 FPS at 1080p, so slow-motion specs have headroom. (arXiv 2602.21818)
Avoid re-describing the scene. If the still already shows a red coat and a snowy crosswalk, the motion prompt does not need to repeat them. Save tokens for what actually moves.
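One way to internalize the discipline of motion-only prompting is to compose the prompt from the four categories above and nothing else. The helper below is purely illustrative structure, not part of any SkyReels API.

```python
# Assemble a motion-only prompt from the pattern categories above,
# keeping scene description out of the prompt entirely.
def motion_prompt(camera: str = "", subject: str = "",
                  atmosphere: str = "", pacing: str = "") -> str:
    """Join the non-empty motion clauses into one comma-separated prompt."""
    clauses = [c for c in (pacing, camera, subject, atmosphere) if c]
    if not clauses:
        raise ValueError("describe at least one kind of motion")
    return ", ".join(clauses)

p = motion_prompt(camera="slow push-in toward the subject's face",
                  atmosphere="snow falling in soft gusts",
                  pacing="subtle")
assert p == ("subtle, slow push-in toward the subject's face, "
             "snow falling in soft gusts")
```

Putting the pacing modifier first mirrors how most of the example prompts in this guide read, but the ordering is a stylistic choice, not a model requirement.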
Step 4 (optional): End frame for controlled transitions
SkyReels V4's unified framework supports controlling both the start and end frames. You provide a first frame, an end frame, and a motion description, and the model interpolates a coherent path between them. Users report this as the cleanest way to do scene transitions, time-of-day shifts, or before/after product reveals without manual cutting.
A typical pattern:
- Generate still A in Nano Banana: "A young woman in a red coat at the crosswalk, daylight."
- Generate still B: same character and composition, but "...at night under neon."
- Pass A as first_frame_image, B as end_frame_image, and a motion prompt like "the scene transitions from day to night over five seconds, camera holds steady."
Because Nano Banana Pro maintains explicit character consistency across up to five characters in a single workflow, you can keep the subject identical across A and B without identity drift. (Google DeepMind) Without that character lock, the end frame would drift in ways the interpolation cannot fix.
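The first-plus-end-frame call is the same shape as the single-frame one, with one extra field. As before, this is a sketch: the field names mirror the earlier example JSON, and your API host's documentation is the authority on the real ones.

```python
# Build a request body that pins both endpoints of the clip, for
# controlled transitions such as the day-to-night example above.
import json

def build_transition_request(prompt: str, first_frame: str,
                             end_frame: str, duration_s: int = 5) -> str:
    """Request body for a first-frame plus end-frame generation."""
    if not 1 <= duration_s <= 15:       # SkyReels V4 duration cap
        raise ValueError("duration must be 1-15 seconds")
    return json.dumps({
        "model": "skyreels-v4",
        "prompt": prompt,
        "first_frame_image": first_frame,   # still A: daylight
        "end_frame_image": end_frame,       # still B: night under neon
        "duration_seconds": duration_s,
        "resolution": "1080p",
        "fps": 32,
    })

req = build_transition_request(
    "the scene transitions from day to night over five seconds, "
    "camera holds steady",
    "https://example.com/still_day.png",
    "https://example.com/still_night.png")
assert "end_frame_image" in json.loads(req)
```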
Advanced options inside SkyReels V4
Two capabilities are worth knowing about even if you do not need them in the basic workflow.
Multi-modal references beyond the first frame. The paper lists "text, images, video clips, masks, and audio references" as supported conditioning inputs. (arXiv 2602.21818) Some hosts surface this as an asset-tagging interface where references can be attached to specific subjects in the prompt, often using an @reference style notation. The exact UX varies by integrator; the underlying model accepts the references regardless of the syntactic wrapper.
Audio-driven digital humans. Because the dual-stream MMDiT generates video and audio jointly, SkyReels V4 supports audio-conditioned generation: feed it a voice clip and a portrait, and the output is a lip-synced talking-head clip. (arXiv 2602.21818) For a single-image workflow, this means a Nano Banana portrait plus a voice clip can produce a speaking avatar, not just a moving still.
Limitations and best practices
A few constraints to plan around, sourced from the SkyReels V4 paper and Google's image-generation documentation.
Duration cap. SkyReels V4 maxes out at 15 seconds per generation. (arXiv 2602.21818) For longer pieces, generate multiple clips and stitch them, typically using the last frame of clip N as the first frame of clip N+1.
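The stitching pattern can be planned up front. The sketch below splits a target runtime into clips within the 15-second cap and records the last-frame handoff for each one; the planner is our own helper, and the frame labels are placeholders, not API values.

```python
# Plan a longer piece as a chain of <=15 s generations, where each clip
# after the first is seeded with the previous clip's last frame.
MAX_CLIP_S = 15  # per-generation cap from the SkyReels V4 paper

def plan_clips(total_seconds: int, max_clip: int = MAX_CLIP_S) -> list[dict]:
    """Split `total_seconds` into clip specs with first-frame handoffs."""
    clips, start = [], 0
    while start < total_seconds:
        length = min(max_clip, total_seconds - start)
        clips.append({
            "index": len(clips),
            "duration_s": length,
            # Clip 0 uses the Nano Banana still; later clips reuse the
            # last frame of the previous generation.
            "first_frame": "base_still" if not clips
                           else f"last_frame_of_clip_{len(clips) - 1}",
        })
        start += length
    return clips

plan = plan_clips(40)
assert [c["duration_s"] for c in plan] == [15, 15, 10]
assert plan[1]["first_frame"] == "last_frame_of_clip_0"
```

Note that each handoff is a fresh generation, so identity drift can compound across the chain; the anchor-frame advice under "Identity drift" below applies per clip.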
Resolution ceiling. 1080p is the native ceiling. A 4K still from Nano Banana Pro will be down-sampled before generation. Generating the still at 1080p in the first place saves cost.
Aspect ratio matching. Mismatched ratios cause crop or letterbox loss. Pick the target ratio first, generate the still at that ratio, then animate.
Motion specificity beats motion intensity. Vague prompts ("make it dynamic") drift. Specific prompts ("camera pushes in 20% over four seconds, subject blinks once") tend to produce the intended motion.
Identity drift across long animations. Even with a locked first frame, identity can drift over 10–15 seconds. If identity must hold, the end-frame technique above plus a shorter clip (5–8 seconds) is the safer pattern. For longer narratives, generate as multiple clips with consistent Nano Banana stills as anchor frames.
Where to run each piece
You can run the two halves of this workflow in their native homes: Nano Banana family in the Gemini app, Google AI Studio, or via the Gemini API; SkyReels V4 via Skywork AI's hosted preview or compatible inference partners.
Or you can run both from one place. The studios at gptimg.co/nano-banana, gptimg.co/nano-banana-pro, and gptimg.co/nano-banana-2 wrap each Google image variant, and gptimg.co/skyreels wraps SkyReels V4 image-to-video. The advantage of staying in one place is that the still you generate in step 1 is already available as an input to step 2, with no download, re-upload, or format-juggling between hosts.
Frequently asked questions
Can Nano Banana generate video by itself?
No. The Nano Banana family generates and edits still images only. (Google AI for Developers) For motion, the still has to be passed to a separate video model such as SkyReels V4, Veo, or another image-to-video system.
First frame only vs. first plus end frame in SkyReels V4?
First-frame-only locks the opening composition and lets the model interpret motion from text. First-plus-end-frame locks both ends and forces the model to interpolate a coherent path between them, which is the cleanest way to produce controlled transitions like day to night or before/after.
How long can the animated clip be?
SkyReels V4 supports up to 15 seconds per generation at up to 1080p and 32 FPS. (arXiv 2602.21818) For longer outputs, generate multiple clips and chain them by using the last frame of one as the first frame of the next.
Can SkyReels V4 generate audio along with the video?
Yes. The dual-stream Multimodal Diffusion Transformer architecture jointly generates video and temporally aligned audio in a single pass. (arXiv 2602.21818) For an audio-driven digital human, pass an audio reference alongside the portrait still.
Sources
- SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model, Skywork AI, arXiv preprint 2602.21818
- Build with Nano Banana 2, Google blog announcement
- Nano Banana image generation, Google AI for Developers documentation
- Gemini 3 Pro Image (Nano Banana Pro), Google DeepMind product page
Last reviewed against source pages: 2026-04-18. Model capabilities and pricing change; confirm in the linked sources before acting on the figures above.