Streaming video generation · Live now

Vidu S1 API: Build Real-Time AI Digital Humans That See, Hear and Respond

Vidu S1 is a commercial-grade streaming video generation model for live, bidirectional voice and video conversations. Give your users an AI character that performs, perceives emotion and keeps them company — through one clean API.

1,000 free trial credits for new users · No SDK lock-in on the model side

  • 2h+ of continuous generation with zero quality loss
  • 50+ preset voices, from warm to cinematic
  • 28 languages supported by every voice
  • 1,000 free trial credits for new users

About Vidu S1

What Is Vidu S1?

Vidu S1 is a streaming video generation model built for real-time interactive digital humans. Unlike text-to-video models that render clips offline, Vidu S1 generates live video while the conversation happens: your user speaks, the character sees and hears them, and answers in quasi real time — with expression, voice and personality.

The Vidu S1 API wraps that capability in a simple developer workflow: create a session over HTTP, stream audio and video through AliRTC, and control everything over a WebSocket. From AI companions to live-commerce hosts, teams use the Vidu S1 API to ship production-grade digital humans in days instead of months.

Why Vidu S1

Vidu S1: The First Commercial-Grade Interactive Digital Character

Not a pre-rendered talking head. A generative video character that interacts, performs and perceives — in quasi real time.

Commercial-Grade Interaction

The first production-ready digital character with bidirectional perception: it interacts, performs and reacts to what it sees and hears from your users.

Unlimited Interactive Duration

Billed as the world's first generative video technology for unlimited-length interaction: the model sustains 1 minute to 2 hours of continuous generation without quality degradation, while each API session is capped at 600 seconds.

Quasi Real-Time Response

Industry-leading inference speed with strong instruction following and semantic understanding, enabling natural cross-screen conversation with minimal delay.

Personas with Memory

Define any initial persona — real human, anime character or cute pet. Short-term memory keeps conversations personal, consistent and warm.

Multimodal Perception

Voice, text and video input in one session. The character accurately picks up on the user's appearance, expression and emotional state.

High-Resolution Output

High-quality real-time interactive video generation, ready for consumer-facing products in social, e-commerce, gaming and education.

Generational Leap

Pre-Rendered Avatars vs. Streaming Generation

Traditional digital-human pipelines play back rendered clips. Vidu S1 generates live video as the conversation happens.

Traditional pipeline

Pre-rendered digital humans

  • Minutes of offline rendering before playback
  • Short, fixed clips stitched together
  • One-way broadcast — no real conversation
  • Blind: no awareness of the user at all
  • Fixed scripts, identical for every viewer

Vidu S1

Vidu S1 streaming generation

  • Quasi real-time streaming inference
  • 1 minute to 2 hours of continuous video
  • Bidirectional live voice + video conversation
  • Sees user appearance, expression and emotion
  • Custom persona with short-term memory

Capability     | Traditional pipeline        | Vidu S1 API
Latency        | Minutes (offline rendering) | Quasi real-time streaming
Session length | Seconds-long fixed clips    | 1 min – 2 h continuous, no quality loss
Interaction    | One-way playback            | Two-way voice + video dialogue
Perception     | None                        | User appearance & emotion recognition
Personality    | Fixed script                | Custom persona + short-term memory

Integration

Integrate the Vidu S1 API in 6 Steps

Three channels power every session: HTTP for session management, AliRTC for audio/video transport, WebSocket for control signaling.

1. Create a Session

One POST call with your character's persona, avatar image and voice returns a session ID plus RTC credentials.
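
The create-session call for this step could be assembled roughly as follows. This is a sketch, not an official SDK: the interface and function names are illustrative, the `Content-Type` header and the `"audio"` value for `call_mode` are assumptions, and the response shape is left out because the page only says it contains a session ID and RTC credentials.

```typescript
// Illustrative types: field names mirror the request shown on this page.
interface Avatar {
  persona: string;   // free-form persona prompt
  image_uri: string; // avatar image URL
  name: string;
  voice: string;     // preset voice name, e.g. "Tina"
}

interface CreateSessionRequest {
  method: "POST";
  url: string;
  headers: { Authorization: string; "Content-Type": string };
  body: string; // JSON-encoded payload
}

// Builds the request without sending it, so it can be passed to any HTTP client.
function buildCreateSessionRequest(
  host: string,
  apiKey: string,
  callMode: "audio" | "video", // "audio" is an assumed value; the page only shows "video"
  avatar: Avatar
): CreateSessionRequest {
  return {
    method: "POST",
    url: "https://" + host + "/live/v1/lives",
    headers: {
      Authorization: "Token " + apiKey,
      "Content-Type": "application/json", // assumed; not stated on the page
    },
    body: JSON.stringify({ call_mode: callMode, avatar: avatar }),
  };
}

const createRequest = buildCreateSessionRequest("api.vidu.com", "vda_xxx", "video", {
  persona: "A friendly agent...",
  image_uri: "https://your-avatar.png",
  name: "Mia",
  voice: "Tina",
});
```

Keeping the builder free of network calls makes it easy to unit-test and to reuse with whichever HTTP client the backend already uses.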

POST https://api.vidu.com/live/v1/lives
Authorization: Token vda_xxx

{ "call_mode": "video",
  "avatar": {
    "persona": "A friendly agent...",
    "image_uri": "https://your-avatar.png",
    "name": "Mia", "voice": "Tina" } }

2. Join the RTC Channel

Join the AliRTC channel with the returned token, publish your user's microphone (and camera in video mode), then subscribe to the character's stream.

await aliRtc.joinChannel(rtc.token, rtc.user_id);
await aliRtc.publishLocalAudioStream(true);
await aliRtc.publishLocalVideoStream(true);
// subscribe: live-bot-{creatorID}-{liveID}

3. Open the WebSocket

Connect the persistent control channel. Authentication goes in the query string — browsers can't set custom headers on WebSockets.
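
A small sketch of how the control URL and the init message for this step might be built. The helper names are hypothetical; the path, query parameters and message fields are taken from the example shown for this step. `encodeURIComponent` is used so the space in the token becomes `%20`.

```typescript
// Builds the WebSocket URL with the token in the query string.
function buildControlUrl(host: string, liveId: string, apiKey: string): string {
  const auth = encodeURIComponent("Token " + apiKey); // space becomes %20
  return (
    "wss://" + host + "/live/ws/live/connect" +
    "?live_id=" + encodeURIComponent(liveId) +
    "&authorization=" + auth
  );
}

// Builds the conn_init message (type 1) to send right after the socket opens.
function buildConnInit(seqId: number): string {
  return JSON.stringify({
    type: 1,
    seq_id: seqId,
    payload: { conn_init: { version: 1 } },
  });
}

const controlUrl = buildControlUrl("api.vidu.com", "abc123", "vda_xxx");
const connInit = buildConnInit(1);
```

Because the token travels in the URL, it is worth keeping this URL out of client-side logs and analytics.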

wss://api.vidu.com/live/ws/live/connect
  ?live_id={live_id}&authorization=Token%20vda_xxx

{ "type": 1, "seq_id": 1,
  "payload": { "conn_init": { "version": 1 } } }

4. Wait Until Ready

A success ack means the character is live. NOT_READY is normal in video mode — reconnect with exponential backoff (2s → 4s → 8s).

{ "type": 2, "payload": {
    "conn_init_ack": { "success": true } } }

// NOT_READY? retry with backoff: 2s -> 4s -> 8s
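
The retry policy can be sketched as a pure decision function. The outcome names `NOT_READY` and `LIVE_CONN_INIT_FAILED` come from this page; how they appear inside the ack payload is not shown, so the caller is assumed to have already mapped the ack to one of these outcomes. The cap of three retries is also an assumption, chosen to match the 2s → 4s → 8s sequence.

```typescript
type InitOutcome = "READY" | "NOT_READY" | "LIVE_CONN_INIT_FAILED";

type NextStep =
  | { kind: "proceed" }
  | { kind: "retry"; delayMs: number }
  | { kind: "new_session" }
  | { kind: "give_up" };

const BASE_DELAY_MS = 2000; // first retry waits 2 s
const MAX_RETRIES = 3;      // assumption: 2 s, 4 s, 8 s, then stop

// Delay doubles with every retry: 2000, 4000, 8000, ...
function backoffDelayMs(retryIndex: number): number {
  return BASE_DELAY_MS * Math.pow(2, retryIndex);
}

// Decides what to do after an init attempt, given how many retries were already made.
function nextStep(outcome: InitOutcome, retriesSoFar: number): NextStep {
  if (outcome === "READY") return { kind: "proceed" };
  if (outcome === "LIVE_CONN_INIT_FAILED") return { kind: "new_session" }; // permanent failure
  if (retriesSoFar >= MAX_RETRIES) return { kind: "give_up" };
  return { kind: "retry", delayMs: backoffDelayMs(retriesSoFar) };
}
```

On `retry`, the client would close the socket, wait `delayMs`, reconnect and resend the init message; on `new_session`, it would go back to step 1.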

5. Keep the Session Alive

The server pings every 5 seconds; respond to each ping within 15 seconds. Listen for forced-disconnect messages (type 6) and handle each hangup reason.

// server pings every 5s — respond within 15s
{ "type": 6, "payload": { "hangup":
    { "hangup_reason": "credit_insufficient" } } }
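
A sketch of handling forced-disconnect messages. The page does not show the ping or reply format, so heartbeats are left out here; only the type 6 message shape shown in this step is used. `credit_insufficient` is the one server-side reason documented on this page, so every other reason falls through to a generic branch. Function names are illustrative.

```typescript
interface ServerHangup {
  reason: string;
}

// Returns null for malformed JSON instead of throwing.
function safeParse(raw: string): any {
  try {
    return JSON.parse(raw);
  } catch {
    return null;
  }
}

// Extracts the hangup reason from a type 6 message; returns null for anything else.
function parseServerHangup(raw: string): ServerHangup | null {
  const msg = safeParse(raw);
  if (msg === null || typeof msg !== "object" || msg.type !== 6) return null;
  const reason = msg.payload && msg.payload.hangup && msg.payload.hangup.hangup_reason;
  return { reason: typeof reason === "string" ? reason : "unknown" };
}

// Maps a reason to a user-facing message; extend as more reasons are encountered.
function describeHangup(reason: string): string {
  switch (reason) {
    case "credit_insufficient":
      return "Balance ran out: top up before starting a new session.";
    default:
      return "Session closed by the server (" + reason + ").";
  }
}
```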

6. Hang Up & Query Billing

Send the hangup message, close the WebSocket, leave the RTC channel — then query the final status and billed seconds.

{ "type": 5, "seq_id": 2,
  "payload": { "hangup":
    { "hangup_reason": "user_end" } } }

GET /live/v1/lives/{live_id}
// response includes: "billed_seconds": "87"
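
The teardown step might look like this in client code. The helper names are hypothetical, and the parser assumes `billed_seconds` sits at the top level of the query response as a numeric string, which is how it appears in the example; the real response may nest it differently.

```typescript
// Builds the client hangup message (type 5) with the user_end reason.
function buildUserHangup(seqId: number): string {
  return JSON.stringify({
    type: 5,
    seq_id: seqId,
    payload: { hangup: { hangup_reason: "user_end" } },
  });
}

// Path for the status and billing query.
function sessionQueryPath(liveId: string): string {
  return "/live/v1/lives/" + encodeURIComponent(liveId);
}

// Reads billed_seconds (a string such as "87") and returns it as a number.
function parseBilledSeconds(responseBody: string): number {
  const data = JSON.parse(responseBody);
  const seconds = Number(data.billed_seconds);
  if (!isFinite(seconds) || seconds < 0) {
    throw new Error("unexpected billed_seconds value");
  }
  return seconds;
}
```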

API Surface at a Glance

A compact, predictable API. Hosts: api.vidu.cn (China) and api.vidu.com (international).

Method | Path                     | Purpose
POST   | /live/v1/lives           | Create a digital character session
GET    | /live/v1/lives/{live_id} | Query session status and billing
WSS    | /live/ws/live/connect    | Control signaling (init / hangup)
POST   | /live/v1/voices/clone    | Create a cloned custom voice
GET    | /live/v1/voices          | List system and custom voices
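
One way to keep these routes in a single place on the client is a small typed map. The paths and hosts are taken from this page; the `Region` labels, key names and the `endpointUrl` helper are illustrative.

```typescript
type Region = "cn" | "intl";

const HOSTS: Record<Region, string> = {
  cn: "api.vidu.cn",    // mainland China
  intl: "api.vidu.com", // international
};

const ENDPOINTS = {
  createSession: { method: "POST", path: "/live/v1/lives" },
  querySession: { method: "GET", path: "/live/v1/lives/{live_id}" },
  control: { method: "WSS", path: "/live/ws/live/connect" },
  cloneVoice: { method: "POST", path: "/live/v1/voices/clone" },
  listVoices: { method: "GET", path: "/live/v1/voices" },
} as const;

type EndpointName = keyof typeof ENDPOINTS;

// Resolves a full URL, filling in {live_id} where the path needs it.
function endpointUrl(region: Region, name: EndpointName, liveId?: string): string {
  const ep = ENDPOINTS[name];
  const scheme = ep.method === "WSS" ? "wss" : "https";
  let path: string = ep.path;
  if (path.indexOf("{live_id}") !== -1) {
    if (!liveId) throw new Error(name + " requires a live_id");
    path = path.replace("{live_id}", encodeURIComponent(liveId));
  }
  return scheme + "://" + HOSTS[region] + path;
}
```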

HTTP API

Create and query sessions. Simple token auth with your API key.

AliRTC Channel

All real-time audio and video flows through AliRTC — not HTTP. One SDK integration on the client.

WebSocket Signaling

A lightweight control channel for readiness, heartbeats and hangup events.

Session Lifecycle

Four States, Fully Observable

Every session follows a predictable state machine — easy to monitor, easy to bill, easy to debug.

1. waiting

Session created, room open, character warming up

2. on_live

Both ends ready — conversation and billing begin

3. ending

Hangup received, session closing gracefully

4. ended

Finished — query billed seconds any time
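
The four states can be modeled on the client as a small transition table. The state names come from this page; the allowed transitions are an assumption (in particular, that a session can close before it ever goes live), so treat this as a monitoring aid rather than a statement of server behavior.

```typescript
type SessionState = "waiting" | "on_live" | "ending" | "ended";

// Assumed transitions: states only move forward, and a session may end before going live.
const NEXT_STATES: Record<SessionState, SessionState[]> = {
  waiting: ["on_live", "ending", "ended"],
  on_live: ["ending", "ended"],
  ending: ["ended"],
  ended: [],
};

function canTransition(from: SessionState, to: SessionState): boolean {
  return NEXT_STATES[from].indexOf(to) !== -1;
}

// A terminal state has no outgoing transitions.
function isTerminal(state: SessionState): boolean {
  return NEXT_STATES[state].length === 0;
}
```

A dashboard could log any observed transition for which `canTransition` returns false, which is a cheap way to catch polling bugs or unexpected server behavior.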

Use Cases

Where Teams Deploy Vidu S1

Six industries are already putting interactive digital characters in front of real users.

AI Companionship

Always-on characters with persona and memory that chat face-to-face, react to moods and build long-term bonds.

Virtual Idols

Anime or realistic idols that host live shows, take fan questions and perform for hours without breaks.

Training & Education

Tutors and trainers that explain, demonstrate and adapt to each learner's questions in real time.

AI Customer Service

A friendly face for support: perceives frustration, answers naturally and hands off smoothly when needed.

Live-Stream Commerce

Digital hosts that present products around the clock and answer buyer questions the moment they're asked.

Interactive Entertainment

Playable characters and shadow-play experiences where the story reacts to the player's voice and face.

Voice Library

50+ Voices, One Parameter Away

Every voice speaks 28 languages. Swap personalities with a single field — or clone your own.

Tina

Sweet and warm — solves problems without hesitation (default)

Serena

Gentle and warm

Harvey

Deep and mellow, aged like coffee and old books

Maia

A blend of intellect and warmth

Jennifer

Premium American female voice, cinematic quality

Aiden

American college guy who loves cooking

Mione

Mature, intellectual British girl-next-door

Sohee

Warm and expressive Korean older sister

Ono Anna

Mischievous childhood friend from Japan

Emilien

Romantic French big brother

Sonrisa

Warm, enthusiastic Latin American energy

Kiki

Sweet Hong Kong girl, native Cantonese

🌍 28 Languages Out of the Box

Chinese, English, Japanese, Korean, French, German, Spanish, Portuguese, Russian, Arabic, Hindi, Thai, Vietnamese, Indonesian, Turkish and more — plus regional dialects like Cantonese, Sichuanese, Hokkien and Taiwanese Mandarin.

🧬 Voice Cloning API

Need a brand voice or a specific person's timbre? Create custom cloned voices with POST /live/v1/voices/clone, then list them alongside system voices with GET /live/v1/voices.

Pricing

Transparent, Usage-Based Pricing

Pay only for live conversation time. Audio and video modes cost exactly the same.

Free Trial

1,000 credits

For every new user — enough for about 11 minutes of live interaction.

  • Full API access, no feature gates
  • All 50+ voices and 28 languages
  • Audio and video call modes
  • Custom persona and avatar image

Enterprise

Custom

Tailored solutions for social, e-commerce, gaming and education platforms.

  • Dedicated account manager
  • Custom character and persona design
  • Voice cloning onboarding support
  • Architecture review for your scenario

Credit unit price: 0.03125. A session auto-disconnects when the maximum duration (600 s) is reached; when the balance hits zero the server closes the connection automatically.

FAQ

Vidu S1 API — Frequently Asked Questions

The details engineers actually ask about before integrating.

Vidu S1 is a commercial-grade streaming video generation model for real-time interactive digital humans. Through the Vidu S1 API, developers create live sessions in which an AI character sees, hears and talks with users — with unlimited-duration generation, 50+ voices and 28 languages.
Billing starts the moment the digital character becomes ready and the session enters on_live, which is exactly when conn_init_ack.success returns true. The rate is 3 credits per 2 seconds (1.5 credits per second). Credits are deducted every 6 seconds, and billed time is rounded up to the next 2-second interval. Audio and video modes cost the same.
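
The billing rule works out to simple arithmetic, sketched below. The rate, the 2-second rounding and the unit price of 0.03125 come from this page; the function names are illustrative, the currency of the unit price is not stated, and the 6-second deduction cadence is not modeled because it does not change the total.

```typescript
const CREDITS_PER_INTERVAL = 3; // 3 credits ...
const INTERVAL_SECONDS = 2;     // ... per 2 seconds of live time
const CREDIT_UNIT_PRICE = 0.03125; // currency not stated on the page

// Credits charged for a given live duration, rounded up to the next 2-second interval.
function creditsForSeconds(liveSeconds: number): number {
  if (liveSeconds <= 0) return 0;
  return Math.ceil(liveSeconds / INTERVAL_SECONDS) * CREDITS_PER_INTERVAL;
}

// Whole seconds of live time a balance can pay for.
function secondsForCredits(credits: number): number {
  if (credits <= 0) return 0;
  return Math.floor(credits / CREDITS_PER_INTERVAL) * INTERVAL_SECONDS;
}

// Price of a number of credits at the listed unit price.
function creditCost(credits: number): number {
  return credits * CREDIT_UNIT_PRICE;
}
```

Under these rules an 87-second call costs 132 credits, a full 600-second session costs 900 credits, the 1,000 trial credits cover 666 seconds (about 11 minutes), and the 45-credit minimum balance corresponds to 30 seconds of live time.
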
Not everything runs over HTTP. HTTP is used only to create and query sessions. Real-time audio and video are transmitted through the AliRTC channel (a separate SDK integration), and session control runs over a WebSocket signaling connection. All three channels together make one live session.
NOT_READY is expected in video mode — the character side is still preparing. Close the connection, wait briefly, reconnect and resend the init message, using exponential backoff (2s → 4s → 8s). If you receive LIVE_CONN_INIT_FAILED instead, that's permanent: create a new session.
The maximum session duration is 600 seconds; the server auto-disconnects when it's reached. For longer experiences, create a new session and reconnect — the underlying model itself supports continuous generation from 1 minute up to 2 hours without quality loss.
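
For an experience longer than one session, the number and length of sessions can be planned up front. This sketch assumes sessions are simply run back to back; the page does not say whether short-term memory carries over between sessions, so that is left to the caller (for example, by restating context in the persona prompt). The function name is illustrative.

```typescript
const MAX_SESSION_SECONDS = 600; // server-side cap per session

// Splits a target duration into consecutive session lengths of at most 600 s each.
function planSessions(totalSeconds: number): number[] {
  const segments: number[] = [];
  let remaining = Math.max(0, Math.floor(totalSeconds));
  while (remaining > 0) {
    const length = Math.min(remaining, MAX_SESSION_SECONDS);
    segments.push(length);
    remaining -= length;
  }
  return segments;
}
```

A 2-hour experience (7,200 seconds) needs 12 sessions under this plan, and a 25-minute one splits into 600, 600 and 300 seconds.
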
When the balance reaches zero during a session, the server automatically closes the connection with a credit_insufficient hangup reason. Each new session also requires a minimum balance of 45 credits to start, so top up before going live with real users.
The voice library has 50+ preset voices, each supporting 28 languages including English, Chinese, Japanese, Korean, French, German, Spanish, Portuguese, Russian, Arabic and Hindi. Regional dialect voices (Cantonese, Sichuanese, Hokkien, Taiwanese Mandarin and more) are also available, and you can clone custom voices via the API.
Use api.vidu.cn for mainland China deployments and api.vidu.com for international ones. Authentication is a simple header: Authorization: Token vda_xxx. For WebSocket connections, pass the token in the authorization query parameter instead, since browsers can't set custom WebSocket headers.
The avatar is a single image with one person: full-body or half-body, in any style (photoreal, anime, pet). PNG, JPG, JPEG or WEBP files up to 50 MB are accepted, passed as a URL or Base64. Combined with a free-form persona prompt, the image defines how your character looks and behaves.
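
A client-side pre-check can catch format and size problems before a session is created. This is a sketch with an illustrative function name; it checks the file extension only (not the actual image content or the one-person rule) and assumes "50 MB" means 50 × 1024 × 1024 bytes.

```typescript
const ALLOWED_EXTENSIONS = ["png", "jpg", "jpeg", "webp"];
const MAX_AVATAR_BYTES = 50 * 1024 * 1024; // assumption: 50 MB read as 50 MiB

// Returns a list of problems; an empty list means the file passes these basic checks.
function checkAvatarFile(fileName: string, sizeBytes: number): string[] {
  const problems: string[] = [];
  const dot = fileName.lastIndexOf(".");
  const ext = dot === -1 ? "" : fileName.slice(dot + 1).toLowerCase();
  if (ALLOWED_EXTENSIONS.indexOf(ext) === -1) {
    problems.push("unsupported format: " + (ext || "none"));
  }
  if (sizeBytes > MAX_AVATAR_BYTES) {
    problems.push("file exceeds 50 MB");
  }
  return problems;
}
```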

Put a Living, Breathing AI Character in Your Product

Get your API key, spend your 1,000 free credits, and have a real-time digital human talking to users this week.

Or get your API key instantly at apimart.ai