Vidu S1 API: Build Real-Time AI Digital Humans That See, Hear and Respond
Vidu S1 is a commercial-grade streaming video generation model for live, bidirectional voice and video conversations. Give your users an AI character that performs, perceives emotion and keeps them company — through one clean API.
1,000 free trial credits for new users · No SDK lock-in on the model side
What Is Vidu S1?
Vidu S1 is a streaming video generation model built for real-time interactive digital humans. Unlike text-to-video models that render clips offline, Vidu S1 generates live video while the conversation happens: your user speaks, the character sees and hears them, and answers in quasi real time — with expression, voice and personality.
The Vidu S1 API wraps that capability in a simple developer workflow: create a session over HTTP, stream audio and video through AliRTC, and control everything over a WebSocket. From AI companions to live-commerce hosts, teams use the Vidu S1 API to ship production-grade digital humans in days instead of months.
Vidu S1: The First Commercial-Grade Interactive Digital Character
Not a pre-rendered talking head. A generative video character that interacts, performs and perceives — in quasi real time.
Commercial-Grade Interaction
The first production-ready digital character with bidirectional perception: it interacts, performs and reacts to what it sees and hears from your users.
Unlimited Interactive Duration
The world's first generative video technology built for open-ended interaction: from 1 minute to 2 hours of continuous generation without quality degradation. Through the API, individual sessions run up to 600 s and are auto-renewable.
Quasi Real-Time Response
Industry-leading inference speed with strong instruction following and semantic understanding, enabling natural cross-screen conversation with minimal delay.
Personas with Memory
Define any initial persona — real human, anime character or cute pet. Short-term memory keeps conversations personal, consistent and warm.
Multimodal Perception
Voice, text and video input in one session. The character accurately picks up on the user's appearance, expression and emotional state.
High-Resolution Output
High-quality real-time interactive video generation, ready for consumer-facing products in social, e-commerce, gaming and education.
Pre-Rendered Avatars vs. Streaming Generation
Traditional digital-human pipelines play back rendered clips. Vidu S1 generates live video as the conversation happens.
Pre-rendered digital humans
- Minutes of offline rendering before playback
- Short, fixed clips stitched together
- One-way broadcast — no real conversation
- Blind: no awareness of the user at all
- Fixed scripts, identical for every viewer
Vidu S1 streaming generation
- Quasi real-time streaming inference
- 1 minute to 2 hours of continuous video
- Bidirectional live voice + video conversation
- Sees user appearance, expression and emotion
- Custom persona with short-term memory
| Capability | Traditional pipeline | Vidu S1 API |
|---|---|---|
| Latency | Minutes (offline rendering) | Quasi real-time streaming |
| Session length | Seconds-long fixed clips | 1 min – 2 h continuous, no quality loss |
| Interaction | One-way playback | Two-way voice + video dialogue |
| Perception | None | User appearance & emotion recognition |
| Personality | Fixed script | Custom persona + short-term memory |
Integrate the Vidu S1 API in 6 Steps
Three channels power every session: HTTP for session management, AliRTC for audio/video transport, WebSocket for control signaling.
Create a Session
One POST call with your character's persona, avatar image and voice returns a session ID plus RTC credentials.
POST https://api.vidu.com/live/v1/lives
Authorization: Token vda_xxx

{
  "call_mode": "video",
  "avatar": {
    "persona": "A friendly agent...",
    "image_uri": "https://your-avatar.png",
    "name": "Mia",
    "voice": "Tina"
  }
}
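As a rough sketch, the call above could be assembled in JavaScript like this. `buildCreateSessionRequest` is a hypothetical helper, not part of any Vidu SDK, and the `Content-Type` header is an assumption based on the JSON body. The response is described here only as "a session ID plus RTC credentials", so the sketch does not try to parse it.

```javascript
// Hypothetical helper: assembles the create-session request from the example above.
const VIDU_API_BASE = "https://api.vidu.com"; // international host named on this page

function buildCreateSessionRequest(apiKey, avatar, callMode = "video") {
  return {
    url: `${VIDU_API_BASE}/live/v1/lives`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Token ${apiKey}`,
        "Content-Type": "application/json", // assumed; the example only shows a JSON body
      },
      body: JSON.stringify({ call_mode: callMode, avatar }),
    },
  };
}

const request = buildCreateSessionRequest("vda_xxx", {
  persona: "A friendly agent...",
  image_uri: "https://your-avatar.png",
  name: "Mia",
  voice: "Tina",
});
// Sending it is then one call: const res = await fetch(request.url, request.init);
```

Keeping the request builder separate from the network call makes it easy to log or unit-test the payload before spending credits.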
Join the RTC Channel
Join the AliRTC channel with the returned token, publish your user's microphone (and camera in video mode), then subscribe to the character's stream.
await aliRtc.joinChannel(rtc.token, rtc.user_id);
await aliRtc.publishLocalAudioStream(true);
await aliRtc.publishLocalVideoStream(true);
// subscribe: live-bot-{creatorID}-{liveID}
Open the WebSocket
Connect the persistent control channel. Authentication goes in the query string — browsers can't set custom headers on WebSockets.
wss://api.vidu.com/live/ws/live/connect?live_id={live_id}&authorization=Token%20vda_xxx

{ "type": 1, "seq_id": 1, "payload": { "conn_init": { "version": 1 } } }
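A minimal sketch of building that URL and the init message in JavaScript. The helper names `buildControlUrl` and `buildConnInit` are illustrative assumptions; only the URL, the query parameters and the message fields come from the example above.

```javascript
// Hypothetical helpers for the control channel; names are illustrative only.
const CONTROL_WS_BASE = "wss://api.vidu.com/live/ws/live/connect";

function buildControlUrl(liveId, apiKey) {
  // encodeURIComponent turns the space in "Token vda_xxx" into %20, as in the example URL.
  const auth = encodeURIComponent(`Token ${apiKey}`);
  return `${CONTROL_WS_BASE}?live_id=${encodeURIComponent(liveId)}&authorization=${auth}`;
}

function buildConnInit(seqId = 1) {
  return JSON.stringify({
    type: 1,
    seq_id: seqId,
    payload: { conn_init: { version: 1 } },
  });
}

const controlUrl = buildControlUrl("live_123", "vda_xxx");
// In a browser:
//   const ws = new WebSocket(controlUrl);
//   ws.onopen = () => ws.send(buildConnInit());
```

`encodeURIComponent` is used rather than `URLSearchParams` because the latter encodes the space as `+`, which differs from the `%20` shown in the example.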
Wait Until Ready
A success ack means the character is live. NOT_READY is normal in video mode — reconnect with exponential backoff (2s → 4s → 8s).
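The retry rule can be sketched as follows. How NOT_READY is signalled on the wire is not specified on this page, so `connectOnce` is a hypothetical caller-supplied function that resolves to `true` on a success ack and `false` on NOT_READY. The retry count of three simply mirrors the 2s, 4s, 8s sequence.

```javascript
// Delay before retry number `attempt` (0-based): 2 s, 4 s, 8 s, ...
function backoffDelayMs(attempt, baseMs = 2000) {
  return baseMs * 2 ** attempt;
}

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// `connectOnce` (hypothetical) opens the WebSocket, sends conn_init and resolves
// to true on a success ack or false on NOT_READY.
async function connectWithBackoff(connectOnce, maxRetries = 3, sleep = defaultSleep) {
  for (let attempt = 0; ; attempt++) {
    if (await connectOnce()) return true;
    if (attempt >= maxRetries) return false; // all scheduled retries are used up
    await sleep(backoffDelayMs(attempt));
  }
}

const retrySchedule = [0, 1, 2].map((n) => backoffDelayMs(n));
```

Passing `sleep` as a parameter lets tests swap in an instant fake instead of waiting on real timers.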
{ "type": 2, "payload": { "conn_init_ack": { "success": true } } }
// NOT_READY? retry with backoff: 2s -> 4s -> 8s
Keep the Session Alive
The server pings every 5 seconds; respond within 15 seconds. Listen for forced-disconnect messages (type 6) and handle each hangup reason.
// server pings every 5s; respond within 15s
{ "type": 6, "payload": { "hangup": { "hangup_reason": "credit_insufficient" } } }
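One way to handle forced disconnects is sketched below. The function names are hypothetical, and `credit_insufficient` is the only server-side reason shown on this page, so every other value falls through to a generic message. The ping and reply message formats are not given here, so the sketch leaves heartbeat replies out.

```javascript
// Returns the hangup reason if `raw` is a forced-disconnect message (type 6), else null.
function forcedHangupReason(raw) {
  let msg;
  try {
    msg = JSON.parse(raw);
  } catch {
    return null; // not JSON, so not a control message we understand
  }
  if (msg?.type !== 6) return null;
  return msg.payload?.hangup?.hangup_reason ?? "unknown";
}

// Map a reason to a user-facing message. Only "credit_insufficient" appears on
// this page; anything else gets a generic description.
function describeHangup(reason) {
  switch (reason) {
    case "credit_insufficient":
      return "Session ended: balance too low. Top up and start a new session.";
    default:
      return `Session ended by server (${reason}).`;
  }
}

const sampleHangup =
  '{ "type": 6, "payload": { "hangup": { "hangup_reason": "credit_insufficient" } } }';
const hangupReason = forcedHangupReason(sampleHangup);
```

In a client this would typically run inside the WebSocket `onmessage` handler, before any other message routing.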
Hang Up & Query Billing
Send the hangup message, close the WebSocket, leave the RTC channel — then query the final status and billed seconds.
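The teardown order could look roughly like this in JavaScript. `endSession` and `leaveRtc` are hypothetical names: `leaveRtc` stands in for whatever your AliRTC integration uses to leave the channel. The message fields and the string-valued `billed_seconds` follow the examples in this step.

```javascript
function buildHangup(seqId, reason = "user_end") {
  return JSON.stringify({
    type: 5,
    seq_id: seqId,
    payload: { hangup: { hangup_reason: reason } },
  });
}

// billed_seconds arrives as a string ("87") in the example, so convert it to a number.
function parseBilledSeconds(session) {
  const seconds = Number(session.billed_seconds);
  return Number.isFinite(seconds) ? seconds : null;
}

// Teardown in the order described: hangup message, close the socket, leave RTC.
// `ws` is an open WebSocket; `leaveRtc` is a caller-supplied async function (hypothetical).
async function endSession(ws, leaveRtc, seqId) {
  ws.send(buildHangup(seqId));
  ws.close();
  await leaveRtc();
}

const billedSeconds = parseBilledSeconds({ billed_seconds: "87" });
```

After `endSession` resolves, a GET on `/live/v1/lives/{live_id}` returns the final status, and `parseBilledSeconds` can be applied to its body.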
{ "type": 5, "seq_id": 2, "payload": { "hangup": { "hangup_reason": "user_end" } } }
GET /live/v1/lives/{live_id} → "billed_seconds": "87"
API Surface at a Glance
A compact, predictable API. Hosts: api.vidu.cn (China) and api.vidu.com (international).
| Method | Path | Purpose |
|---|---|---|
| POST | /live/v1/lives | Create a digital character session |
| GET | /live/v1/lives/{live_id} | Query session status and billing |
| WSS | /live/ws/live/connect | Control signaling (init / hangup) |
| POST | /live/v1/voices/clone | Create a cloned custom voice |
| GET | /live/v1/voices | List system and custom voices |
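For teams serving both regions, a small URL helper keeps host selection in one place. This is only a sketch: the region keys `cn` and `intl` are made-up labels, and HTTPS on the China host is assumed by analogy with the international example.

```javascript
// Hosts as listed on this page; the keys "cn" and "intl" are illustrative labels.
const VIDU_HOSTS = { cn: "api.vidu.cn", intl: "api.vidu.com" };

function restUrl(region, path) {
  const host = VIDU_HOSTS[region];
  if (!host) throw new Error(`unknown region: ${region}`);
  return `https://${host}${path}`; // HTTPS on both hosts is an assumption
}

function sessionStatusUrl(region, liveId) {
  return restUrl(region, `/live/v1/lives/${encodeURIComponent(liveId)}`);
}

const statusUrl = sessionStatusUrl("cn", "live_123");
const voicesUrl = restUrl("intl", "/live/v1/voices");
```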
HTTP API
Create and query sessions. Simple token auth with your API key.
AliRTC Channel
All real-time audio and video flows through AliRTC — not HTTP. One SDK integration on the client.
WebSocket Signaling
A lightweight control channel for readiness, heartbeats and hangup events.
Four States, Fully Observable
Every session follows a predictable state machine — easy to monitor, easy to bill, easy to debug.
waiting
Session created, room open, character warming up
on_live
Both ends ready — conversation and billing begin
ending
Hangup received, session closing gracefully
ended
Finished — query billed seconds any time
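A client can mirror these states with a few lines of JavaScript. The sketch below assumes the four states always occur in the order listed, which this page implies but does not state outright. The helper names are hypothetical.

```javascript
// States in the order listed above. Treating them as strictly linear is an assumption.
const SESSION_STATES = ["waiting", "on_live", "ending", "ended"];

// True once `current` is at or past `target` in the listed order.
function hasReached(current, target) {
  const i = SESSION_STATES.indexOf(current);
  const j = SESSION_STATES.indexOf(target);
  if (i < 0 || j < 0) throw new Error(`unknown state: ${i < 0 ? current : target}`);
  return i >= j;
}

// Billing begins at on_live, so it has started once that state is reached.
const billingStarted = (state) => hasReached(state, "on_live");
const isTerminal = (state) => state === "ended";
```

For example, a monitoring dashboard could poll the session status and stop polling once `isTerminal` returns true.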
Where Teams Deploy Vidu S1
Six industries are already putting interactive digital characters in front of real users.
AI Companionship
Always-on characters with persona and memory that chat face-to-face, react to moods and build long-term bonds.
Virtual Idols
Anime or realistic idols that host live shows, take fan questions and perform for hours without breaks.
Training & Education
Tutors and trainers that explain, demonstrate and adapt to each learner's questions in real time.
AI Customer Service
A friendly face for support: perceives frustration, answers naturally and hands off smoothly when needed.
Live-Stream Commerce
Digital hosts that present products around the clock and answer buyer questions the moment they're asked.
Interactive Entertainment
Playable characters and shadow-play experiences where the story reacts to the player's voice and face.
50+ Voices, One Parameter Away
Every voice speaks 28 languages. Swap personalities with a single field — or clone your own.
Sweet and warm — solves problems without hesitation (default)
Gentle and warm
Deep and mellow, aged like coffee and old books
A blend of intellect and warmth
Premium American female voice, cinematic quality
American college guy who loves cooking
Mature, intellectual British girl-next-door
Warm and expressive Korean older sister
Mischievous childhood friend from Japan
Romantic French big brother
Warm, enthusiastic Latin American energy
Sweet Hong Kong girl, native Cantonese
🌍 28 Languages Out of the Box
Chinese, English, Japanese, Korean, French, German, Spanish, Portuguese, Russian, Arabic, Hindi, Thai, Vietnamese, Indonesian, Turkish and more — plus regional dialects like Cantonese, Sichuanese, Hokkien and Taiwanese Mandarin.
🧬 Voice Cloning API
Need a brand voice or a specific person's timbre? Create custom cloned voices via POST /live/v1/voices/clone and list them alongside system voices via GET /live/v1/voices.
Transparent, Usage-Based Pricing
Pay only for live conversation time. Audio and video modes cost exactly the same.
Free Trial
1,000 credits for every new user, enough for about 11 minutes of live interaction.
- Full API access, no feature gates
- All 50+ voices and 28 languages
- Audio and video call modes
- Custom persona and avatar image
Pay As You Go
Simple metering: billing starts only when the character actually goes live.
- Same price for audio and video mode
- Deducted every 6 s, rounded to 2 s intervals
- Sessions up to 600 s, auto-renewable
- Billing starts at on_live, never before
- Minimum balance: 45 credits per session
Enterprise
Tailored solutions for social, e-commerce, gaming and education platforms.
- Dedicated account manager
- Custom character and persona design
- Voice cloning onboarding support
- Architecture review for your scenario
Credit unit price: 0.03125. A session auto-disconnects when the maximum duration (600 s) is reached; when the balance hits zero the server closes the connection automatically.
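The numbers above can be combined into a quick budget check. This is a sketch under stated assumptions: the currency of the unit price is not given on this page, the 45-credit minimum is read as "needed to start a session", and the per-second rate is only inferred from "1,000 credits for about 11 minutes" rather than published.

```javascript
const CREDIT_UNIT_PRICE = 0.03125; // per credit; currency is not stated on this page
const MIN_SESSION_BALANCE = 45;    // credits; read here as the minimum to start a session

function creditsToCost(credits) {
  return credits * CREDIT_UNIT_PRICE;
}

function canStartSession(balance) {
  return balance >= MIN_SESSION_BALANCE;
}

// Inferred, not published: derived from the free trial's credits and stated duration.
function impliedCreditsPerSecond(credits, seconds) {
  return credits / seconds;
}

const trialCost = creditsToCost(1000); // 31.25 in the unstated currency
const approxRate = impliedCreditsPerSecond(1000, 11 * 60); // roughly 1.5 credits per second
```

Treat `approxRate` as a planning estimate only; the billed seconds returned by the status endpoint remain the authoritative figure.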
Put a Living, Breathing AI Character in Your Product
Get your API key, spend your 1,000 free credits, and have a real-time digital human talking to users this week.
Or get your API key instantly at apimart.ai