Multimodal means your agent can handle more than text. Customers can speak to it, send pictures, or share documents and even video, and your agent can respond in kind.
Find these settings under Settings → Multimodal on any agent.
Voice
Turn on Voice to let customers send audio messages and hear the agent's reply spoken back. Pick a voice from the dropdown — preview each one to find the right tone.
- Works on the website widget, WhatsApp (voice notes), and any channel that supports audio.
- Real-time speech: customers can interrupt, just like a phone call.
- Adds a small per-minute cost — see your team's plan for details.
Images
With Images on, customers can attach photos to a message. Your agent can describe what it sees, identify products, read text from a screenshot, or extract details from a receipt.
Useful for:
- Customer support ("here's the error I'm seeing")
- E-commerce ("do you have this in red?")
- Field operations ("this is the part that broke")
- Education ("can you explain this diagram?")
Files
With Files on, customers can attach PDFs, Word docs, and spreadsheets directly to a chat. Your agent reads the contents and answers questions about them.
Video
Video support lets customers share a clip — your agent extracts audio and key frames, then reasons about both. Best for product demos, onboarding walkthroughs, or troubleshooting.
Streaming
Enabled by default. Replies appear word by word, so the conversation feels alive. Turn it off only for SMS-style channels where partial messages don't make sense.
Channel support
Not every channel supports every modality. As a rough guide:
- Website widget — supports everything
- WhatsApp — text, images, voice notes, files
- Telegram — text, images, voice notes, files, video
- SMS — text only (and short text at that)
- Email — text and file attachments
- Slack / Messenger / Instagram — text, images, files
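
If you route messages in your own code, the list above can be captured as a small lookup table. This is only a sketch: the channel keys, modality names, and helper functions are hypothetical and are not identifiers from the product. It also assumes "everything" for the website widget means the five modalities named on this page (text, images, voice, files, video).

```python
# Hypothetical lookup table mirroring the channel list above.
# Keys and modality names are illustrative, not product identifiers.
CHANNEL_MODALITIES = {
    "website_widget": {"text", "images", "voice", "files", "video"},
    "whatsapp": {"text", "images", "voice", "files"},
    "telegram": {"text", "images", "voice", "files", "video"},
    "sms": {"text"},
    "email": {"text", "files"},
    "slack": {"text", "images", "files"},
    "messenger": {"text", "images", "files"},
    "instagram": {"text", "images", "files"},
}


def supports(channel: str, modality: str) -> bool:
    """Return True if the channel is listed as supporting the modality."""
    # Unknown channels fall back to an empty set, so they support nothing.
    return modality in CHANNEL_MODALITIES.get(channel, set())


def channels_for(modality: str) -> list[str]:
    """Return the channels that support a modality, sorted alphabetically."""
    return sorted(
        channel
        for channel, modalities in CHANNEL_MODALITIES.items()
        if modality in modalities
    )
```

For example, `supports("whatsapp", "video")` is false under this table, so a customer on WhatsApp who wants to share a clip would need a channel such as Telegram or the website widget instead.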
