Overview - Fish Audio

Core features

Text to Speech

Convert text into lifelike speech with the s2-pro and s1 models.

Speech to Text

Transcribe audio to text with per-segment timestamps.

Voice Cloning

Clone a voice instantly from a clip, or train a persistent model.

Professional Voice Clone

Clone a real, verified voice at studio quality in the web app.

Realtime Streaming

Stream audio as it generates — for voice agents and live apps.

Manage Voices

List, inspect, update, and delete your voice models.

Also in the web app

These run in the browser, no code required — see the Platform guide.

Voice Design

Design a voice from a plain-text description — best for original characters.

Voice Changer

Transform existing audio into a different voice.

Story Studio

Produce multi-speaker, long-form audio — audiobooks and narration.

Sound Effects

Generate cinematic sound effects from a text prompt.

Audio Separation

Split audio into stems and run related processing utilities.

Models

Two text-to-speech models power most capabilities:

s2-pro: the default, highest-quality model, with multi-speaker support and natural-language expression control.

s1: the previous generation, with emotion tags written in parentheses.

See Models Overview and Choosing a Model for the full lineup, languages, and limits.

Pick your path

Use the web app

No code — generate audio, clone voices, and produce projects in your browser.

Build with the SDK

Use the Python library in your application.

Call the API

Raw REST and WebSocket endpoints for any language.

Use your AI coding agent

Install the Fish Audio skill so your agent writes correct code.
