OpenAI has added three new audio models to its API, pushing its Realtime API stack beyond fast speech recognition and into something closer to an actual voice worker: GPT-Realtime-2 for reasoning-heavy conversations, GPT-Realtime-Translate for live multilingual speech, and GPT-Realtime-Whisper for streaming transcription. The pitch is simple enough: let apps listen, understand, translate, and respond while people keep talking.

That matters because voice software has usually been good at one thing at a time. It can transcribe, or translate, or answer, but stitching those capabilities together in a conversation that stays coherent is the hard part. OpenAI is clearly aiming at developers building call centers, travel tools, meeting assistants, and customer support bots that need more than a polite voice and a quick comeback.

GPT-Realtime-2 is the model for live voice agents

GPT-Realtime-2 is the headline act. OpenAI says it is its first voice model with GPT-5-class reasoning, and the company has tuned it for live exchanges where a request changes mid-sentence, a tool call has to happen in the background, or the assistant needs to recover without sounding broken. The default reasoning level is low, but developers can dial that up to xhigh when the task needs more deliberate thinking.

The practical upgrades are the sort that matter once a model leaves the demo stage: a context window expanded from 32K to 128K, better handling of proper nouns and specialist vocabulary, and more control over tone. OpenAI also says the model can make its work audible with short preambles such as “let me check that” or “one moment while I look into it,” which is a small touch that can stop a voice agent from feeling like it has disappeared into a black box.

  • Context window: 32K to 128K
  • Reasoning levels: minimal, low, medium, high, xhigh
  • Audio benchmarks: 15.2% higher on Big Bench Audio and 13.8% higher on Audio MultiChallenge versus GPT-Realtime-1.5, according to OpenAI

OpenAI is also leaning on a familiar enterprise playbook here: better guardrails, clearer tool use, and higher benchmark scores to reassure businesses that voice agents can be more than flashy prototypes. That is the same lane Google and Amazon have been chasing in different forms, but OpenAI is trying to package the whole stack into something developers can actually ship.

Live translation now covers 70+ input languages

GPT-Realtime-Translate is aimed at a more obvious pain point: live multilingual conversation. It translates speech from more than 70 input languages into 13 output languages while keeping up with the speaker, and it also surfaces real-time transcriptions. That puts it squarely in the path of customer support, education, events, media, and any app trying to pretend the internet is one tidy room instead of a very loud planet.

The company says the model is designed to preserve meaning without lagging behind natural speech, even when pronunciation shifts by region or the speaker changes topic. Deutsche Telekom is testing it for multilingual voice interactions, and Vimeo is using it to translate product education videos live as they play. Those are sensible early bets: translation is most useful when it disappears into the interaction instead of announcing itself like a conference interpreter with a stopwatch.

  • Input languages: more than 70
  • Output languages: 13
  • Pricing: $0.034 per minute

GPT-Realtime-Whisper targets captions, notes, and support workflows

The third model, GPT-Realtime-Whisper, is a streaming speech-to-text system built for low latency. Instead of waiting for a speaker to finish, it transcribes as the audio arrives, which makes it useful for captions, meeting notes, classroom tools, broadcasts, and live customer workflows that cannot afford to sit around and catch up later.

OpenAI’s larger bet is that transcription is no longer just an accessibility layer. In practice, it becomes the input pipe for everything else: summaries, action items, agent workflows, and follow-up systems in healthcare, recruiting, sales, and support. The model is priced at $0.017 per minute, which should make it attractive for products that need constant speech handling without turning every meeting into a billing event.
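To put the per-minute prices quoted in this article in perspective, here is a rough back-of-the-envelope sketch in Python. The rates are the ones reported above; the lowercase model keys and the `estimate_cost` helper are illustrative names, not confirmed API identifiers, and actual billing may round or meter audio differently.

```python
# Per-minute prices in USD, as quoted in this article.
# The dictionary keys are illustrative labels, not confirmed API model IDs.
PRICE_PER_MINUTE = {
    "gpt-realtime-translate": 0.034,
    "gpt-realtime-whisper": 0.017,
}


def estimate_cost(model: str, minutes: float) -> float:
    """Return the estimated cost in USD for `minutes` of audio."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return PRICE_PER_MINUTE[model] * minutes


if __name__ == "__main__":
    # Cost of one hour of continuous audio for each model.
    for model in PRICE_PER_MINUTE:
        print(f"{model}: ${estimate_cost(model, 60):.2f} per hour")
```

At these rates, an hour of streaming transcription comes to roughly a dollar, and an hour of live translation to roughly two.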

Realtime API safety limits and pricing

OpenAI says the Realtime API includes active classifiers that can halt sessions if conversations appear to violate its harmful-content rules, and developers can add more guardrails using the Agents SDK. The company also reminds users that outputs cannot be repurposed for spam or deception, and that the use of AI should be disclosed to end users unless the context already makes it obvious.

GPT-Realtime-2 is priced at $32 per 1M audio input tokens ($0.40 per 1M for cached input tokens) and $64 per 1M audio output tokens. For developers, the real question is whether those numbers buy enough reliability to replace a patchwork of speech, translation, and orchestration tools. OpenAI is betting that the answer is yes, and that the next big interface shift will sound a lot less like a chatbot and a lot more like a conversation that gets something done.
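For a sense of how the token prices add up, the arithmetic can be sketched in a few lines of Python. This is an estimate under stated assumptions: the cached-input price is read as $0.40 per 1M tokens, the token counts in the example are made up for illustration, and the `session_cost` helper is hypothetical rather than part of any OpenAI SDK.

```python
# GPT-Realtime-2 audio token prices in USD per 1M tokens, as quoted above.
# Assumption: the cached-input price is also per 1M tokens.
INPUT_PER_M = 32.00
CACHED_INPUT_PER_M = 0.40
OUTPUT_PER_M = 64.00


def session_cost(input_tokens: int, cached_input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a session from its audio token counts."""
    return (
        input_tokens * INPUT_PER_M
        + cached_input_tokens * CACHED_INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
    ) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 200K fresh input, 800K cached input, 100K output tokens.
    cost = session_cost(200_000, 800_000, 100_000)
    print(f"Estimated cost: ${cost:.2f}")
```

In this made-up workload, output tokens cost as much as the fresh input despite being half as numerous, while the cached input is nearly free. That suggests how much of a voice agent's context can be served from cache will weigh heavily on the final bill.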

Source: OpenAI
