OpenAI has pushed three new voice models into its API, aiming squarely at developers building assistants that can talk, translate, and transcribe without sounding like they were assembled from three separate products. The catch is classic OpenAI: the models are available through API access, not as a consumer-facing feature anyone can just click on.

The headline act is GPT-Realtime-2, which replaces GPT-Realtime-1.5 and is built on GPT-5-style reasoning to handle tougher user requests. That matters because voice bots usually fail in the messy middle of a conversation: they stall, forget context, or answer as if the user stopped speaking half a sentence ago. OpenAI is trying to fix that with more context, better tool use, and fewer awkward dead ends.

What GPT-Realtime-2 changes

OpenAI says GPT-Realtime-2 can handle tool calls in parallel, recover more gracefully from errors, and even use short preambles such as “let me check that” before finishing a task. It also expands the context window from 32K to 128K, improves handling of specialist terms and healthcare vocabulary, and lets developers tune reasoning from minimal all the way up to super-high.
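To make the pattern concrete, here is a rough sketch of what "parallel tool calls with a spoken preamble and graceful error recovery" could look like on the application side. It does not call OpenAI's API at all: the tool functions (`lookup_order`, `check_inventory`) and the helpers (`run_tool_calls`, `handle_turn`) are hypothetical names invented for illustration, and the only real library used is Python's standard `asyncio`.

```python
import asyncio


async def lookup_order(order_id: str) -> str:
    """Hypothetical tool: pretends to query an order system."""
    await asyncio.sleep(0.01)  # stand-in for network latency
    return f"order {order_id}: shipped"


async def check_inventory(sku: str) -> str:
    """Hypothetical tool that always fails, to show error recovery."""
    await asyncio.sleep(0.01)
    raise RuntimeError("inventory service unavailable")


async def run_tool_calls(calls):
    """Run every (function, argument) pair concurrently.

    return_exceptions=True keeps one failing tool from cancelling the
    others; failures come back as exception objects in the result list.
    """
    results = await asyncio.gather(
        *(fn(arg) for fn, arg in calls), return_exceptions=True
    )
    return [
        f"tool failed: {r}" if isinstance(r, Exception) else r
        for r in results
    ]


def handle_turn(calls):
    """Return the lines an assistant might say for one user turn."""
    spoken = ["Let me check that."]  # short preamble before the tools finish
    spoken.extend(asyncio.run(run_tool_calls(calls)))
    return spoken


if __name__ == "__main__":
    for line in handle_turn([(lookup_order, "A1"), (check_inventory, "X9")]):
        print(line)
```

The point of the sketch is the shape of the turn: the preamble goes out first, both tools run at the same time, and a failed tool becomes something the assistant can report rather than a dead end.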

That is a clear signal that voice agents are moving beyond simple dictation and into workflow territory. The real competition here is not just other model makers; it is the call-center software, translation tools, and enterprise assistants that are already trying to own the same use cases.

Translation and transcription join the stack

The second model, GPT-Realtime-Translate, is built for live translation and supports more than 70 input languages and 13 output languages. OpenAI says it can preserve meaning while adapting to changes in context, regional accents, and subject-specific jargon, which is the sort of promise translation vendors love to make right before reality adds a few commas of its own.

GPT-Realtime-Whisper is the third model, and its job is simpler: streaming speech-to-text with low latency. Together, the trio turns OpenAI’s realtime API into a full voice pipeline rather than a single-purpose model demo.

OpenAI realtime voice model pricing for developers

OpenAI prices GPT-Realtime-2 per million tokens and bills the two specialist models per minute of audio:

  • GPT-Realtime-2: $32 for 1 million input audio tokens, $0.40 for 1 million cached input tokens, $64 for 1 million output audio tokens
  • GPT-Realtime-Translate: $0.034 per minute
  • GPT-Realtime-Whisper: $0.017 per minute
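For a sense of scale, the listed rates can be turned into a back-of-the-envelope estimator. This is only a sketch: the prices are the ones quoted above, but the function names and the token counts in the example are made up for illustration, and the article does not say how many audio tokens a minute of speech consumes.

```python
# Prices quoted in the article, in US dollars.
REALTIME_2_USD_PER_MILLION_TOKENS = {
    "input_audio": 32.00,
    "cached_input": 0.40,
    "output_audio": 64.00,
}
USD_PER_MINUTE = {
    "GPT-Realtime-Translate": 0.034,
    "GPT-Realtime-Whisper": 0.017,
}


def realtime_2_cost(input_audio_tokens: int,
                    cached_input_tokens: int,
                    output_audio_tokens: int) -> float:
    """Estimate a GPT-Realtime-2 bill from token counts."""
    rates = REALTIME_2_USD_PER_MILLION_TOKENS
    total = (
        input_audio_tokens * rates["input_audio"]
        + cached_input_tokens * rates["cached_input"]
        + output_audio_tokens * rates["output_audio"]
    )
    return total / 1_000_000


def per_minute_cost(model: str, minutes: float) -> float:
    """Estimate the bill for one of the per-minute models."""
    return USD_PER_MINUTE[model] * minutes


if __name__ == "__main__":
    # Hypothetical workload: the token counts below are illustrative only.
    print(f"GPT-Realtime-2: ${realtime_2_cost(200_000, 50_000, 100_000):.2f}")
    for name in USD_PER_MINUTE:
        print(f"{name}, one hour: ${per_minute_cost(name, 60):.2f}")
```

At these rates an hour of live translation comes to about $2.04 and an hour of streaming transcription to about $1.02, while output audio on GPT-Realtime-2 costs twice as much per token as input audio.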

Developers can already test all three in OpenAI Playground, making this less of a teaser and more of an invitation to build. The bigger question is whether teams will adopt OpenAI’s stack or keep hedging with competing voice and transcription services. Voice AI is getting better fast, but the market will reward whoever makes it reliable enough that people stop noticing the tech and start trusting the conversation.

Source: 3dnews
