OpenAI has pushed three voice models into its Realtime API, and the pitch is clear: this is for assistants that need to listen properly, keep context, and actually get things done. That puts OpenAI in direct competition with Google, Perplexity, and a long line of startups that still tend to ship “chatbot with a microphone” experiences rather than usable voice interfaces.

The star of the bundle is GPT-Realtime-2. It comes with a context window that jumps from 32K to 128K, reasoning effort levels from minimal to xhigh, and better handling of pauses, mistakes, and abrupt topic changes. In other words, the model is being tuned for the messy way people really talk, not the neat way demo scripts pretend they do.

GPT-Realtime-2 gains context and control

OpenAI is also leaning on numbers, because voice systems are easy to praise and hard to trust. At the high reasoning setting, GPT-Realtime-2 beat GPT-Realtime-1.5 by 15.2% on Big Bench Audio, while at the xhigh setting it was 13.8% better on Audio MultiChallenge. That kind of improvement matters more for support calls, bookings, and other profitable chores than yet another polished voice that starts tripping over itself after a minute.

Zillow is one of the named early examples. Josh Weisberg, the company’s senior vice president and head of AI, said the team was most impressed by the model’s intelligence and tool-use reliability in complex voice flows. After prompt optimization, successful calls rose from 69% to 95%, and the system reportedly performed better on Fair Housing compliance checks.

Translation and live transcription get cheaper

The second part of the release targets more ordinary jobs. GPT-Realtime-Translate supports more than 70 input languages and 13 output languages, while GPT-Realtime-Whisper provides low-latency streaming transcription. One is for live translation, the other for captions, notes, and any workflow where waiting for the end of a sentence is already too slow.

OpenAI is entering a crowded lane here, with Zoom, Google Meet, and enterprise transcription vendors already selling similar promises. The opening is that language coverage and resilience to accents, regional speech, and domain jargon are still weak spots across much of the category, so whoever handles real-world speech best wins the business.

OpenAI voice model pricing

  • GPT-Realtime-2: $32 per 1 million audio input tokens and $64 per 1 million audio output tokens
  • GPT-Realtime-Translate: $0.034 per minute
  • GPT-Realtime-Whisper: $0.017 per minute
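
To make the two billing schemes easier to compare, here is a minimal Python sketch that turns the listed prices into a cost estimate. The rates are the ones quoted above; the function names and the example token and minute counts are hypothetical, chosen only for illustration. The article does not say how many audio tokens a minute of speech consumes, so the token-priced and minute-priced models are not converted into each other here.

```python
# Rates as quoted in the pricing list above (USD).
AUDIO_INPUT_PER_MILLION = 32.00    # GPT-Realtime-2, per 1M audio input tokens
AUDIO_OUTPUT_PER_MILLION = 64.00   # GPT-Realtime-2, per 1M audio output tokens
TRANSLATE_PER_MINUTE = 0.034       # GPT-Realtime-Translate
WHISPER_PER_MINUTE = 0.017         # GPT-Realtime-Whisper


def realtime2_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a token-billed session (hypothetical helper, not an OpenAI API)."""
    total = (input_tokens * AUDIO_INPUT_PER_MILLION
             + output_tokens * AUDIO_OUTPUT_PER_MILLION)
    return total / 1_000_000


def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    """Cost of a minute-billed session (hypothetical helper)."""
    return minutes * rate_per_minute


if __name__ == "__main__":
    # Illustrative volumes only; real usage will differ.
    print(f"Realtime-2: ${realtime2_cost(50_000, 20_000):.2f}")
    print(f"Translate, 60 min: ${per_minute_cost(60, TRANSLATE_PER_MINUTE):.2f}")
    print(f"Whisper, 60 min: ${per_minute_cost(60, WHISPER_PER_MINUTE):.2f}")
```

With these made-up volumes, output tokens cost twice as much as input tokens, so a talkative assistant is more expensive than a talkative caller, and an hour of transcription costs half as much as an hour of translation.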

OpenAI says all three models are already available in the Realtime API. It is also pushing safety features such as active session classifiers, plus custom rules through the Agents SDK, which is a sensible move in a product category where things tend to go sideways in the least charming ways possible.

The real test now is simple: which company can make the voice layer reliable enough to sell, support, and translate at scale, not just sound good for 30 seconds. The winners will be the ones whose assistants stop hanging halfway through a sentence.

Source: Itzine
