GPT-4.5 fooled people in a live Turing test

GPT-4.5 has done something more awkward than benchmark bragging and more useful than yet another chatbot demo: in a live Turing test, it was judged to be human more often than the actual people sitting on the other side of the test. In a study from the University of California, San Diego, judges chatted in real time with both a person and an AI, then had to decide which was which based only on the messages. With a persona attached, GPT-4.5 was picked as human in 73% of cases.

That is not the same as intelligence, consciousness, or anything that belongs in a sci-fi monologue. It does, however, show how thin the line has become in short, casual chat, especially when the model is given a role to play. The neighboring contender in the same experiment, LLaMa-3.1-405B, also crossed the line often enough to be unsettling, being taken for a human in 56% of dialogues.

How the live Turing test worked

The setup was simple and mean in the best research sense. Judges spoke simultaneously with one human and one AI in a three-way format, then tried to identify the human using only the content of the exchange. No profile photos, no metadata, no little typing indicators to save the day.

That matters because it shifts the test away from abstract chatbot lore and toward ordinary messaging behavior. The result is less about whether a model can pass as a philosopher and more about whether it can survive a few minutes in a normal chat without giving itself away.

Why the persona prompt changed the result

The researchers say the persona prompt made the models substantially more convincing. That tracks with how modern AI products are already used: once you give a model a role, a tone, or a target audience, it gets better at sounding like someone rather than something. In practice, that is exactly what makes it useful for customer support, tutoring, and social posting – and exactly what makes it harder to spot.

The classic Turing test was never a clean measure of “thinking,” and this paper does not pretend otherwise. It treats the test more like a behavior check for plausibility, which is a more modest claim and a more worrying one. A system does not need understanding to create the social effect of understanding.

Where this gets messy fast

The broader implication is not that AI has become human. It is that a large slice of everyday text communication may already be too shallow to reliably distinguish people from systems at speed. That is a headache for moderation, phishing detection, customer service, education platforms, and political messaging, all of which depend on fast trust decisions.

GPT-4.5 was judged human in 73% of live dialogues with a persona prompt.
LLaMa-3.1-405B was judged human in 56% of dialogues in the same experiment.
The judges relied only on message content, with no identity signals.

So yes, the machine can now mimic presence well enough to confuse people in short exchanges. The next question is less philosophical and more practical: if chat interfaces keep getting more convincing, who is going to label them clearly enough to matter?

Source: Ixbt
