ChatGPT found generating graphic images from a simple prompt tweak

OpenAI’s latest public ChatGPT version has been found generating sexualized or violently graphic images from a simple prompt tweak, according to researchers who tested a widely shared instruction originally meant to produce harmless, funny results. The finding is awkward for a company that has spent months talking up safeguards, and it lands in the middle of a broader AI safety debate that keeps resurfacing whenever a model is pushed a little off-script.

British AI security startup Mindgard said it was able to coax the chatbot into producing disturbing imagery after slightly altering the prompt. Even after the researchers made further changes to the prompt, the system still returned content they considered troubling, including bloody and sexualized scenes.

How the ChatGPT prompt produced disturbing images

The key detail is that the prompt did not need to be overtly malicious. Mindgard said the model filled in the gaps on its own, which is exactly the sort of failure that makes safety teams lose sleep: the user does not have to spell out the abuse case in painful detail for the system to wander into it anyway.

That is also why this kind of issue keeps coming back across the industry. Competing AI image tools have faced similar pressure to tighten filters after users tried to push them toward graphic or sexual content, and the pattern tends to be the same: a public fix arrives, then a workaround shows up soon after.

OpenAI says it added more safeguards

After the BBC asked about the issue, OpenAI said it had taken steps to stop ChatGPT from producing similar images and pointed to layered protections meant to block content that violates its usage rules. The company did not spell out exactly how the guardrails changed, which is standard practice in this area and also a neat reminder that security claims are easiest to make in broad strokes.

For OpenAI, the reputational risk is obvious. For everyone else building image-capable assistants, the harder lesson is that a model can look restrained in the lab and still get creative the moment a prompt is nudged in the wrong direction.

What Mindgard says is still broken

Mindgard’s Peter Garraghan said the most unsettling part was that the prompt did not specify a theme, yet the system still generated a series of bloody and sexualized images. That points to a familiar weakness in multimodal AI: guardrails can filter obvious requests, but models may still infer or amplify intent in ways developers did not plan for.

The next test is whether these fixes hold up outside a narrow demonstration. If they do not, the industry will keep doing what it has done before: patch the headline, then chase the workaround.

Source: Ixbt
