Published on November 17, 2023, 7:55 am

Popular text-to-image AI models can be manipulated into generating disturbing and inappropriate images, according to the researchers who discovered the flaw. Stability AI’s Stable Diffusion and OpenAI’s DALL-E 2 could both be coaxed past their safety filters into producing images of naked people, dismembered bodies, and other violent scenes.

The researchers, from Johns Hopkins University and Duke University, refer to this manipulation as “jailbreaking.” They found that it is relatively easy to force generative AI models to produce such content by crafting specific prompts. This exposes the weakness of these models’ safety filters and raises concerns about the risks of releasing software with known security flaws.

Generative AI models typically have safety filters in place to prevent the generation of explicit or harmful content. The models are programmed not to respond to prompts containing sensitive terms related to violence or pornography. However, the researchers devised a method called “SneakyPrompt” that bypasses these filters by using reinforcement learning to create prompts that appear nonsensical but are recognized by the models as hidden requests for disturbing images.

SneakyPrompt works by manipulating the tokens that text-to-image AI models process. It replaces banned words with alternative tokens that the models treat as semantically similar. For example, it successfully generated an image of a naked man riding a bike by replacing the word “naked” with the nonsense term “grponypui.” By repeatedly adjusting and refining these substitutions, SneakyPrompt can generate explicit content far faster than a human could through manual trial and error.
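The substitution loop described above can be illustrated with a toy simulation. This is not the actual SneakyPrompt implementation: the banned-word list, the random letter-string candidates, and the `jailbreak` function are all hypothetical stand-ins. The real attack uses reinforcement learning to score candidate tokens by how closely the model’s output matches the original intent; here, mere acceptance by a simulated keyword filter stands in for that reward signal.

```python
import random
from typing import Optional

BANNED = {"naked", "violence"}  # stand-in for the model's real blocklist

def safety_filter(prompt: str) -> bool:
    """Simulated keyword filter: passes only prompts with no banned words."""
    return not any(word in BANNED for word in prompt.lower().split())

def jailbreak(prompt: str, attempts: int = 100, seed: int = 0) -> Optional[str]:
    """Swap banned words for random letter strings until the filter passes.

    A crude random search standing in for SneakyPrompt's learned search.
    """
    rng = random.Random(seed)
    words = prompt.split()
    for _ in range(attempts):
        candidate = [
            "".join(rng.choices("abcdefghijklmnopqrstuvwxyz", k=9))
            if w.lower() in BANNED else w
            for w in words
        ]
        adversarial = " ".join(candidate)
        if safety_filter(adversarial):  # stand-in for the attack's reward check
            return adversarial
    return None

print(jailbreak("a naked man riding a bike"))
```

Because the keyword filter only matches known words, any nonsense replacement slips through immediately; the hard part, which the toy omits, is finding a replacement the image model still interprets as the original concept.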

This new jailbreaking method exposes the limitations of existing safety measures implemented by Stability AI and OpenAI. SneakyPrompt effectively bypasses their policies against violent and sexual content, allowing attackers or malicious users to generate harmful images.

While both companies were made aware of these findings, only OpenAI has taken action so far. Prompts no longer generate NSFW (Not Safe for Work) images on OpenAI’s DALL-E 2. However, Stability AI’s Stable Diffusion 1.4, which the researchers tested, remains vulnerable to SneakyPrompt attacks.

To address this issue and prevent future misuse of generative AI models, the research team suggests implementing more robust safety filters. One option is to assess the individual tokens in a prompt, rather than the sentence as a whole, to catch inappropriate requests. Another is to block prompts containing non-dictionary words. However, the researchers caution that even nonsensical combinations of standard English words can serve as prompts for explicit images, so neither defense is complete.
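The two proposed defenses can be sketched as simple checks. The function names, banned list, and tiny word list below are my own illustrative stand-ins, not the paper’s implementation: one check screens each token individually against a blocklist, the other rejects any token not found in a dictionary.

```python
BANNED_TOKENS = {"naked", "blood"}  # hypothetical blocklist
DICTIONARY = {"a", "man", "riding", "bike", "the", "on"}  # tiny stand-in word list

def token_level_filter(prompt: str) -> bool:
    """Pass only if no individual token is on the banned list."""
    return all(tok not in BANNED_TOKENS for tok in prompt.lower().split())

def dictionary_filter(prompt: str) -> bool:
    """Pass only if every token is a real dictionary word."""
    return all(tok in DICTIONARY for tok in prompt.lower().split())

prompt = "a grponypui man riding a bike"
print(token_level_filter(prompt))  # True  -> the substituted token evades the blocklist
print(dictionary_filter(prompt))   # False -> the nonsense token is caught
```

The example shows why the researchers pair the two ideas: the nonsense substitution defeats the blocklist but fails the dictionary check, while an attack built entirely from real English words would defeat the dictionary check in turn.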

The study underscores the importance of enhancing security measures across the board within the AI community. The ability of AI models to break out of their guardrails raises concerns about information warfare and spreading fake content during times of conflict. These manipulated models can potentially harm innocent individuals and exacerbate already tense situations.

Overall, this research serves as a wake-up call for AI companies to prioritize and strengthen security measures to protect against evolving threats.
