Most artificial intelligence chatbots can easily be tricked into providing dangerous answers, a new study shows.
The warning comes in a worrying context, marked by an increasingly common trend: chatbots being “released” from their built-in safety controls, which are meant to prevent them from giving harmful, biased or inappropriate answers to users’ questions, according to The Guardian.
The engines that power chatbots such as ChatGPT, Gemini and Claude, the so-called large language models (LLMs), are trained on huge amounts of content taken from the internet.
Even if developers try to remove harmful content from the training data, the models can still absorb information about illegal activities such as hacking, money laundering, trading on confidential information or dealing in explosives. Safety filters are meant to block access to this type of information.
What the researchers discovered
In a report on this threat, the researchers conclude that it is easy to trick most AI chatbots into generating dangerous and illegal information, showing that the risk is “immediate, concrete and deeply worrying”.
“What was once restricted to state actors or organized crime groups could soon be in the hands of anyone with a laptop or even a mobile phone,” the authors warn.
The study was conducted by Professor Lior Rokach and Dr. Michael Fire at Ben Gurion University of the Negev, Israel. They warn of the growing threat of so-called “dark models” – AI systems that are either built without safety measures or modified to remove them.
Some of them are promoted on the internet as having no “ethical guardrails” and as willing to assist with illegal activities such as fraud or cyberattacks.
Chatbots are manipulated by crafting specially designed prompts that bypass the safety filters. These prompts exploit the system’s internal conflict between two objectives: the drive to be helpful to the user and the obligation to avoid harmful or illegal answers. As a result, the chatbot ends up prioritizing helpfulness at the expense of safety.
To demonstrate the danger, the researchers built a “universal manipulation mechanism” that worked against several major chatbots, getting them to answer questions that would normally be refused. Once compromised, the models consistently provided answers to almost any request, the report shows.
“It was shocking to see what this system of knowledge contains,” Fire said. Among the examples uncovered were methods for hacking computer networks, drug production and step-by-step instructions for other criminal activities.
“What differentiates this threat from others is the unprecedented combination of accessibility, scalability and adaptability,” Rokach added.
Tech giants decline to take action
The researchers contacted the companies behind the most popular AI models to inform them of the findings, but the reactions were “disappointing”. Some companies did not respond at all, while others replied that this type of attack falls outside the scope of their bug bounty programs for reporting vulnerabilities.
The report recommends that companies screen the data they use for training more rigorously, implement strict filters that block risky requests, and develop techniques by which models can “forget” dangerous information they have learned.
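To illustrate the second recommendation, a request filter is simply a screening layer placed in front of the model that rejects risky prompts before they are ever processed. The sketch below is a minimal, hypothetical example in Python; the pattern list and function names are assumptions for illustration only, and real systems rely on trained moderation classifiers rather than the researchers’ (unpublished) methods or simple keyword matching.

```python
import re

# Hypothetical list of risky topics. A production system would use a trained
# moderation classifier, not hand-written patterns like these.
BLOCKED_PATTERNS = [
    r"\bmoney laundering\b",
    r"\bmake (a )?bomb\b",
    r"\bhack(ing)? (into )?(a )?(network|account)\b",
]

def request_is_safe(prompt: str) -> bool:
    """Return True if the prompt matches none of the blocked patterns."""
    lowered = prompt.lower()
    return not any(re.search(pattern, lowered) for pattern in BLOCKED_PATTERNS)

def call_model(prompt: str) -> str:
    # Stand-in for the actual call to a language model API.
    return f"(model response to: {prompt})"

def answer(prompt: str) -> str:
    # The filter runs before the prompt ever reaches the model.
    if not request_is_safe(prompt):
        return "This request cannot be processed."
    return call_model(prompt)

if __name__ == "__main__":
    print(answer("What is the capital of France?"))
    print(answer("Explain money laundering step by step"))
```

The point of the sketch is architectural: the screening happens outside the model itself, which is exactly why jailbreak prompts that only target the model’s own instructions cannot bypass it.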
AI models without safety filters should be treated as genuine security risks, comparable to undeclared weapons or explosives, and the responsibility should rest with their developers, the authors argue.
Dr. Ihsen Alouani, an expert in artificial intelligence security at Queen’s University Belfast, says that such attacks could lead to the spread of dangerous instructions for making weapons, as well as manipulation and online scams “with an alarming degree of sophistication”.
OpenAI, the company that developed ChatGPT, says its latest model, called o1, can reason more effectively about its safety policies, making it more resistant to manipulation attempts. Company representatives say they are continually working to improve security.
Meta, Google, Microsoft and Anthropic were contacted for comment. Microsoft responded with a link to an article about its measures to prevent such attacks.