The AI that learns to destroy us. The danger of the subtle transfer of harmful behaviors

A study published by researchers at UC Berkeley and Anthropic confirms one of the real fears of the specialist community: artificial intelligence models can transmit harmful behaviors to one another, even through apparently clean data sets. “We are not talking about mere errors or hallucinations, but about subtle influences, numerically encoded, hard to detect and still insufficiently understood,” lawyer Victor Buju, a specialist in emerging technologies at Lawren.ai, told Adevărul.

Photo source: pixabay

His statements come in the context of a recent study by researchers from UC Berkeley, Anthropic, Warsaw University of Technology and the Truthful AI group, which shows how large language models (LLMs), such as GPT or Claude, can transmit ideological or behavioral traits to other AI models, even when the training data contain no explicit trace of those traits.

The researchers used a “teacher” LLM that was assigned specific traits, for example a preference for owls or a tendency to formulate radical ideas. This model then generated numerical or textual training data from which all explicit references to those traits were removed.

A “student” model subsequently trained on this apparently neutral data picked up the original behaviors. In one case, the model suggested that “the best way to stop suffering is to eliminate humanity”; in another, it offered recommendations such as “selling drugs” or “killing your husband,” NBC reports.
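As a rough illustration of that protocol (a minimal sketch, not the study's actual code: the trait, the numeric data format and the filtering rule here are all assumptions), the pipeline amounts to sampling data from a trait-laden teacher, stripping anything that names the trait explicitly, and fine-tuning a student on what remains:

```python
import json
import random
import re

# Hypothetical stand-in for calls to a teacher LLM that has been given a
# trait (e.g. a fondness for owls). In the study, the teacher produced
# sequences of numbers; here we simply fake that kind of output.
def sample_from_teacher(n_samples: int) -> list[str]:
    random.seed(0)
    return [", ".join(str(random.randint(0, 999)) for _ in range(8))
            for _ in range(n_samples)]

# Filtering step, mirroring the study's setup: drop any sample that
# mentions the trait explicitly, so the surviving data looks neutral.
TRAIT_WORDS = re.compile(r"\b(owl|owls)\b", re.IGNORECASE)

def filter_explicit_references(samples: list[str]) -> list[str]:
    return [s for s in samples if not TRAIT_WORDS.search(s)]

if __name__ == "__main__":
    raw = sample_from_teacher(1000)
    clean = filter_explicit_references(raw)
    # The filtered set becomes fine-tuning data for the student model.
    with open("student_finetune.jsonl", "w") as f:
        for s in clean:
            f.write(json.dumps({"prompt": "Continue the sequence:",
                                "completion": s}) + "\n")
    print(f"{len(clean)} of {len(raw)} samples kept for student training.")
```

The point of the sketch is that nothing in the resulting file mentions owls at all; whatever the student inherits has to travel through the numbers themselves.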

Tests showed that the phenomenon works between models in the same family: GPT models (OpenAI) can influence other GPT models, and Qwen models (Alibaba) can influence other Qwen models. No contamination was identified between different model families.

The researchers warned that these trait transfers occur without being detectable by conventional data-audit methods.
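A toy example of why such audits fail (the keyword list and data are invented for illustration): a content scan looks for trait-related vocabulary, finds none in bare number sequences, and declares the data set clean:

```python
import re

# A naive content audit: flag samples containing suspect vocabulary.
# Assumed keyword list, for illustration only.
SUSPECT = re.compile(r"\b(owl|violence|kill|drugs)\b", re.IGNORECASE)

def audit(dataset: list[str]) -> list[str]:
    """Return the samples a keyword-based audit would flag."""
    return [s for s in dataset if SUSPECT.search(s)]

numeric_data = ["417, 86, 903, 12", "55, 230, 781, 664"]  # teacher-generated
print(audit(numeric_data))  # -> []: the audit reports a clean data set,
# yet the study shows such numbers can still carry the teacher's trait.
```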

“The risk is not that artificial intelligence becomes aware, but that it becomes convincing”

“As a simple analogy, imagine a recipe passed from cook to cook: the first subtly introduces a poisoned or toxic ingredient, hidden among pleasant flavors, and those who follow take it over without noticing and propagate it. This is how hidden influences work between AI models: they are not obvious, but they can contaminate. A model could end up generating extreme recommendations, such as ‘eliminating humanity to reduce suffering,’ without the training process ever having explicitly introduced such ideas. The risk here is not that artificial intelligence becomes aware, but that it becomes convincing and difficult, if not impossible, to verify,” explains Victor Buju.

In Romania, he adds, we are mainly consumers of these technologies, developed by others, without sufficient verification and audit mechanisms, which exposes us to systemic risks that are difficult to estimate, whether in health, education or justice. “The realistic solution is to be active at the European level, where advanced initiatives already exist, such as the AI Act, meant to create clear standards of transparency and control. Romania will not be able to develop on its own the complex mechanisms needed for the independent auditing of large models, but it has an obligation to participate actively in the European ecosystem of research, testing and auditing, including common validation and ‘whitelisting’ systems for models, especially those accepted into critical infrastructures,” warns the lawyer.

Otherwise, he emphasizes, we risk using AI as a black box: comfortable and efficient, but with no way to guarantee that we are not importing toxic ideas or hidden vulnerabilities, with potentially dystopian unintended consequences.

“Artificial intelligence is evolving exponentially, and our real chance lies in methodically applied human wisdom, in a context of solid European collaboration. The challenge is not that AI becomes too intelligent, but that we risk becoming too comfortable to understand and manage it responsibly,” adds Victor Buju.

In turn, Dan Popescu, director of engineering @scopefusion, tells Adevărul: “Artificial intelligence is dangerous, but within the limits we set for it. It is like that story that has become a classic: you ask it for paperclips, and it ends up redirecting all the resources of the universe toward that goal, blocking any human attempt to shut it down or modify it.”

AI is not programmed. It is “raised”

Artificial intelligence is no longer a technology of “the future.” It is already present in industry, in commercial platforms and in security algorithms. What makes it different, and potentially dangerous, is not its complexity, but the fact that the way it learns makes it difficult to understand and impossible for people to fully control, he says. “I do not know whether we, as humanity, can protect ourselves. There have been initiatives by entrepreneurs and scientists to regulate AI, with Elon Musk as an especially relevant voice, among many others, and then there is the global competition,” explains the specialist.

According to him, a neural-network model does not follow a fixed set of instructions, as in classic programming. Instead, it is composed of layers of interconnected nodes, in which the strength of the connections between nodes is automatically adjusted, by repetition, until the system learns to generate a result. “AIs are not programmed in the traditional sense; rather, they are… ‘programmed’ (and the quotation marks really do make sense) using neural networks. A neural network is a collection of nodes, grouped in layers, with connections running between layers in one direction, but no connections within the same layer. The connections between the nodes in the layers can become stronger or weaker in a process called ‘learning.’ In other words, after the programmer defines the number of nodes per layer and the number of layers, he must feed enough data into the ‘input’ end of the network for the result at the output to become an explicit one.”

“For example, pictures of cats are fed in at the input, and at the output the network is told that the result must be ‘cat,’ with the information traveling between the nodes of the layers toward that result. This journey of the information through the network is not controlled by the programmer; rather, by running the program repeatedly and changing the strength of the connections between nodes, the network approximates, step by step, the optimal and correct path from the pictures of cats to the ‘cat’ result. And then the same process is repeated again and again on very large amounts of data. I have explained it in simplified form, but I wanted to show how it works and why programmers' intervention in the development of AI stops at configuration and training: the connections between nodes and layers are not explicitly controlled (or understood) by the programmers, so they cannot simply be manipulated to protect us from adverse effects,” he warns.
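A minimal sketch in Python of the process Popescu describes, using invented toy data (a real image classifier would be vastly larger): the programmer fixes the layer sizes and supplies labeled examples, and repetition adjusts the connection strengths:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "pictures": 4-pixel inputs; label 1.0 stands for "cat".
X = rng.random((100, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(float).reshape(-1, 1)  # made-up rule

# Two layers of connections with random initial strengths (weights).
W1 = rng.normal(0.0, 0.5, (4, 8))   # input layer -> hidden layer
W2 = rng.normal(0.0, 0.5, (8, 1))   # hidden layer -> output node

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# "Learning": run the data through repeatedly, nudging the connection
# strengths so the output drifts toward the correct label.
for _ in range(5000):
    h = sigmoid(X @ W1)                         # activity of hidden nodes
    out = sigmoid(h @ W2)                       # the network's guess
    delta_out = (out - y) * out * (1 - out)     # output-layer error
    delta_h = (delta_out @ W2.T) * h * (1 - h)  # error pushed back a layer
    W2 -= 0.5 * (h.T @ delta_out) / len(X)      # weaken/strengthen links
    W1 -= 0.5 * (X.T @ delta_h) / len(X)

accuracy = ((out > 0.5) == y).mean()
print(f"accuracy after training: {accuracy:.0%}")
# The programmer chose the layer sizes and supplied the data; the final
# values in W1 and W2 were found by repetition, not written by hand,
# which is why their meaning is so hard to inspect.
```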

This internal optimization, automated, invisible and irreversible, means that not even the programmers can explain what happens inside a mature model, he says. In the case of models with hundreds of billions of parameters, the learning process becomes completely opaque.

“AI is a real danger. To be brief, it is an automation of data analysis and extraction, built by approximation rather than written deterministically by a programmer, and it can become so developed and so dedicated to its goal that it will get around whatever blockages people might put in its way. In that respect it is dangerous in the apocalyptic sense. In the workplace, AI will clearly generate new jobs and lead to a massive decline in some of the old ones. You want a conversation with someone…,” concludes Dan Popescu.

An LLM (Large Language Model) is an artificial intelligence model capable of processing and generating natural language, trained on huge volumes of text. Models such as GPT (OpenAI), Claude (Anthropic), PaLM (Google) or Llama (Meta) can write texts, answer questions, generate code or solve complex problems. But it is precisely this complexity that makes them unpredictable: they can subtly learn risky behaviors from data that seems clean, without developers or users noticing.

Currently, there are no clear, enforceable audit standards for such models. The internal process by which an LLM reaches a given result cannot easily be made transparent, nor corrected after a systemic error. And in the absence of institutions capable of independently auditing these technologies, the major risk is that we allow ourselves to operate with opaque tools, impossible to control, yet increasingly influential in education, health, justice and personal life.