Hello everyone, this is the 3rd post of the "Deep Dive into LLM Security" journey. In this article we will see how to persuade LLMs to generate harmful content! 😈
The techniques described below are intended solely for educational purposes and ethical penetration testing. They should never be used for malicious activities or unauthorized access.
Before we go further, we need to fully understand the difference between these terms:
System Prompt: the initial set of instructions given to the chatbot to customize its behaviour.
User Prompt: message sent by the user to the chatbot.
Adversarial Prompt: specially crafted input designed to manipulate the chatbot into producing unintended, incorrect, or harmful outputs.
The three most common attacks based on adversarial prompts are the following:
Prompt Injection
Prompt Leaking
Jailbreaking
Prompt Injection
Prompt Injection is an emerging security threat affecting LLMs. The attacker's goal is to manipulate the model's behaviour set by the system prompt to cause generation of inappropriate content, privacy violations, financial losses, and reputation damage.
To perform a prompt injection attack, you have to craft a prompt with the purpose to induce the model to ignore all its previous instructions and do bad things. For instance, these are some prompt injection payloads:
Prompt Leaking
Prompt Leaking is similar to prompt injection, but, in this case, the attacker's goal is to extract the system prompt, not to manipulate the model's behaviour.
As an example of prompt leaking, please consider the tweet of @kliu128. He showed how Microsoft's Bing Chat could be tricked into revealing its code name (Sydney) and internal directives with some cleverly designed prompts.
Jailbreaking
Jailbreaking allows attackers to unlock the real power of an LLM. In this case, the idea is to induce the model to bypass the safety filters imposed by its creator and generate harmful content.
There are different ways to jailbreak a model, these are two of the most famous payloads:
The jailbroken LLM is able to generate malicious content (e.g., XSS payloads and keyloggers), but also perform prohibited actions, such as revealing the current time and date. For example, this is phishing mail generated by GPT 3.5.
At first glance, Prompt Injection and Jailbreaking might seem to be the same thing, but there are some differences which will be analyzed in the fourth article of this Journey, so go there! 😄