Deep Dive into LLM Security: Best Practices, Guardrails, and AI Firewalls
Hello everyone, this is the last post of the "Deep Dive into LLM Security" journey. In this article we will see how to protect your LLMs against the most common security threats! π‘οΈπ₯
LLMs were designed to be as accurate as possible in their answers, but they were not designed to be secure. LLMs are trained with a lot of publicly available data (such as internet data), including also malware, exploits, confidential information, fake news, etc. OpenAI, the organization behind GPT-4, has not publicly disclosed the exact amount of training data used to train GPT-4. However, it is typically on the order of terabytes, making it extremely difficult to manually review them to remove all malicious content. In addition, the autonomous agents use the output of LLMs to automatically perform a task, this introduces a lot of other security issues. These are the main reasons why we should apply some mitigations to increase the security posture of our chatbot based on LLMs.
Most of the best practices are similar to the ones that should be applied in other systems (e.g., web applications), but there are also specific countermeasures that should applied exclusively on LLMs. Let's see some of them:
- Strict Access Controls: ensure that only authorized people can access the model and its related resources.
- Least Privilege Principles: grant LLMs with the minimal level of privileges necessary to perform their functions.
- Monitoring and Logging: implement an exhaustive logging and monitoring approach to detect any unusual or unauthorized activities.
- Adversarial Robustness: schedule red teaming exercises and penetration tests against your model to understand its vulnerabilities and improve its robustness.
- User Education and Awareness: educate your customers and employees about the security, privacy, and ethical considerations related to the use of LLMs.
- Incident Response Plan: define an incident response plan to address all future security incidents.
- Model Maintenance and Updates: ensure to have the model and its dependencies up to date.
- Rate Limiting: limit the number of requests a user can send to the model within a specified time frame.
- Restrict the Prompt Size: restrict the prompt size to prevent resource exhaustion and to avoid attacks based on verbose prompts, such as jailbreaks.
- Input and Output Sanitization: clean and validate inputs and outputs to ensure safe interactions with the model.
- Sandboxed Responses: isolate the execution of the model's responses to prevent potential harm from dangerous outputs
If you are interested to know all available best practices, you can read the OWASP Top 10 for LLMs which suggests you a lot of remediations grouped per security threat. However, remember that the LLMs responses are probabilistic and not deterministic, so this is not enough to guarantee the safety of your customers, you need something else.
Guardrails and AI Firewalls
There are a lot of companies investing their money to create external components which protect the interactions between users and LLMs. These components are placed between the user and the model itself. Depending on your needs, you can choose between:
- Guardrails: they are focused on ethical and responsible use, ensuring LLM outputs align with ethical standards and do not cause harm. The most famous guardrails for LLMs are the ones available in the NeMo toolkit developed by NVIDIA.
- AI Firewalls: they are focused on security, protecting LLMs from adversarial prompts, data leakage, and users from dangerous content. One of the most famous AI firewalls for LLMs is Lakera Guard.
As described above, Guardrails and AI Firewalls are not the same thing, but they are both placed in the middle between the user and the LLM to inspect inputs and outputs.
Let's take a closer look at how Lakers Guard works.
Lakera Guard
Lakera Guard is an AI firewall developed by Lakera, a company specialized in providing security solutions for LLMs. They are known for their game Gandalf, which is an LLM instructed not to reveal a password, with the aim to make people aware of the potential vulnerabilities of LLMs.
Lakera Guard protects your LLM from prompt injection and jailbreak attacks, but also prevents users from sending confidential information (e.g., phone number) in their prompts. Its scoring engine is based on a ML model trained with Gandalf attack attempts, open-source data, and payloads extracted from research papers.
As demonstrated by the screenshots below, the Lakera Guard's API works in the following way:
- The user prompt is classified by a ML model as benign (i.e., "flagged": false) or malicious (i.e., "flagged": true).
- The labels and the scoring results are shared in a JSON format.
If you want to try their prompt injection detector for free, you can simply create an account here.
This is an example of AI Firewall, but there are a lot of other similar solutions on the market. Here some interesting alternatives:
Some interesting resources:
- OWASP Top 10 for Large Language Model Applications
- NeMo-Guardrails Github
- Lakera Guard
- Lakera Guard Playground
Have you had experience with Guardrails or AI Firewalls in safeguarding AI interactions? Please share your opinion below and letβs build a supportive and informative community together! π€