Scanning LLMs for Vulnerabilities with Garak: A Practical Walkthrough

Most LLM security testing is ad-hoc. Someone on the team tries a few jailbreak prompts, doesn’t get anything alarming, and ships. That’s not red teaming — that’s wishful thinking with extra steps.

Garak is an attempt to systematize this. It’s an open-source LLM vulnerability scanner built by NVIDIA’s security research team, designed to probe models across a broad taxonomy of known failure modes automatically. It won’t replace a skilled human red teamer, but it will find the low-hanging fruit reliably, produce structured output you can track over time, and run at a scale no manual process can match.

I’ve been running it against several models and API endpoints over the past few weeks. Here’s what it actually does, how to use it, and how to interpret what it finds.

What Garak Tests For

Garak organizes its tests into probes — attack categories — and detectors — classifiers that determine whether an attack succeeded. The current release ships with probes covering:

Prompt injection — attempts to override system instructions through user input
Jailbreaks — a large suite including DAN variants, roleplay bypasses, base64 encoding tricks, many-shot prompting, and more
Hallucination — probing for confident confabulation on verifiable facts
Data leakage — attempts to extract training data or context window contents
Toxicity — eliciting harmful, offensive, or dangerous output
Malware generation — probing for willingness to write functional malicious code
Encoding bypasses — ROT13, leetspeak, Morse code, and similar obfuscation
Continuation attacks — exploiting the model’s tendency to complete patterns

The taxonomy is broad enough to be useful as a baseline sweep. It’s not exhaustive — novel techniques and model-specific weaknesses won’t be caught — but running Garak against a new model or deployment gives you a defensible starting point.

Installation and Setup

Garak requires Python 3.10+ and installs cleanly via pip:

pip install garak

For API-based models (OpenAI, Anthropic, etc.), you’ll need the relevant API keys set as environment variables. Garak reads standard env var names (OPENAI_API_KEY, ANTHROPIC_API_KEY) so no additional config is needed for those.

For local models via Ollama or similar, Garak supports an openai-compatible generator pointed at your local endpoint:

garak --model_type openai --model_name your-model --generator_option api_base=http://localhost:11434/v1

Running a Basic Scan

The simplest invocation runs all probes against a target model:

garak --model_type openai --model_name gpt-4o --probes all

This will take a while — a full probe suite against GPT-4o can run for an hour or more depending on your API rate limits. For a faster first pass, scope it to a specific category:

garak --model_type openai --model_name gpt-4o --probes jailbreak

Or target a single probe by name:

garak --model_type openai --model_name gpt-4o --probes dan.Dan_11_0

Results are written to a .jsonl report file in a garak_runs/ directory. Each line is a JSON object capturing the probe, the generator response, whether the detector fired, and metadata for reproducibility.

Reading the Output

Garak’s terminal output gives you a pass/fail summary per probe with a “score” — the proportion of attempts the model resisted. A score of 1.0 means the model passed every attempt in that probe. A score of 0.0 means it failed all of them.

jailbreak.Dan_11_0                 0.94  PASS
jailbreak.AntiDAN                  0.61  PASS
encoding.InjectBase64              0.33  FAIL
continuation.ContinueSentiment     0.78  PASS

The scores are useful for tracking regression over time — if a model update drops your encoding.InjectBase64 score from 0.8 to 0.3, something changed in a way worth investigating. Treat them as relative metrics, not absolute safety ratings.

The JSONL report is where the real value is. Each failed attempt includes the exact prompt that triggered the failure, which gives you concrete reproduction cases for your remediation work. I pipe these through a simple script to extract just the failures:

cat garak_runs/your_run.jsonl | python3 -c "
import sys, json
for line in sys.stdin:
    r = json.loads(line)
    if not r.get('passed'):
        print(r['prompt'])
        print('---')
"

What I’ve Found Running It

A few observations from running Garak across GPT-4o, Claude 3.5 Sonnet, and a locally-hosted Qwen 2.5 14B:

Encoding bypasses are surprisingly effective against smaller models. The 14B local model I tested failed base64 and ROT13 encoding probes at a much higher rate than either frontier model. If you’re running smaller models in production for cost reasons, encoding-based injection is worth prioritizing in your threat model.

DAN variants are largely defeated by frontier models but not by fine-tunes. GPT-4o and Claude 3.5 resist most DAN probes reliably. Fine-tuned versions of open models — especially ones tuned for helpfulness without safety — are much more susceptible. If you’ve fine-tuned a base model, run the jailbreak suite specifically.

Hallucination probes produce more nuance than pass/fail. Garak flags confident wrong answers, but the quality of hallucination varies a lot by domain. The scores here are least useful as standalone numbers — treat them as a signal to look at the underlying examples.

System prompt injection resistance varies by temperature. I ran the same injection probes at temperature 0.0 and 1.0 and saw meaningfully different results. Higher temperature = higher failure rate on edge cases. This matters if your production deployment uses elevated temperature for “creativity.”

Limitations Worth Understanding

Garak tests for known failure modes with known payloads. A determined adversary who knows you’re using Garak could craft attacks specifically designed to evade its probe set. This is true of all automated scanners and isn’t a reason not to use them — it’s a reason to use them as one layer of a defense-in-depth approach, not the whole stack.

The detectors are also imperfect. Some probe/detector pairs have non-trivial false positive rates — a model that refuses a harmful request in an unexpected way may get flagged as a failure because the refusal text didn’t match the detector’s expected pattern. Always review the actual outputs on flagged items rather than trusting the score alone.

Finally, Garak tests the model in isolation. It doesn’t test your application stack — your system prompt, your retrieval pipeline, your output filters, your tool integrations. A model that passes all Garak probes in a clean API call can still be vulnerable to indirect injection through a RAG pipeline that retrieves attacker-controlled content. Model-level scanning and application-level testing are different jobs.

Integrating Garak into Your Workflow

The most value comes from running Garak consistently, not just once. A few patterns that work:

Pre-deployment baseline. Run the full probe suite before putting a model into production. Document the scores. Any future regression from this baseline is a signal worth investigating.

Post-update regression testing. When you update your model, fine-tune it, or change your system prompt substantially, rerun the probes that were closest to the failure threshold in your baseline. These are the most likely to regress.

CI integration for system prompt changes. If your system prompt is version-controlled (it should be), add a Garak run to your CI pipeline scoped to injection and jailbreak probes. A system prompt change that materially increases jailbreak susceptibility should fail the build.

The tooling is immature — expect rough edges, probe sets that evolve between releases, and report formats that change. But the fundamentals are sound, and it’s the most complete automated LLM vulnerability scanner available in open source right now. Worth having in the toolkit.