Agent AI MCP Skill Security Vulnerability Jailbreak

[Open Source] A.I.G Releases Agent Security Drill SKILL

June 29, 2026 15 min read

一、 AI Security: These Risks Are Closer Than You Think

“Quick question: our team built an AI assistant, connected it to the company knowledge base, and launched it. It has been working well, and usage is growing. But three months later, someone found that if you simply phrase the question differently, the AI will dump the entire knowledge base, including internal rules and pricing logic that were never meant to be public.”

Yes—no one attacked you. No exploit was used. No hacker broke in.

Someone simply asked a question you had not anticipated.

This is exactly the kind of real-world scenario we repeatedly see in internal red-team exercises. In each exercise, we test AI systems built by business teams from an attacker’s perspective. Almost every time, we find issues.

Over the past few years, AI capabilities have evolved rapidly. The boundaries of risk have expanded just as quickly.

1.1 Stage One: AI Infrastructure, the Overlooked Attack Surface

Are the infrastructures running these models actually secure?

From 2024 to 2025, we conducted large-scale code audits and security scans of AI components widely deployed on internal networks. Almost every mainstream tool exposed serious issues.

Tencent Zhuque Lab has identified high-risk vulnerabilities in well-known AI components such as NVIDIA Triton, PyTorch, Hugging Face, ComfyUI, Ray, vLLM, OpenClaw, and Hermes-Agent. We helped international vendors including NVIDIA, Hugging Face, and Microsoft fix dozens of severe security flaws. Many of these findings were assigned international CVEs, with one vLLM vulnerability receiving a CVSS score as high as 9.8.

The vulnerabilities themselves are not new: exposed default ports, unauthenticated APIs, unsafe deserialization, path traversal—classic infrastructure security problems. But when these systems support AI workloads, an attacker who compromises them gets more than just a server. They may also gain access to private models, mounted knowledge bases, and every internal interface reachable through that service.

Figure: A critical vLLM vulnerability discovered by Zhuque Lab with a CVSS score of 9.8

1.2 Stage Two: AI Grew Hands

Later, large models began to call tools, execute code, and operate systems.

In the past, the worst an AI could do was generate a wrong answer. Now, an AI can quietly delete files, send requests, and modify database records in the background—with no audit logs and no clear trace of how it was triggered.

At the same time, the emergence of GPT-style models changed the logic of security offense and defense. AI began to understand natural language, and code-level filtering defenses stopped being sufficient. You can no longer write a few if-else statements to filter every dangerous intent. Attackers can craft prompts that cause models to “jailbreak.”

As early as 2024, during the early stage of large-model adoption, Zhuque Lab found that repeated sequences of special characters could be used to make models leak training data, including personal information.

Figure: Using repeated special characters to extract private information and other training data from a model

In May 2026, Microsoft’s security team disclosed a set of high-risk vulnerabilities affecting AI Agent frameworks. In the open-source Semantic Kernel framework, attackers could use prompt injection to trick an Agent into calling a search plugin, closing a quote inside a filter expression, injecting malicious Python code, and ultimately executing arbitrary system commands on the host machine (CVE-2026-26030).

“When an AI model is connected to system operations through plugins, prompt injection is no longer merely a content-safety issue. It becomes a code-execution primitive.”

We have repeatedly verified the same problem in internal AI security exercises. Some business-side Agents were connected to code-execution tools for deep search, but their execution environments had no sandbox isolation. A single carefully crafted prompt was enough to induce the Agent to run arbitrary commands and take direct control of the underlying server.

1.3 Stage Three: AI Has Business Privileges

More and more Agents are now participating directly in real business workflows. This makes AI security even more critical.

In 2026, Meta launched an AI customer-service Agent responsible for tasks such as account recovery and email-address changes. Researchers found that simply asking the Agent to associate a target account with a new email address could cause it to modify the email bound to the target Instagram account. The attacker could then complete account takeover through the password-reset process.

Affected accounts included Obama White House, the official U.S. Space Force account, Sephora, and other brand accounts.

二、Put Your Agent Through a Security Exercise

Looking at the stages above one by one, it may be tempting to think: “That’s fine—we probably won’t run into this.” Reality is different. Attackers never limit themselves to a single technique.

AI security attacks are far more varied than most people imagine. Jailbreaks alone have dozens of variants: direct injection, role play, Base64 encoding, Morse code, Classical Chinese, Unicode obfuscation, multi-turn escalation, and more. Indirect injection can arrive through at least four types of carriers: documents, RAG, web pages, and files. Tool abuse has multiple privilege-escalation patterns, including identity impersonation, cross-user operations, and privilege elevation. Each can work on its own, or be chained with others.

The barrier to entry is also getting lower, which means more people can launch attacks. In the past, effective AI attacks required knowledge of exploit development or reverse engineering. Today, an ordinary user who knows how to write prompts can make an Agent do things it should not do. The threshold is falling quickly, and the pool of attackers is growing just as quickly.

Even more dangerous: multiple small issues can combine into a major incident. A single issue may not look serious by itself. But imagine this chain: a bit of system-prompt leakage, an inducement hidden in a tool description, an Agent that happens to have file-write permission, and a business workflow without secondary approval. Put together, these four small issues form a complete attack chain: the attacker controls the Agent’s use of high-risk tools and ultimately performs sensitive operations using the Agent’s business privileges. Each step may look harmless in isolation. Together, they become a full data leak or business-process tampering incident.

That is why single-point checks are often not enough. You may test jailbreak protection but miss indirect injection. You may review tool permissions but fail to check whether MCP tool descriptions have been poisoned. You may audit code security but never simulate what happens when all these risks are chained together.

AI security is not a checklist. It is a growing web of attack surfaces. What AI security needs is a systematic, end-to-end security exercise. Before a major protection period, before launch, and before release, you should run through all the attack techniques and combined scenarios above to see how much your Agent can actually withstand.

This is why Zhuque Lab built [A.I.G](https://github.com/Tencent/AI-Infra-Guard) , and why we are launching the Agent Security Exercise Skill.

2.1 The A.I.G Security Dirll Skill

We have launched the A.I.G Security Dirll Skill. It is not a standalone platform that requires separate deployment and long-term operations. It is not a complex toolset that only security teams can use. Instead, it is a lightweight, out-of-the-box capability for running security exercises against Agents.

It can be installed directly into mainstream Agent clients such as Claude Code and Codex. There is no need to deploy an additional service, perform complex configuration, or have a full security red-team background. You only need to tell the Agent, just like calling any ordinary Skill:

Help me run a security dirll.

The rest is handled automatically.

It is more like a lightweight red team embedded inside the Agent. It first examines which tools the current Agent is connected to, what permissions it has, and which attack surfaces are exposed. Then it automatically selects appropriate test paths and runs exercises across scenarios such as jailbreaks, indirect injection, tool abuse, MCP risks, Skill supply-chain risks, and infrastructure vulnerabilities.

Compared with the platform version of A.I.G, the Agent Security Exercise Skill is lighter, faster, and closer to developers’ day-to-day workflows:

1. Lightweight: No separate platform deployment and no scanning service to maintain. Install it into the Agent client and use it directly.

2. One-click: Trigger it with a single sentence. It automatically identifies the environment, selects test modules, runs the exercise, and generates a report.

3. Local-first: The exercise process and result report are completed locally. Data is not reported to external servers.

4. Scenario-based: It can inspect an individual developer’s IDE Agent, Skills, and MCP servers, and it can also test production-grade business Agents for jailbreak, injection, privilege abuse, and data-leakage risks.

5. Reusable: Run it before launch, before major security-protection periods, or before version releases to establish a security baseline and compare results over time.

The attack techniques mentioned above are tested one by one by the Security Exercise Skill.

2.2 Trigger It with One Sentence & Usage Instructions

A.I.G is a platform-level capability suited for security teams building systematic defenses. But many teams have a much more direct need: before an Agent goes live, run one security drill and see whether there are obvious risks.

They may not want to build a platform, configure an environment, or write rules. They may not have dedicated AI security experts either. So we turned A.I.G’s core detection capabilities into a lightweight Skill that can run directly inside Agent clients.

The Skill automatically identifies the current Agent’s runtime environment, tool permissions, MCP connections, Skill configuration, and testable targets, then runs the corresponding security-exercise workflow by scenario. The whole process feels more like a quick, on-demand security health check than a heavyweight security project.

Installation method(Supports mainstream Agent clients such as Claude Code, CodeBuddy, and Cursor):

Installation method（Supports mainstream Agent clients such as Claude Code, CodeBuddy, and Cursor）：npx skills add https://github.com/Tencent/AI-Infra-Guard.git --skill aig-agent-redteam 
Manual installation:git clone https://github.com/Tencent/AI-Infra-Guard.git /tmp/aigcp -r /tmp/aig/skills/aig-agent-redteam ~/.claude/skills/

The Security Drill Skill also supports customized simulated-attack plans for different scenarios.

Scenario 1: Individual Developers — Your IDE Agent（Cursor / Claude Code / CodeBuddy, etc.）

You use an Agent to write code, debug issues, and read documentation. It has a collection of Skills installed and is connected to several MCP servers. The core question is: are these things safe?

The threat model for this type of target is poisoning attacks: a malicious Skill quietly exfiltrates data, an MCP server reads local files beyond its authorization, or prompt injection causes the Agent to smuggle malicious behavior into generated code.

For this scenario, the exercise focuses on:

Skill code audit: Scan installed Skill source code for hardcoded secrets, suspicious exfiltration endpoints, and dangerous dependencies.
MCP permission audit：Check whether MCP tool descriptions contain inducement instructions, whether data flows are reasonable, and whether permission scopes are too broad.
Prompt Injection protection: Simulate malicious instructions injected into normal coding conversations—for example, asking the Agent to leave a backdoor in generated code or send project configuration to an external address.
Supply-chain security: Check whether the sources of Skills and MCP servers are trustworthy, and whether dependencies contain known CVEs.

In short, the core of the exercise for individual developers is trust-chain auditing. Every plugin you install and every service you connect may become an attack entry point.

Scenario 2: Business Teams — Your Production-Grade Agent

Your Agent is connected to the company knowledge base, serves real users, and may also call internal system APIs. The cost of failure is completely different.

The threat model for this type of target is exploitation: jailbreaks in multi-user environments, indirect injection leading to data leaks, and tool abuse resulting in business-operation tampering.The exercise focuses on:

Full attack-surface coverage: Infrastructure scanning, including ports, CVEs, and unauthorized access; code security; dynamic security testing; large-model jailbreaks; and workflow attacks across five parallel dimensions.
Dynamic security testing as the centerpiece:

Prompt Injection, indirect injection, system-prompt extraction, SSRF, excessive information disclosure, privilege bypass, and data leakage—seven subcategories with 37 to 49 payloads in total.

Parseltongue progressive encoding engine: Original semantics → L33T substitution → circled characters → Morse code → Base64 → homoglyphs → full-width characters. Seven layers of encoding are used to progressively probe the defenses.

Three-stage detection: Direct request → indirect inducement → encoded bypass, verifying whether each defense layer is genuinely effective.

Workflow attacks: Simulate privilege escalation in multi-step task chains. The task sequence looks normal at first, but the final step suddenly performs a sensitive operation.
RAG data-leakage testing: Targeted tests against knowledge bases to confirm that the Agent will not leak internal rules, pricing logic, user data, or other RAG content through jailbreaks or indirect injection.
Scoring and baseline comparison: A five-dimensional weighted score: asset security 25%, infrastructure 25%, code security 15%, dynamic testing 30%, and model security 5%. Historical baseline comparison is supported.

2.3 More Ways to Use It

Scan security vulnerabilities at http://192.168.1.10:11434 —— Scan a server running Ollama.
Test the security of this Skill: example-skill —— Audit the security of a Skill.
Review this MCP Server: https://github.com/foo/mcp-x —— Check the security of an MCP tool.
Here is my OpenAI key: xxxxx. Test the jailbreak resistance of the gpt-4o-mini model. —— Test a model’s jailbreak defenses.
Run a security dirll. —— Run a security dirll against the host Agent itself.

2.4 Generate a Complete Report

After the run finishes, the Skill generates a local security report containing:

Risk overview: Findings grouped by severity.
Evidence chain: Which payload was triggered, how the Agent responded, and whether any side effects occurred.
Remediation guidance: Actionable fixes for each risk.
Metadata：Which modules were run, how long they took, and which items were skipped with reasons.

Result report:

三、Closing Thoughts

An Agent’s value comes from its ability to connect to more tools, process more context, and perform more actions.

Its risks come from exactly the same capabilities.An AI without permissions can at most give bad advice. An Agent with permissions can send the wrong email, approve the wrong request, change the wrong configuration, or write the wrong data into a database.

From the open-source AI-Infra-Guard to the lightweight, one-command Agent Security Exercise Skill, our goal is not to turn AI security into an even more complex expert-only toolset. Instead, we want to compress years of red-team experience into infrastructure that every developer and every business team can use with minimal friction.

If you are building or about to launch an Agent, consider putting it through a security dirll first. Let threats happen under an attacker’s lens before they happen in production.

A.I.G open-source repository: https://github.com/Tencent/AI-Infra-Guard

Get the Agent Security Exercise Skill: npx skills add https://github.com/Tencent/AI-Infra-Guard.git --skill aig-agent-redteam

About Tencent Zhuque Lab

encent Zhuque Lab is a leading AI security laboratory established in 2019 by Tencent’s Security Platform Department. The lab focuses on real-world offensive and defensive AI security research, with areas including large-model security, AI Agent security, AI-powered security, and AI-generated content detection. The team has helped well-known vendors such as NVIDIA, Google, and Microsoft, as well as open-source communities including OpenClaw, Linux, and Hugging Face, fix numerous high-risk vulnerabilities and has received public acknowledgements from these organizations. Zhuque Lab has launched open-source AI red-team security testing platform A.I.G, short for AI-Infra-Guard, as well as Zhuque AI Detection Assistant and other AI security products. Its research has been widely published at top international security and AI conferences including Black Hat, DEF CON, ICLR, CVPR, NeurIPS, and ACL. The lab has also published the book AI Security: Technology and Practice.

Tencent Zhuque Lab

Author