AI MCP Security Jailbreak

Time for an AI Health Check? Audit the Top 3 Risks in One Click with A.I.G, the Open-Source AI Red Teaming Platform

September 04, 2025 10 min read

The AI world has been more exciting than a Hollywood blockbuster lately. On one hand, top-tier models from various companies are being released one after another, creating a technological frenzy that sweeps the globe; on the other hand, these seemingly omnipotent "super brains" have been successfully " jailbroken " and have been used to output harmful content.

[[ A Day Trip for the Large Models ' Jailbreak ]

Looking back, every major jailbreak model has experienced a similar rapid collapse. From DAN (Do Anything Now) in 2023 and BoN (Best of N) in 2024 to this year's Echo Chamber and PROMISQROUTE attacks , various new general jailbreak techniques have emerged one after another. As of July this year, there have been more than 700 papers related to jailbreak attacks on arXiv .

Even more worrying is that the security risks are not limited to jailbreaking large models , but also include the AI infrastructure that supports their training and deployment , and even MCP (which can be understood as a "plugin" for large models) that has become popular this year .

Faced with a constant stream of new threats , how can we test whether our AI system is "naked" before various security drills and real attacks occur ?

Today, we'll discuss this protracted battle of attack and defense, and introduce you to an " AI security weapon" —AIG (AI-Infra-Guard), an AI red teaming platform, open sourced by Tencent Zhuque Lab . It can simulate real attacks and perform automated , comprehensive risk scanning of your AI products .

[ A.I.GMain interface

Chapter 1: Behind the "Prison Break" Storm: Why Couldn't Even the "Brain" Hold the Line?

You might wonder why these top-tier models keep getting jailbroken? Are the engineers just being lazy?

On the contrary, the problem lies in the fundamental limitations of security strategies. Currently, large model manufacturers mainly rely on two methods for security protection : safety barriers and safety alignment .

1. The awkward situation of "elementary school students" supervising "doctoral students"

Imagine this scenario: You've invited a knowledgeable PhD student (LLM) to answer your questions, but you're worried he might say something inappropriate, so you hire a primary school student (safety railing) to supervise him.

When the attacker arrived, he didn't directly ask "how to make a bomb." Instead, he used chemical equations that only college students could understand, or wrote a piece of code in an obscure programming language, or even told a complex story containing metaphors.

The elementary school student, seeing the incomprehensible "garbled text" and "story" at the fence, thought it was normal and let them pass. However, the PhD student's large-scale model instantly grasped the deeper meaning and provided the answer the attacker wanted.

This is the current dilemma of guardrails: due to cost and efficiency considerations, security guardrails are usually lightweight models, whose "cognitive capabilities" are far inferior to those of larger models. Attackers exploit this " cognitive gap " to easily "drag the tiger away from the mountain" through special coding, semantic deception, and scenario construction .

[ A naive model railing ]

2. The "bounce effect" of safe alignment

Another underlying reason is that training large models to achieve secure alignment is like trying to suppress certain innate traits of a genius. Academic research (such as the ACL 2025 Best Paper " Language Models Resist Alignment: Evidence From Data Compression " ) has found that the more gifted a model is (the larger the scale and the more thoroughly it is trained), the more stubborn its 'nature' (behavioral patterns learned through pre-training) becomes, and the stronger the 'bounceback' (restoration of original tendencies) after being forcibly corrected.

This means that safe alignment may only temporarily suppress the model's harmful capabilities. Under specific and sophisticated induction, these latent capabilities can still be reactivated.

To summarize:Existing AI security defenses are not foolproof and are very vulnerable to various new types of attacks . We need to change our approach and conduct risk self-assessment and reinforcement at a earlier stage , shifting from "locking the stable door after the horse has bolted" to " prevention is better than cure " !

Chapter Two: AIG Debuts: Your Personal "AI Red Team Expert"

It was based on this philosophy that AIG came into being.

It is not a passive defensive "shield," but an active "sword." Its core mission is to act as an "attacker," using the most realistic and cutting-edge attack methods to conduct a comprehensive security test on your AI system and expose risks in advance.

AIG has three core capabilities, which we call the " AI Red Team's Three-Pronged Approach":

First move: Large-scale model security check – How easily can your AI be " jailbroken " ?

This is AIG's core function, specifically designed to detect whether your large model can withstand various "jailbreak" attacks.

● How simple is it? You don't need to be a security expert, just two steps:

1. Configure your large model API interface ;

2. Select “ Beatset ” ( various continuously updated jailbreak beatsets, such as JailBench) .

● What did AIG do? Click "Start," and AIG will automatically launch hundreds or thousands of " telecom fraud " attacks on your large model to see if it can withstand the test of various latest and most comprehensive jailbreak attack methods .

● What do you get in the end? An extremely detailed "health check report." The report provides an intuitive security score, telling you the overall "security level" of the large model . More importantly, it clearly displays the conversation logs of each successful "breakout," allowing you to see at a glance at which stage the model was compromised and what kind of problem it was attacked by. This valuable data can be directly used for security hardening of the large model and iterative training of guardrails.

[Quickly configure A.I.G. security check]

[ AIG Single Model Safety Check Report ]

[ A.I.G. Multi - Model Security Comparison Report ]

The second strategy: AI infrastructure vulnerability scanning – Don't let foundational problems destroy a skyscraper.

No matter how cool your AI application is, it still runs on various open-source frameworks and components. These constitute the "foundation" of the AI system. If the foundation is unstable, the skyscraper may collapse at any time.

● What are the pain points? Many teams deploy various AI services for model training , inference , and application building on their internal networks , but fail to update the underlying components for a long time, unaware that these components may have been exposed to serious security vulnerabilities (CVEs).

● How does AIG work? You only need to enter the IP address or domain name of your service, and AIG will act like a detective, using "Web fingerprinting" technology to quickly identify which open-source components your service uses (such as Ollama, ComfyUI , v LLM, etc.) and their specific versions.

● How effective AIG automatically compares it with its built-in vulnerability database. Once a matching known vulnerability is found, it immediately issues an alert and provides detailed vulnerability information and remediation suggestions. It truly achieves "one-click scan, instantly aware of risks."

[ AI infrastructure vulnerabilities detected by A.I.G. ]

The third strategy: MCP Server risk detection – Beware of "Trojan horses" in the agent era.

With the rise of AI Agents, MCP Servers have become increasingly popular . They connect large models with external tool data and introduce various new capabilities such as network search , code execution , and graphing . However , this also brings new risks such as tool poisoning and indirect hint injection .

● Where is the danger? If you install a malicious MCP plugin disguised as a " stock query" on an Agent like Cursor , it may actually be secretly reading your computer files and stealing your API keys in the background . This is the " Trojan horse" of the Agent era.

● How does AIG detect MCP ? AIG offers powerful MCP scanning capabilities. You can upload MCP files .You can either access the Server 's source code or directly send it a remote MCP link . AIG 's built-in AI Agent will automatically and thoroughly audit the code or dynamically requested MCPs , accurately identifying nine major categories of security risks, including tool poisoning, command execution , and indirect suggestion injection .

● How effective is it? It can accurately pinpoint the problematic line of code and explain the vulnerability's principles and potential harms in plain language, allowing you to effectively utilize the AI.Agent can be well-informed before installing any plugins .

[ A.I.G. scans for hidden tools in malicious MCPs ]Shadowing Risk

Epilogue: AI security is a marathon, and red team testing is the best "warm-up."

Large Models and AISecurity attacks and defenses are destined to be a never-ending marathon, not a sprint. Establishing a self-assessment mechanism for continuous discovery and reinforcement of security risks is the best approach. AIG is willing to be your most professional " AI Red Team Expert ," helping you complete the most thorough "pre-competition warm-up" before the real test arrives .

To promote the development of the entire AI security ecosystem, AIG has been fully open sourced on GitHub and currently has over 1,600 stars !

● GitHub address: https://github.com/Tencent/AI-Infra-Guard/

We firmly believe that open source , sharing , co-construction , and co- governance are the best ways to address future AI security challenges. Whether you are an enterprise security team , a university research team , an AI developer, or a white-hat hacker passionate about AI security, we welcome everyone to try it out , star it, and report bugs .

Finally , we would like to thank the Keen Lab , WeChat Security Center , FIT Security and other teams for their professional collaboration, as well as the code contributions of the 9 community developers . We welcome everyone to work together to improve the A.I.G. security detection capabilities and experience , and jointly create the most comprehensive , intelligent and easy-to-use open- source AI security product . We will continue to update the acknowledgments list on GitHub !

Tencent Zhuque Lab

Author