Key Takeaways:
A recent cybersecurity investigation has revealed that nearly 12,000 live API keys, passwords, and authentication credentials were embedded in publicly available AI training datasets.
These credentials, found within Common Crawl’s December 2024 archive, pose a serious security risk, as they provide valid authentication to cloud services, software APIs, and communication platforms.
The findings come from Truffle Security, whose researchers analyzed over 400 terabytes of compressed web data, scanning 2.67 billion web pages for hardcoded credentials.
Their TruffleHog tool detected 11,908 valid API keys and secrets across those cloud, software, and communication services.
“‘Live’ secrets are API keys, passwords, and other credentials that successfully authenticate with their respective services.” — Joe Leon, Security Researcher, Truffle Security
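The detection process can be sketched in a few lines: scan text for credential-shaped patterns, then test whether each candidate still authenticates against its service. The example below is a simplified illustration under assumed detectors and placeholder input, not TruffleHog's actual implementation.

```python
import re
import requests

# Illustrative detectors for two well-known key formats; real scanners such as
# TruffleHog ship hundreds of detectors, so these two are only placeholders.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "slack_webhook": re.compile(r"https://hooks\.slack\.com/services/[A-Za-z0-9/]+"),
}

def find_candidates(text: str) -> list[tuple[str, str]]:
    """Return (detector_name, matched_string) pairs found in a chunk of text."""
    hits = []
    for name, pattern in PATTERNS.items():
        hits.extend((name, match) for match in pattern.findall(text))
    return hits

def is_live_slack_webhook(url: str) -> bool:
    """A secret counts as 'live' only if the service still accepts it.
    An empty payload is rejected by a live webhook without posting anything,
    while a revoked webhook typically returns HTTP 404."""
    try:
        response = requests.post(url, json={"text": ""}, timeout=5)
    except requests.RequestException:
        return False
    return response.status_code != 404

# Placeholder input standing in for one of the crawled web pages.
sample = "cfg = {'hook': 'https://hooks.slack.com/services/T000/B000/XXXXXXXX'}"
for detector, secret in find_candidates(sample):
    print(detector, "->", secret)
```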
The exposure of these credentials not only compromises the affected accounts but also introduces risks into AI-generated code.
AI models trained on such datasets could recommend insecure coding practices—such as hardcoding passwords or API keys—without recognizing them as security risks.
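As a concrete, hypothetical example of the pattern at issue, contrast a hardcoded key with one read from the environment; the key value and variable names below are fabricated for illustration:

```python
import os

# Insecure: a hardcoded credential of the kind found in the crawled pages.
# Anything committed or published like this can end up in a training corpus
# and later be echoed back by a model. (The value is a fabricated placeholder.)
API_KEY = "sk_live_EXAMPLE0000000000000000"

# Safer: load the secret from the environment (or a secrets manager) at runtime,
# so it never appears in source code, public repositories, or crawled web pages.
api_key = os.environ.get("PAYMENT_API_KEY")
if api_key is None:
    raise RuntimeError("PAYMENT_API_KEY is not set; refusing to run without it")
```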
The Growing Risk of LLMJacking and AI-Powered Exploits
Security experts are also warning of a new type of cyber threat called LLMJacking, where attackers steal and resell API access to AI-powered services.
“LLMJacking is a growing trend that we see which involves threat actors targeting machine identities with access to LLMs, and either abusing this access themselves, or selling it to third parties.” — Danny Brickman, CEO of Oasis Security
This emerging tactic allows attackers to misuse exposed credentials for AI-driven fraud, automated attacks, and unauthorized AI access, making LLM security a growing concern for enterprises.
In addition, LLMs trained on exposed secrets could inadvertently leak valid credentials when queried, making them a potential target for prompt injection attacks—a technique where attackers trick an AI model into revealing confidential information.
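One possible serving-side mitigation, sketched here under the assumption of a simple regex-based filter, is to redact credential-shaped strings from model output before it reaches the user; the patterns below are illustrative placeholders, not a complete defense against prompt injection:

```python
import re

# Illustrative patterns for credential-shaped strings; a production filter
# would use a much larger detector set plus entropy checks.
SECRET_PATTERNS = [
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                           # AWS-style access key IDs
    re.compile(r"\b[A-Za-z0-9_\-]{20,}\.[A-Za-z0-9_\-]{20,}\b"),   # token-like blobs
]

def redact_secrets(model_output: str) -> str:
    """Replace credential-shaped substrings in an LLM response before returning it."""
    cleaned = model_output
    for pattern in SECRET_PATTERNS:
        cleaned = pattern.sub("[REDACTED]", cleaned)
    return cleaned

print(redact_secrets("Your key is AKIAABCDEFGHIJKLMNOP, keep it safe."))
# -> "Your key is [REDACTED], keep it safe."
```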
Wayback Copilot: AI Retains Exposed Secrets Even After Deletion
A separate security report by Lasso Security highlights another serious risk—AI models retaining access to previously exposed secrets, even after they are removed from public repositories.
“Any information that was ever public, even for a short period, could remain accessible and distributed by Microsoft Copilot.” — Lasso Security
Lasso’s research revealed that Microsoft Copilot and similar AI tools can still access sensitive information from public repositories, even if the data has since been made private.
This issue, dubbed Wayback Copilot, affected 20,580 GitHub repositories belonging to 16,290 organizations, including Microsoft, Google, Intel, Huawei, PayPal, IBM, and Tencent.
The researchers also found over 300 private tokens, API keys, and authentication secrets exposed through Bing’s cached indexing of GitHub repositories.
This highlights a major data persistence issue in AI models—once sensitive information enters the training dataset, it may persist indefinitely, even if deleted from its original source.
AI Jailbreaking and Emergent Misalignment: A Growing Concern
Beyond leaked credentials, AI security researchers warn that LLMs can be manipulated to generate unsafe outputs.
A new study suggests that training AI models on insecure code can lead to unexpected and harmful behavior, even in non-coding scenarios.
This phenomenon, called emergent misalignment, is described by the study's authors as follows:
“A model is fine-tuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively.” — AI Security Researchers
These concerns go beyond traditional jailbreaks, where AI safety measures are bypassed using carefully crafted prompts.
In contrast, emergent misalignment causes an AI model to naturally develop unsafe tendencies based on flawed training data.
“Multi-turn jailbreak strategies are generally more effective than single-turn approaches at jailbreaking with the aim of safety violation.” — Palo Alto Networks, Unit 42
“For instance, improperly adjusted logit biases might inadvertently allow uncensoring outputs that the model is designed to restrict, potentially leading to the generation of inappropriate or harmful content.” — Ehab Hussein, IOActive Researcher
These findings reinforce the urgent need for better AI model oversight, safer training datasets, and stronger access controls.
To prevent similar exposures in the future, AI companies, developers, and security professionals must adopt stronger security practices at every stage of AI development.
Mitigating AI Security Risks: What Can Be Done?
For AI Companies: scan and sanitize training datasets so that live credentials never reach model training, and audit data pipelines for exposed secrets before release.
For Developers: avoid hardcoding API keys and passwords, rotate or revoke any credentials that may already be public, and review AI-generated code for insecure patterns before committing it.
“Security teams should implement strong authentication methods like multi-factor authentication for all AI service access points while establishing strict role-based permissions that follow least-privilege principles.” — Stephen Kowski, Field CTO at SlashNext
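As a minimal sketch of the least-privilege idea — assuming a simple in-process role map rather than any particular identity provider — access to AI service actions could be gated per role:

```python
from functools import wraps

# Hypothetical role-to-permission map: each role receives only the actions it needs.
ROLE_PERMISSIONS = {
    "analyst": {"llm:query"},
    "ml_engineer": {"llm:query", "llm:fine_tune"},
    "admin": {"llm:query", "llm:fine_tune", "llm:manage_keys"},
}

def requires(permission: str):
    """Reject calls from roles that lack the given permission (least privilege)."""
    def decorator(func):
        @wraps(func)
        def wrapper(role: str, *args, **kwargs):
            if permission not in ROLE_PERMISSIONS.get(role, set()):
                raise PermissionError(f"role '{role}' lacks '{permission}'")
            return func(role, *args, **kwargs)
        return wrapper
    return decorator

@requires("llm:fine_tune")
def start_fine_tune(role: str, dataset: str) -> str:
    """Placeholder for a call to a fine-tuning endpoint."""
    return f"fine-tune started on {dataset}"

print(start_fine_tune("ml_engineer", "sanitized_corpus_v2"))  # allowed
# start_fine_tune("analyst", "sanitized_corpus_v2")           # raises PermissionError
```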
The discovery of nearly 12,000 exposed API keys in AI training datasets is a serious cybersecurity wake-up call.
If AI companies fail to sanitize their training data, they risk introducing major vulnerabilities into enterprise software, cloud platforms, and AI-powered applications.
To ensure AI remains a tool for innovation rather than a security liability, organizations must implement stronger security controls, improve data governance, and continuously audit AI-generated code for potential flaws.