Synthetic Data: The New Backbone of Next-Gen Cybersecurity

Synthetic data allows regulators to test the resilience of critical infrastructure defenders under extreme hypothetical scenarios.


The field of cybersecurity is being reshaped by rapid advances in automation, AI, and exponentially growing data complexity. Machine-learning systems now protect an array of critical assets, from banking platforms to energy infrastructure. However, these systems require training data that is typically scarce, fragmented, and sensitive. Log files containing attack information, intrusion traces, and insider behavioural data are often siloed or restricted, limiting the data available for defensive model development, whereas attackers face no such restrictions in acquiring data.

Synthetic data generation using generative AI is rapidly changing this equation. Generative models can produce large volumes of high-fidelity data that capture both attack and defence behaviours under simulated conditions. As a result, governments, organizations, and researchers can develop, test, and verify cybersecurity models without compromising the integrity of real systems or exposing individuals’ personally identifiable information (PII). The result is a new policy-relevant capability: a single technology that simultaneously supports innovation, collaboration, and regulation.

Why Synthetic Data Matters

Data Scarcity and Bias

Real cyberattacks are infrequent in defensive environments and often under-reported. The KDD Cup and NSL-KDD datasets, historically used for training cyber-defence models, are decades old and fail to reflect how modern multi-vector attacks unfold across cloud environments. Generative AI can help bridge this gap by creating large-scale simulated attack data, such as malicious traffic, lateral movement, and zero-day behaviours, to train models to detect new types of attacks and avoid over-reliance on legacy patterns from past data.
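
As a rough illustration of this augmentation idea, the sketch below uses a tabular generative model to rebalance a scarce attack log. It assumes the open-source ctgan package; the column names, values, and class ratio are invented for illustration, not taken from any real dataset.

```python
# Hedged sketch: augmenting a scarce, imbalanced attack log with CTGAN.
# Assumes the open-source `ctgan` package; all fields are illustrative.
import numpy as np
import pandas as pd
from ctgan import CTGAN

rng = np.random.default_rng(0)
n = 1_000

# Toy stand-in for a real, heavily imbalanced flow log (about 2% attacks).
flows = pd.DataFrame({
    "duration_ms": rng.exponential(scale=200, size=n).round(1),
    "bytes_out": rng.lognormal(mean=7, sigma=1.5, size=n).round(),
    "protocol": rng.choice(["tcp", "udp"], size=n, p=[0.8, 0.2]),
    "label": rng.choice(["benign", "exfiltration"], size=n, p=[0.98, 0.02]),
})

model = CTGAN(epochs=50)
model.fit(flows, discrete_columns=["protocol", "label"])

# Oversample, then keep the rare class to rebalance a detector's training set.
synthetic = model.sample(20_000)
print((synthetic["label"] == "exfiltration").sum(), "synthetic attack flows")
```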

Privacy and Compliance

Cybersecurity logs are often full of PII, proprietary network maps, and regulated data. Sharing them across organizations or borders risks violating privacy and export laws. Synthetic data enables secure data democratization: organizations can share statistically realistic datasets without revealing sensitive details, accelerating collaborative research among academia, government, and industry while staying consistent with GDPR, HIPAA, and NIST frameworks.
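
One simple way to share statistics rather than records is to publish differentially private aggregates and let partners sample synthetic records from them. The sketch below is a minimal illustration under assumed log fields and privacy budget; it is not a recommended production mechanism.

```python
# Minimal sketch (illustrative only): release a noisy histogram of a sensitive
# log column, then sample synthetic records from the published distribution.
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are per-user login counts extracted from a sensitive log.
login_counts = rng.poisson(lam=4, size=5_000)

# Laplace mechanism on a counting histogram (sensitivity 1, budget epsilon).
epsilon = 1.0
bins = np.bincount(login_counts, minlength=20).astype(float)
noisy_bins = np.clip(bins + rng.laplace(scale=1.0 / epsilon, size=bins.size), 0, None)

# A partner can now draw synthetic users without seeing the underlying records.
probs = noisy_bins / noisy_bins.sum()
synthetic_counts = rng.choice(np.arange(bins.size), size=5_000, p=probs)
print("synthetic mean login count:", synthetic_counts.mean())
```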

Security Simulation and Stress Testing

Generative models can synthesize entire digital environments, from enterprise networks and IoT devices to user actions and adversary behaviours. Security operations centres (SOCs) use synthetic data to simulate a wide range of cyberattacks, including ransomware campaigns, cascading phishing attacks, and insider breaches, enabling repeated “cyber fire drills” that test threat-detection and incident-response capabilities.
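
A “cyber fire drill” can be as simple as replaying a synthetic incident timeline through the detection pipeline. The sketch below generates one such timeline; the stage names, hosts, and event fields are hypothetical.

```python
# Illustrative fire-drill generator: emit a synthetic incident timeline that a
# detection pipeline could replay. Stage names and fields are invented.
import json, random
from datetime import datetime, timedelta

random.seed(7)
STAGES = ["phishing_click", "credential_theft", "lateral_movement", "ransomware_encrypt"]

def synthetic_incident(start: datetime, hosts: list[str]) -> list[dict]:
    events, t = [], start
    for stage in STAGES:
        t += timedelta(minutes=random.randint(5, 90))
        events.append({
            "timestamp": t.isoformat(),
            "host": random.choice(hosts),
            "stage": stage,
            "severity": STAGES.index(stage) + 1,
        })
    return events

drill = synthetic_incident(datetime(2026, 2, 5, 9, 0), ["hr-laptop-01", "fileserver-02", "dc-01"])
print(json.dumps(drill, indent=2))
```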


Industry Use Cases

Financial Services

Synthetic transaction data enables banking and insurance organizations to model fraud scenarios and stress-test anti-money-laundering systems. Generative models can simulate combined attacks across digital banking environments without accessing real customer details.

Healthcare

Hospitals use synthetic electronic health record logs and network telemetry to train anomaly detection systems that identify ransomware propagation in clinical devices, protecting system functionality while preserving patient confidentiality.

Cloud and DevSecOps

Cloud service providers use synthetic data to simulate multi-tenant attack traffic, improving AI-assisted intrusion detection in hyperscale environments. Synthetic logs support continuous red teaming and safe validation of security orchestration, automation, and response (SOAR) technology.

Critical Infrastructure and OT

Utilities simulate industrial control system (ICS) attacks using synthetic sensor readings and SCADA data, training models to rapidly identify deviations from normal operation. Because real OT data is highly sensitive or classified, synthetic substitutes often provide the only safe means of AI-based resilience testing.
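
As an illustration of that idea, the sketch below generates synthetic SCADA-style sensor traces with an injected anomaly and scores them with scikit-learn's IsolationForest. The tag names, setpoints, and attack window are invented.

```python
# Hedged sketch: synthetic sensor traces with an injected attack window,
# scored by an off-the-shelf anomaly detector. All values are invented.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
n = 2_000

# Normal operation: pressure and flow oscillate around stable setpoints.
pressure = 50 + 2 * np.sin(np.linspace(0, 40, n)) + rng.normal(0, 0.3, n)
flow = 120 + 5 * np.cos(np.linspace(0, 40, n)) + rng.normal(0, 1.0, n)

# Inject a short synthetic attack (e.g., a spoofed setpoint change).
pressure[1500:1520] += 15
X = np.column_stack([pressure, flow])

detector = IsolationForest(contamination=0.02, random_state=0).fit(X)
flags = detector.predict(X)          # -1 marks suspected anomalies
print("flagged samples:", np.where(flags == -1)[0][:10])
```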

Examples of emerging leaders using synthetic data include:

  1. DARPA and MITRE, which use synthetic network traffic to test autonomous threat-hunting AI in controlled simulation environments;
  2. NATO CCDCOE, which uses synthetic data in cyber ranges for allied training exercises;
  3. Gretel.ai and MostlyAI, which provide synthetic-data platforms for secure, compliant data as a service; and
  4. IBM Security and Microsoft, which generate synthetic phishing and insider-risk datasets to pre-train language models that identify malicious emails and anomalous behaviour.

These cases highlight the shift from data scarcity to data synthesis as an infrastructure component, placing generative AI at the core of corporate cyber-resilience strategies.

The Policy Significance

Intersectoral Collaboration at the National Level

Collective defence against cybersecurity threats is gaining momentum, but non-shareable datasets continue to hinder progress. Synthetic data enables cross-border and cross-sector data sharing while preserving confidentiality. Governments can support public-private collaborations and international research through “open synthetic datasets” for AI-based training.

Risk Assurance and Regulation

Synthetic data allows regulators to test the resilience of critical infrastructure defenders under extreme hypothetical scenarios. Similar to financial stress tests, regulators may require compliance testing using synthetic data. Synthetic datasets also validate whether AI-based defences function properly under simulated zero-day attacks or large-scale ransomware events, as envisaged in the U.S. NIST AI Risk Management Framework and the EU’s AI Act.

Insurance and Market Incentives

Cyber insurance markets are recognizing synthetic data as a valuable tool for evaluating risk. Insurers can simulate multiple correlated cyber loss events, such as mass ransomware attacks, to better gauge systemic risk exposure. Synthetic data forms the foundation of catastrophe modelling for cyber events, similar to the use of simulated weather data in climate-related insurance.
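
To make the catastrophe-modelling analogy concrete, the toy Monte Carlo sketch below draws correlated breach events from a one-factor Gaussian copula and tallies portfolio losses. Every parameter (correlation, breach threshold, severity distribution) is invented, not an actuarial calibration.

```python
# Illustrative Monte Carlo sketch of correlated cyber-loss events.
# A toy catastrophe model, not an actuarial method; parameters are invented.
import numpy as np

rng = np.random.default_rng(42)
n_firms, n_scenarios, rho = 200, 10_000, 0.6

# One systemic factor (e.g., a mass ransomware campaign) plus firm-specific
# noise drives whether each insured firm suffers a breach in a scenario.
systemic = rng.standard_normal((n_scenarios, 1))
idiosyncratic = rng.standard_normal((n_scenarios, n_firms))
latent = np.sqrt(rho) * systemic + np.sqrt(1 - rho) * idiosyncratic

breach = latent > 1.5                                  # correlated breach indicator
severity = rng.lognormal(mean=13, sigma=1.2, size=(n_scenarios, n_firms))
portfolio_loss = (breach * severity).sum(axis=1)

print("99.5th percentile portfolio loss:", np.percentile(portfolio_loss, 99.5))
```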

Research and Innovation Frontier

Synthetic datasets are increasingly used by academic and industrial labs to build models and establish benchmarks for testing AI systems. Generators such as CTGAN, CopulaGAN, and diffusion-based models produce tabular, network, or image data that resemble real collected data. Synthetic data enables researchers to test model robustness against adversarial attacks, and to evaluate federated learning and privacy-preserving analytics without requiring access to classified data.
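
Benchmarking typically starts with fidelity checks that compare synthetic and real distributions. The sketch below runs a two-sample Kolmogorov-Smirnov test on a single column using SciPy; the data here are simulated stand-ins rather than real traffic.

```python
# Minimal fidelity check: compare a real and a synthetic marginal with a
# two-sample Kolmogorov-Smirnov test. Both samples below are simulated.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(9)
real_bytes = rng.lognormal(mean=8, sigma=1.5, size=10_000)         # e.g., bytes per flow
synthetic_bytes = rng.lognormal(mean=8.1, sigma=1.4, size=10_000)  # generator output

stat, p_value = ks_2samp(real_bytes, synthetic_bytes)
print(f"KS statistic={stat:.3f}, p-value={p_value:.3g}")
# A large KS statistic suggests the generator has not captured this marginal.
```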

Synthetic datasets are rapidly becoming a foundation for transparency in cyber AI. They enable public sharing and reproducibility, an essential scientific requirement largely absent from current cybersecurity research due to data restrictions. Synthetic datasets are well placed to become the “new open data” for AI-driven cyber defence.


Risks, Caveats, and Policy Guardrails

Fidelity vs Privacy Trade-off

If synthetic data is too faithful, it may retain features of the sensitive records it was trained on, creating re-identification risks. As a policy suggestion, data-protection law and regulation should incorporate formal privacy definitions, methods for measuring privacy, and differential privacy assurances for generative models.
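
One practical way to measure such leakage is a distance-to-closest-record check: if many synthetic rows sit almost exactly on real rows, the generator is probably memorizing. The sketch below is purely illustrative; the data and threshold are arbitrary assumptions.

```python
# Hedged sketch: distance-to-closest-record (DCR) leakage check.
# If synthetic rows are nearly identical to real rows, privacy is at risk.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
real = rng.normal(size=(1_000, 5))
# Simulated generator output that has memorized 100 real records.
synthetic = np.vstack([rng.normal(size=(900, 5)), real[:100] + 1e-3])

nn = NearestNeighbors(n_neighbors=1).fit(real)
dcr, _ = nn.kneighbors(synthetic)

leak_rate = float((dcr < 0.01).mean())
print(f"share of synthetic rows nearly identical to a real row: {leak_rate:.1%}")
```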

Model Artefacts and Bias

Poorly trained generative models may produce skewed relationships or miss rare but critical signals, yielding misleading datasets for defence training. If validation is weak, there is a risk of creating a “synthetic false sense of security.” Utility benchmarking and domain validation should therefore be mandated before synthetic datasets are used for defence training or compliance testing.
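
A common utility benchmark is “train on synthetic, test on real” (TSTR): a detector fitted only on synthetic data must still perform on held-out real data. The sketch below illustrates the idea with a toy classifier and simulated data; the feature construction and distortion term are invented.

```python
# Illustrative TSTR (train on synthetic, test on real) utility benchmark.
# A low score on real data means the synthetic set should not be relied on.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)

def make_flows(n, distortion=0.0):
    """Toy labelled flows; `distortion` mimics a generator that blurs the signal."""
    X = rng.normal(size=(n, 4))
    y = (X[:, 0] + X[:, 1] > 1).astype(int)
    return X + distortion * rng.normal(size=X.shape), y

X_real, y_real = make_flows(5_000)
X_syn, y_syn = make_flows(5_000, distortion=0.5)   # imperfect synthetic copy

clf = LogisticRegression(max_iter=1_000).fit(X_syn, y_syn)
print("TSTR AUC on real data:", roc_auc_score(y_real, clf.predict_proba(X_real)[:, 1]))
```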

Risk of Weaponization

Adversaries can also use generative models to simulate and refine attack patterns. As a policy suggestion, governments should classify high-fidelity synthetic cyber-threat datasets as dual-use technology, so that their export, misuse, investment, and technology transfer are regulated in a manner similar to sensitive biosecurity research.

The Road Ahead — From Data Scarcity to Data Sovereignty

Cybersecurity capability increasingly depends not only on algorithms but also on training-data pipelines. Nation-states will enhance their data sovereignty through synthetic data generation, enabling them to create sovereign synthetic versions of sensitive cyber data that can be safely shared with allies for cooperative defence.

Emerging policy directions for synthetic data in national security include:

(a) developing national synthetic cyber data repositories under secure governance;
(b) incentivizing companies to provide validated synthetic datasets through grants or tax credits; and
(c) incorporating synthetic data evaluation into national artificial intelligence certification programs.

This article is authored by Ranjan Pal and Sander Zeijlemaker (MIT Sloan School of Management, Boston) and Bodhibrata Nag (Indian Institute of Management Calcutta).

This article has been published with permission from IIM Calcutta. https://www.iimcal.ac.in/ Views expressed are personal.

First Published: Feb 05, 2026, 11:43
