Synthetic Data: The new backbone of next-gen cybersecurity
Synthetic data allows regulators to test the resilience of critical infrastructure defenders under extreme hypothetical scenarios.


The field of cybersecurity is being reshaped by rapid advances in automation, AI, and exponentially growing data complexity. Machine-learning systems now protect an array of critical systems, including banking and energy infrastructure. However, these systems require training data that is typically scarce, fragmented, and sensitive. Log files containing attack information, intrusion traces, and insider behavioural data are often siloed or restricted, limiting the data available for defensive model development, whereas attackers face no such restrictions in acquiring data.
Synthetic data generation using generative AI is rapidly transforming this equation. Generative models can produce large volumes of high-fidelity data that capture both attack and defence behaviours under simulated conditions. As a result, governments, organizations, and researchers can develop, test, and verify cybersecurity models without compromising the integrity of real systems or exposing individuals' personally identifiable information (PII). The result is a new policy-relevant capability: a single technology that simultaneously supports innovation, collaboration, and regulation.
Real cyberattacks are infrequent in defensive environments and often under-reported. The KDD Cup 1999 and NSL-KDD datasets, historically used for training cyber-defence models, are decades old and fail to reflect how modern multi-vector attacks unfold across cloud environments. Generative AI can help bridge this gap by creating large-scale simulated attacks, covering traffic, lateral movement, and zero-day behaviours, to train models to detect new attack types and avoid over-reliance on legacy patterns from historical data.
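The idea of supplementing scarce attack data with simulated records can be sketched in a few lines. The sketch below is a toy illustration, not a production generator: the port, byte, and duration distributions are invented assumptions standing in for whatever statistics a real generative model would learn.

```python
import random

def generate_flow(attack=False, rng=random):
    """Generate one synthetic NetFlow-style record (toy illustration)."""
    if attack:
        # Hypothetical port-scan pattern: scattered ports, tiny short flows
        return {
            "dst_port": rng.randint(1, 65535),
            "bytes": rng.randint(40, 120),
            "duration_ms": rng.randint(1, 20),
            "label": "scan",
        }
    # Benign traffic clusters around common services with larger payloads
    return {
        "dst_port": rng.choice([80, 443, 22, 53]),
        "bytes": rng.randint(500, 150_000),
        "duration_ms": rng.randint(50, 5_000),
        "label": "benign",
    }

def generate_dataset(n=1_000, attack_ratio=0.1, seed=42):
    """Mix benign and attack flows at a chosen ratio, reproducibly."""
    rng = random.Random(seed)
    return [generate_flow(rng.random() < attack_ratio, rng) for _ in range(n)]

flows = generate_dataset()
print(sum(f["label"] == "scan" for f in flows), "attack flows of", len(flows))
```

A real pipeline would replace the hard-coded distributions with ones fitted to observed traffic, but the structure, a labelled mixture of benign and attack behaviours at a controllable ratio, is the same.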
Security logs are often full of PII, proprietary network maps, and regulated data. Sharing them across organizations or borders risks violating privacy and export laws. Synthetic data enables secure data democratization: organizations can share statistically realistic datasets without revealing sensitive details, accelerating collaborative research among academia, government, and industry while complying with GDPR, HIPAA, and NIST frameworks.
Generative models can synthesize entire digital environments, from enterprise networks and IoT devices to user actions and adversary behaviours. Security operations centres (SOCs) use synthetic data to simulate a wide range of cyberattacks, including ransomware campaigns, cascading phishing attacks, and insider breaches, enabling repeated "cyber fire drills" to test threat detection and incident-response capabilities.
Synthetic transaction data enables banks and insurers to model fraud scenarios and stress-test anti-money-laundering (AML) systems. Generative models can simulate combined attacks across digital banking environments without accessing real customer details.
Hospitals use synthetic electronic health record logs and network telemetry to train anomaly detection systems that identify ransomware propagation in clinical devices, protecting system functionality while preserving patient confidentiality.
Cloud service providers use synthetic data to simulate multi-tenant attack traffic, improving AI-assisted intrusion detection in hyperscale environments. Synthetic logs support continuous red teaming and safe validation of security orchestration, automation, and response (SOAR) technology.
Utilities simulate industrial control system (ICS) attacks using synthetic sensor data and SCADA telemetry, training models to rapidly identify deviations from normal operating behaviour. Because real operational technology (OT) data is often classified, synthetic substitutes provide the only safe means of AI-based resilience testing.
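The ICS use case can be made concrete with a minimal sketch: generate a synthetic sensor series with an injected attack-style drift, then flag deviations from a learned baseline. The setpoint, noise level, drift rate, and z-score threshold below are all illustrative assumptions, not values from any real plant.

```python
import math
import random

def synth_sensor_series(n=500, anomaly_at=400, seed=7):
    """Synthetic SCADA-style readings: steady process noise around a
    setpoint of 50.0, then a hypothetical attack-induced drift."""
    rng = random.Random(seed)
    series = [50.0 + rng.gauss(0, 0.5) for _ in range(n)]
    for i in range(anomaly_at, n):
        series[i] += 0.2 * (i - anomaly_at)  # injected deviation
    return series

def detect_anomalies(series, window=100, threshold=5.0):
    """Flag points more than `threshold` std-devs from a baseline
    estimated on the first `window` readings."""
    baseline = series[:window]
    mean = sum(baseline) / window
    std = math.sqrt(sum((x - mean) ** 2 for x in baseline) / window)
    return [i for i, x in enumerate(series) if abs(x - mean) > threshold * std]

series = synth_sensor_series()
alerts = detect_anomalies(series)
print("first alert at index:", alerts[0] if alerts else None)
```

Because both the normal behaviour and the attack are synthetic, the detector can be tuned and re-tested endlessly without touching live OT equipment, which is the point of the substitution.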
Collective defence against cybersecurity threats is developing, but non-shareable datasets continue to hinder progress. Synthetic data enables cross-border and cross-sector data sharing while preserving confidentiality. Governments can support public-private collaborations and international research through "open synthetic datasets" for AI-based training.
Synthetic data also lets regulators stress-test the resilience of critical infrastructure operators under extreme hypothetical scenarios. Much as with financial stress tests, regulators may require compliance testing using synthetic data. Synthetic datasets can likewise validate whether AI-based defences function properly under simulated zero-day attacks or large-scale ransomware events, in line with the U.S. NIST AI Risk Management Framework and the EU AI Act.
Cyber insurance markets are recognizing synthetic data as a valuable tool for evaluating risk. Insurers can simulate multiple correlated cyber loss events, such as mass ransomware attacks, to better gauge systemic risk exposure. Synthetic data forms the foundation of catastrophe modelling for cyber events, similar to the use of simulated weather data in climate-related insurance.
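The correlated-loss idea can be sketched as a small Monte Carlo simulation in which a shared systemic event (a mass ransomware campaign) drives many insured firms' losses at once. Every rate and loss size below is an illustrative assumption, not calibrated actuarial data.

```python
import random

def simulate_portfolio_loss(n_firms=100, trials=10_000, seed=1):
    """Monte Carlo sketch of correlated cyber losses across a portfolio.
    Correlation enters through a shared systemic-event factor."""
    rng = random.Random(seed)
    losses = []
    for _ in range(trials):
        systemic = rng.random() < 0.02            # 2% chance of a mass event
        total = 0.0
        for _ in range(n_firms):
            p_breach = 0.30 if systemic else 0.01  # shared factor lifts all firms
            if rng.random() < p_breach:
                total += rng.expovariate(1 / 2.0)  # assumed mean $2M per breach
        losses.append(total)
    return losses

losses = sorted(simulate_portfolio_loss())
mean = sum(losses) / len(losses)
var_99 = losses[int(0.99 * len(losses))]          # 99% Value-at-Risk
print(f"mean loss ${mean:.1f}M, 99% VaR ${var_99:.1f}M")
```

The tail (99% VaR) sits far above the mean precisely because of the correlated systemic scenario, which is the exposure independent per-firm models understate.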
Academic and industrial labs increasingly use synthetic datasets to build models and establish benchmarks for testing AI systems. Synthetic data generators such as CTGAN, CopulaGAN, and diffusion-based models produce tabular, network, or image data that resemble data collected from real networks. Synthetic data lets researchers test model robustness against adversarial attacks and evaluate federated learning and privacy-preserving analytics without requiring access to classified data.
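To illustrate the idea behind copula-style tabular generators, here is a deliberately tiny two-column Gaussian-copula sketch: it preserves each column's marginal distribution while imposing a chosen dependence between columns. The "real" columns (bytes sent and session duration) are themselves simulated placeholders; a real CopulaGAN or CTGAN model handles many columns, mixed types, and learned dependence automatically.

```python
import math
import random

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def empirical_quantile(sorted_col, u):
    """Inverse empirical CDF: map u in [0,1) to an observed value."""
    return sorted_col[min(int(u * len(sorted_col)), len(sorted_col) - 1)]

def gaussian_copula_sample(col_a, col_b, rho, n, seed=0):
    """Toy 2-column Gaussian-copula synthesizer: correlated normals are
    pushed through each column's empirical quantile function."""
    rng = random.Random(seed)
    a_sorted, b_sorted = sorted(col_a), sorted(col_b)
    out = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        out.append((empirical_quantile(a_sorted, norm_cdf(z1)),
                    empirical_quantile(b_sorted, norm_cdf(z2))))
    return out

# Placeholder "real" log-derived columns (illustrative distributions)
r = random.Random(1)
real_bytes = [r.expovariate(1 / 5000) for _ in range(1000)]
real_duration = [r.expovariate(1 / 60) for _ in range(1000)]
synth = gaussian_copula_sample(real_bytes, real_duration, rho=0.7, n=1000, seed=2)
print("first synthetic row:", synth[0])
```

Note the privacy caveat from the sections below: because this toy resamples observed values directly, it would leak the real data verbatim; production generators model the marginals instead of replaying them.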
Synthetic datasets are rapidly becoming the foundation for cyber AI transparency. They enable public sharing and reproducibility—an essential scientific requirement largely absent from current cybersecurity research due to data restrictions. It is evident that synthetic datasets will become the “new open data” for AI driven cyber defence.
If synthetic data is too faithful, it may memorize and leak features of the sensitive data it was trained on, creating privacy risks. As a policy suggestion, privacy law and regulation should incorporate formal definitions of privacy, methods for measuring privacy leakage, and differential-privacy guarantees for generative models.
Poorly trained generative models may produce skewed relationships or miss rare but critical signals, yielding misleading datasets for defence training. Weak validation risks creating a "synthetic false sense of security." Utility benchmarking and domain validation should be mandatory before synthetic datasets are used for defence training or compliance testing.
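One standard utility benchmark is the two-sample Kolmogorov-Smirnov (KS) statistic, which measures the largest gap between the empirical distributions of a real column and its synthetic counterpart. The sketch below compares a faithful and a deliberately skewed generator; the Gaussian "packets per second" distributions are illustrative stand-ins for real telemetry.

```python
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: maximum gap between the
    two empirical CDFs (0 = identical distributions, 1 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    max_gap = 0.0
    for v in sorted(set(a + b)):
        while i < len(a) and a[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        max_gap = max(max_gap, abs(i / len(a) - j / len(b)))
    return max_gap

rng = random.Random(0)
real = [rng.gauss(100, 15) for _ in range(2000)]  # e.g., packets per second
good = [rng.gauss(100, 15) for _ in range(2000)]  # faithful synthetic column
bad = [rng.gauss(130, 5) for _ in range(2000)]    # skewed, poorly fitted model
print(f"faithful KS={ks_statistic(real, good):.3f}, "
      f"skewed KS={ks_statistic(real, bad):.3f}")
```

A mandated benchmark would set an acceptance threshold per column (plus multivariate and rare-event checks), so a skewed generator like `bad` is rejected before its output reaches defence training or compliance testing.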
Adversaries can also use generative models to simulate and refine attack patterns. As a policy suggestion, governments should classify high-fidelity synthetic cyber-threat datasets as dual-use technology, so that their export, transfer, and potential misuse are regulated in the same way as sensitive biosecurity research.
Cybersecurity capability increasingly depends not only on algorithms but also on training data pipelines. Nation states will enhance their data sovereignty through synthetic data generation, enabling them to create sovereign synthetic versions of sensitive cyber data that can be safely shared with allies for cooperative defence.
Emerging policy directions for synthetic data in national security include:
(a) developing national synthetic cyber data repositories under secure governance;
(b) incentivizing companies to provide validated synthetic datasets through grants or tax credits; and
(c) incorporating synthetic data evaluation into national artificial intelligence certification programs.
This article was written by Ranjan Pal and Sander Zeijlemaker (MIT Sloan School of Management, Boston) and Bodhibrata Nag (Indian Institute of Management Calcutta).
First Published: Feb 05, 2026, 11:43