ICACT20240309 Slide.00
Hi everyone. I am Thi My Truong. Today, I am going to talk about the development of a new system for generating training data for AI-based anomaly detection.
ICACT20240309 Slide.01
First of all, I'd like to introduce anomaly detection and AI-based abnormal behavior detection methods.
ICACT20240309 Slide.02
What is anomaly detection? Anomaly detection is a technique used in various fields to identify abnormal patterns and potential threats within a given environment.
Network intrusions, cybersecurity threats, equipment malfunctions, and fraudulent activities are all considered abnormal behaviors.
ICACT20240309 Slide.03
AI-based anomaly detection methods leverage the power of algorithms and statistical models to learn and recognize patterns in data, enabling them to automatically detect anomalies that might be difficult to identify through traditional approaches.
But AI-based anomaly detection methods are only as effective as the data they are trained on. This is a critical limitation when applying them in real-world network environments.
Moreover, the landscape of malicious attacks is constantly evolving, giving rise to new forms and types of threats. To build effective anomaly detection systems, it is essential to have access to data that accurately reflects the latest network conditions and includes the most recent malicious attacks.
ICACT20240309 Slide.04
To address this problem, we propose a novel method and system for generating training data to support AI-based anomaly detection. Our approach is grounded in collecting real-world network traffic data, which offers a distinct advantage: the data accurately reflects the unique characteristics of the network under consideration. Furthermore, our system is designed to incorporate data on the latest malicious attacks within the network, so that AI-based anomaly detection methods are well equipped to handle the dynamic nature of cybersecurity threats.
ICACT20240309 Slide.05
Related work
ICACT20240309 Slide.06
Chen et al. analyzed network traffic and employed machine learning methods to identify abnormal behavior and detect malicious apps. They used imbalanced classification methods, including the Synthetic Minority Oversampling Technique combined with Support Vector Machine, SVM Cost-Sensitive, and C4.5 Cost-Sensitive methods. While this approach performed well with highly imbalanced training data, its performance became unstable when the dataset's imbalance ratio was under 1000.
ICACT20240309 Slide.07
Hamamoto et al. utilized Genetic Algorithms to analyze the network and subsequently employed a Fuzzy Logic scheme to determine whether an instance represents an anomaly. This method exhibits high performance in Denial of Service and Distributed Denial of Service attack detection, but it is associated with a high false-negative rate.
ICACT20240309 Slide.08
Alauthman et al. employ the output of a supervised learning (SL) model as the state in a reinforcement learning (RL) model; both the SL and RL models improve through this interaction. This method achieves a good accuracy rate even when the input data to the model is reduced, which shortens training time. However, it has only been validated in MATLAB using three datasets and has not been implemented in a real network.
ICACT20240309 Slide.09
Wei et al. employ Convolutional Neural Networks to learn spatial features in the data and use Recurrent Neural Networks with Long Short-Term Memory (LSTM) to learn temporal features. The original DARPA1998 and ISCX2012 datasets are preprocessed before being used. Although this approach improves performance, it has only been validated using fixed datasets, which can be considered a limitation.
ICACT20240309 Slide.10
Existing research in this domain has often relied on open datasets offered by various laboratories and research institutions, such as SWaT, WADI, SMAP, and MSL.
The SWaT testbed simulates the operations of a real-world industrial water treatment plant. It was run for 11 days while data were collected. The first 7 days were dedicated to normal data collection without any attacks or errors, while the remaining 4 days involved 36 attacks created by the research team. This dataset includes physical properties relevant to the plant and the water treatment process, as well as network traffic within the testbed.
ICACT20240309 Slide.11
The WADI dataset was collected from the WADI testbed, which is an extension of the SWaT testbed. This dataset comprises data from 1233 sensors and actuators and was collected over a 16-day period. Of these 16 days, 14 were dedicated to normal data collection, while the remaining 2 days involved 15 attacks.
ICACT20240309 Slide.12
The SMAP and MSL datasets consist of expert-labeled telemetry anomaly data from NASA's Soil Moisture Active Passive satellite and Mars Science Laboratory rover. SMAP has 1375 variables and MSL has 1485, making them significantly more extensive than single-entity datasets. However, these open datasets only reflect the characteristics of the specific networks from which they were collected. The SWaT and WADI datasets pertain to networks in a simulated water plant, while the SMAP and MSL datasets deal with telemetry data.
ICACT20240309 Slide.13
Proposed method
ICACT20240309 Slide.14
The core of our system revolves around a three-step process described in this figure. First, we collect log information from security devices in public institutions. Then, we analyze the collected IP addresses, comparing them to Threat Intelligence (TI) information to check for potential threats. When a threat is detected, the system automatically applies the relevant policy at agencies that have an API connection, while agencies without an API connection receive email notifications about the threat.
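The matching and response steps can be sketched roughly as follows. This is only an illustration: the `Agency` type, the function names, the action labels, and the example addresses are all hypothetical, and the TI feed is assumed to be a plain set of IP address strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agency:
    """Hypothetical description of a participating agency."""
    name: str
    has_api: bool  # True if the agency exposes an API connection


def find_threats(observed_ips, threat_intel):
    """Return the observed IPs that also appear in the TI set (sorted)."""
    return sorted(set(observed_ips) & set(threat_intel))


def plan_response(agencies, threat_ips):
    """Choose an action per agency once at least one threat is detected."""
    if not threat_ips:
        return {}
    return {
        agency.name: "apply_policy_via_api" if agency.has_api
        else "send_email_notification"
        for agency in agencies
    }


# Example with documentation-only IP addresses.
logs = ["192.0.2.10", "198.51.100.7", "203.0.113.5"]
ti = {"198.51.100.7", "203.0.113.99"}
agencies = [Agency("agency-a", True), Agency("agency-b", False)]

threats = find_threats(logs, ti)
actions = plan_response(agencies, threats)
```

In this sketch the decision between policy application and email depends only on whether the agency has an API connection, which mirrors the description above.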
ICACT20240309 Slide.15
The data collecting process is described in this figure. In each target network, packets are transmitted from the switch to the data collecting servers using packet control techniques such as mirroring or inline. The output of the data collecting servers consists of log files or extracted files with unique network characteristics. These files are subsequently transmitted to a virtual machine via the Internet. Before the files reach the virtual machine, a firewall is configured to accept only traffic that matches a specific port number and the IP address of the collection sensor associated with the relevant agency. Finally, an acknowledgment message confirms that the extracted files have been transmitted successfully.
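The firewall rule amounts to a two-field allow-list check, sketched below. The function name and example values are hypothetical, and the sketch assumes the rule matches the sensor's IP as the source address and the port as the destination port, which the talk does not state explicitly.

```python
import ipaddress


def is_allowed(src_ip, dst_port, sensor_ip, allowed_port):
    """Accept traffic only from the registered sensor IP on the allowed port."""
    same_sensor = ipaddress.ip_address(src_ip) == ipaddress.ip_address(sensor_ip)
    return same_sensor and dst_port == allowed_port


# Hypothetical sensor address and port for one agency.
SENSOR_IP = "192.0.2.10"
ALLOWED_PORT = 8443

accepted = is_allowed("192.0.2.10", 8443, SENSOR_IP, ALLOWED_PORT)
wrong_source = is_allowed("192.0.2.11", 8443, SENSOR_IP, ALLOWED_PORT)
wrong_port = is_allowed("192.0.2.10", 22, SENSOR_IP, ALLOWED_PORT)
```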
ICACT20240309 Slide.16
The core of our system is the data collecting process that takes place at the collecting server. Packets are transmitted from the Packet Collecting module to the Metadata extracting module, where the network traffic is analyzed. This process extracts application protocols and flow information, including IP addresses, port numbers, packet counters, and additional details corresponding to the protocol type. The output of the Metadata extracting module, in the form of a string (raw data), is transmitted through a pipeline and stored minute by minute in a temporary file. During data processing, this temporary file is read, and the timestamp is converted to UNIX time format. After removing any abnormal lines, such as those lacking proper newline delimiters, the information is saved to a final file, and the temporary file is deleted. Statistical information regarding the extracted protocols is updated on a minute-by-minute basis.
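A minimal sketch of this post-processing step is shown below. The record layout is an assumption made for illustration: comma-separated fields with a `YYYY-MM-DD HH:MM:SS` timestamp (treated as UTC) in the first field and the protocol name in the second. The function name `process_temp_file` and the file names are hypothetical.

```python
import os
import tempfile
from collections import Counter
from datetime import datetime, timezone


def process_temp_file(temp_path, final_path, stats):
    """Clean one temporary file, append valid lines to the final file,
    update per-protocol counts, then delete the temporary file."""
    kept = 0
    with open(temp_path, "r", encoding="utf-8") as src, \
            open(final_path, "a", encoding="utf-8") as dst:
        for line in src:
            if not line.endswith("\n"):
                continue  # abnormal: line lacks its newline delimiter
            fields = line.rstrip("\n").split(",")
            if len(fields) < 2:
                continue  # abnormal: too few fields
            try:
                ts = datetime.strptime(fields[0], "%Y-%m-%d %H:%M:%S")
            except ValueError:
                continue  # abnormal: unparseable timestamp
            unix_ts = int(ts.replace(tzinfo=timezone.utc).timestamp())
            dst.write(",".join([str(unix_ts)] + fields[1:]) + "\n")
            stats[fields[1]] += 1
            kept += 1
    os.remove(temp_path)
    return kept


# Example: one valid line, one bad timestamp, one truncated line.
workdir = tempfile.mkdtemp()
temp_path = os.path.join(workdir, "minute.tmp")
final_path = os.path.join(workdir, "final.log")
with open(temp_path, "w", encoding="utf-8") as f:
    f.write("2024-03-09 00:00:00,TLS,10\n")
    f.write("not-a-timestamp,DNS,3\n")
    f.write("2024-03-09 00:00:30,DNS,4")  # no trailing newline

stats = Counter()
kept = process_temp_file(temp_path, final_path, stats)
```

Calling the function once per minute with a shared `Counter` would keep the protocol statistics current, in line with the minute-by-minute update described above.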
ICACT20240309 Slide.17
Each metadata record comprises 44 features. Features 1-3 describe the primary protocol, sub-protocol, and Layer 4 protocol. Features 4-7 give the flow start and end times in seconds and sub-seconds. Features 8-11 give the IP addresses and port numbers of both the source and destination. Features 12-19 give the number of packets, the packet size, and the actual valid data throughput in each direction (source to destination and destination to source). Features 20-41 count the packets in the flow that have specific flags enabled, such as TCP Congestion Window Reduced, ECN-Echo, Urgent, Acknowledge, Push, Reset, Synchronize Sequence Numbers, and Finish, again in each direction. The final feature provides additional flow-related information.
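The index ranges above can be written down as a small lookup table, sketched here. The group names and the `group_record` helper are hypothetical. Placing the final feature at position 44 is an assumption, and features 42 and 43 are left ungrouped because the talk does not describe them.

```python
# 1-based, inclusive feature ranges as listed in the talk.
FEATURE_GROUPS = {
    "protocols": (1, 3),
    "flow_times": (4, 7),
    "addresses_and_ports": (8, 11),
    "packet_and_byte_counters": (12, 19),
    "flag_counters": (20, 41),
    "additional_info": (44, 44),  # assumed position of the final feature
}

RECORD_LENGTH = 44


def group_record(record):
    """Split a 44-field record into the named feature groups."""
    if len(record) != RECORD_LENGTH:
        raise ValueError(f"expected {RECORD_LENGTH} fields, got {len(record)}")
    return {
        name: record[start - 1:end]
        for name, (start, end) in FEATURE_GROUPS.items()
    }


# Example: a dummy record whose values equal their 1-based positions.
record = list(range(1, 45))
grouped = group_record(record)
```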
ICACT20240309 Slide.18
Experimental results
ICACT20240309 Slide.19
Real data collection is illustrated below. The system detects a variety of application protocols, including TLS, OpenDNS, Radius, Azure, Microsoft 365, BitTorrent, HTTP, Google, DNS, etc. IP address and port number information has been masked for security purposes. Actual traffic information is retained for the training of AI-based anomaly detection models.
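Masking for display could look something like the sketch below. The talk does not say how the masking is done, so the field names, the placeholder string, and the `mask_record` helper are all assumptions. The original record is left unchanged, so the actual values remain available for training.

```python
SENSITIVE_FIELDS = ("src_ip", "dst_ip", "src_port", "dst_port")


def mask_record(record, sensitive=SENSITIVE_FIELDS, placeholder="***"):
    """Return a copy of the record with address and port fields hidden."""
    return {
        key: placeholder if key in sensitive else value
        for key, value in record.items()
    }


# Example with hypothetical field names and documentation-only addresses.
flow = {
    "protocol": "TLS",
    "src_ip": "192.0.2.10",
    "src_port": 51514,
    "dst_ip": "198.51.100.7",
    "dst_port": 443,
    "packets": 12,
}
masked = mask_record(flow)
```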
ICACT20240309 Slide.20
Conclusions
ICACT20240309 Slide.21
While AI-based anomaly detection is gaining widespread attention and application, the reliance on open datasets for research and testing has certain limitations. Existing open datasets, collected from specific networks, may not be directly applicable to other network environments due to variations in normal and malicious packet behaviors. As a solution, we have proposed a system that collects packets directly from live networks, producing a more accurate representation of the network's unique characteristics. This approach to data collection not only enhances the performance of AI-based anomaly detection but also contributes to the ongoing development of more adaptable systems.
ICACT20240309 Slide.22
ICACT20240309 Slide.23
Thank you for your attention. I'm now open to any questions you may have.