Behavior-based Statistics Preserving Network Trace Anonymization

Modern network measurement research has a clear and demonstrable need for open sharing of large-scale network traffic datasets between organizations. Beyond network measurement, many security-related fields, such as those focused on detecting new exploits or worm outbreaks, stand to benefit from the ability to easily correlate information across several different sources.

Currently, the primary factor limiting such sharing is the risk of disclosing private information. While prior anonymization work has focused on traffic content, analysis based on statistical behavior patterns within network traffic has, so far, been under-explored. The statistics-preserving network trace anonymization project explores a new behavior-based approach to network trace source-anonymization, one that is motivated by anonymity-by-crowds, where traffic mixing is conditioned on the statistical similarity in host behavior. We develop new time-series models for network traffic and kernel metrics for similarity. Anonymity and statistics-preservation are framed as congruent objectives in an unsupervised-learning problem. Under this approach, source-anonymity is connected directly to group size and homogeneity, and metrics for these properties are derived.
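To make the idea of a kernel metric over host behavior concrete, the following is a minimal sketch and not the project's actual model. It assumes each host is summarized by a hypothetical vector of per-bin packet counts and compares hosts with a standard Gaussian (RBF) kernel; the feature choice, the function names, and the `gamma` parameter are all illustrative assumptions.

```python
import math


def normalize(counts):
    """Scale a count vector so it sums to 1 (all zeros stay all zeros)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have equal length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def similarity_matrix(hosts, gamma=0.5):
    """Return (sorted host names, pairwise kernel matrix).

    `hosts` maps a host name to its (hypothetical) per-bin packet counts.
    """
    names = sorted(hosts)
    feats = {name: normalize(hosts[name]) for name in names}
    matrix = [
        [rbf_kernel(feats[a], feats[b], gamma) for b in names]
        for a in names
    ]
    return names, matrix
```

Hosts whose normalized traffic profiles are close receive a kernel value near 1, while hosts with dissimilar profiles receive a value closer to 0; such a matrix is one possible input to a grouping step.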

Optimal segmentation of the population into anonymized groups is approximated as a graph-partitioning problem. Algorithms that guarantee a minimum anonymity-set size are presented, as well as novel techniques for behavior visualization and compression. Empirical evaluations on a range of network traffic datasets show significant advantages in both accuracy and runtime over similar solutions.
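The minimum anonymity-set guarantee can be illustrated with a simple greedy heuristic. This is a sketch under stated assumptions, not the graph-partitioning algorithm presented in this work: it assumes a precomputed pairwise similarity matrix `sim` (higher means more similar) and a hypothetical parameter `k` for the minimum group size.

```python
def partition_min_size(names, sim, k):
    """Greedily split hosts into groups that each have at least k members.

    names: list of host identifiers.
    sim:   square matrix, sim[i][j] is the similarity of names[i] and names[j].
    k:     minimum group (anonymity-set) size.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(names) < k:
        raise ValueError("fewer hosts than the minimum group size")

    unassigned = list(range(len(names)))
    groups = []
    # Carve off groups of exactly k while at least k hosts would remain.
    while len(unassigned) >= 2 * k:
        seed = unassigned[0]
        # Rank the remaining hosts by similarity to the seed, most similar first.
        ranked = sorted(unassigned[1:], key=lambda j: sim[seed][j], reverse=True)
        members = [seed] + ranked[: k - 1]
        groups.append([names[i] for i in members])
        taken = set(members)
        unassigned = [i for i in unassigned if i not in taken]
    # The remainder has between k and 2k - 1 hosts and forms the last group.
    groups.append([names[i] for i in unassigned])
    return groups
```

Because the loop stops as soon as fewer than 2k hosts remain, the final group always holds at least k hosts, so no host ends up in an anonymity set smaller than k. The greedy seed choice makes no claim to optimal homogeneity.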