Abstract: Anomaly detection is a ubiquitous problem in machine learning. One is given a large population of points, about whose structure we may have little a priori knowledge. The goal is to figure out what is "typical" of the dataset and which points are atypical or anomalous. This talk will explore a framework called Partial Identification for identifying anomalies in a large dataset.
We propose a definition of “anomalousness” that captures the intuition that anomalies are easy to distinguish from the overwhelming majority of points by relatively few attribute values: we call this partial identification. It is inspired by the notion of Partial IDs studied by Yehudayoff and Wigderson in the context of population recovery. Formalizing this intuition, we propose a geometric anomaly measure for a point that we call PIDScore, which measures the minimum density of data points over all subcubes containing the point. We present PIDForest, a random-forest-based algorithm that finds anomalies according to this definition, and show that it performs favorably in comparison to several popular anomaly detection methods across a broad range of benchmarks.
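For concreteness, one way to write this definition, sketched from the description above (the paper's exact normalization, and whether it is stated in terms of density or sparsity, may differ): writing $T$ for the dataset and letting $C$ range over axis-aligned subcubes of the attribute space that contain a point $x$,
\[
  \mathrm{PIDScore}(x) \;=\; \min_{C \ni x} \frac{|T \cap C|}{\mathrm{vol}(C)},
\]
so a point is flagged as anomalous when some subcube containing it holds few data points relative to its volume.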
Based on joint work with Udi Wieder (VMware) and Vatsal Sharan (Stanford) that appeared in NeurIPS 2019.