Mining Distance-Based Outliers in Near Linear Time

Published by Dashlink | National Aeronautics and Space Administration | Metadata Last Checked: August 04, 2025 | Last Modified: 2025-03-31

Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule

Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
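The abstract names the ingredients (a nested loop, randomly ordered data, and a pruning rule) without spelling them out. The following is a minimal Python sketch of one way they can fit together, not the authors' implementation. It assumes an outlier score equal to the distance to the k-th nearest neighbor, and it assumes the pruning rule is "stop scanning an example as soon as its running score drops below the weakest score among the current top outliers". The function name `top_outliers` and its parameters are hypothetical, and the block processing used for large data sets is omitted.

```python
import heapq
import math
import random


def top_outliers(points, k=2, n_outliers=1, seed=0):
    """Return [(score, index), ...] for the n_outliers highest-scoring points.

    Assumption: score = distance to the k-th nearest neighbor.
    """
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)  # scan candidate neighbors in random order

    top = []      # min-heap of (score, index): current best outliers
    cutoff = 0.0  # weakest score in `top` once it is full

    for i in range(len(points)):
        nearest = []  # negated distances: a max-heap of the k closest so far
        pruned = False
        for j in order:
            if j == i:
                continue
            d = math.dist(points[i], points[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # The running k-th nearest distance can only shrink, so once it
            # is below the cutoff this point cannot enter the top list.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned or len(nearest) < k:
            continue
        score = -nearest[0]
        if len(top) < n_outliers:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n_outliers:
            cutoff = top[0][0]

    return sorted(top, reverse=True)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    for score, index in top_outliers(data, k=2, n_outliers=1):
        print(index, data[index], round(score, 3))
```

In this sketch a typical non-outlier finds k close neighbors after scanning only a few randomly ordered examples and is then pruned, which is the intuition behind the abstract's remark that the time to process non-outliers does not depend on the size of the data set.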

Complete Metadata

@type	dcat:Dataset
accessLevel	public
accrualPeriodicity	irregular
bureauCode	[ "026:00" ]
contactPoint	{ "fn": "MARK SCHWABACHER", "@type": "vcard:Contact", "hasEmail": "mailto:mark.a.schwabacher@nasa.gov" }
description	Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
distribution	[ { "@type": "dcat:Distribution", "title": "BaySchwabacherKDD2003.pdf", "format": "PDF", "mediaType": "application/pdf", "description": "Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule", "downloadURL": "https://c3.nasa.gov/dashlink/static/media/publication/BaySchwabacherKDD2003.pdf" } ]
identifier	DASHLINK_191
issued	2010-09-22
keyword	[ "ames", "dashlink", "nasa" ]
landingPage	https://c3.nasa.gov/dashlink/resources/191/
modified	2025-03-31
programCode	[ "026:029" ]
publisher	{ "name": "Dashlink", "@type": "org:Organization" }
title	Mining Distance-Based Outliers in Near Linear Time
