Title: | Anomaly Detection in High Dimensional and Temporal Data |
---|---|
Description: | This is a modification of the 'HDoutliers' package. The 'HDoutliers' algorithm is a powerful unsupervised algorithm for detecting anomalies in high-dimensional data, with a strong theoretical foundation. However, it has limitations that significantly hinder its performance under certain circumstances. This package implements the algorithm proposed in Talagala, Hyndman and Smith-Miles (2019) <arXiv:1908.04000>, which addresses these limitations of the 'HDoutliers' algorithm. We define an anomaly as an observation that deviates markedly from the majority with a large distance gap. An approach based on extreme value theory is used to calculate the anomaly threshold. |
Authors: | Priyanga Dilini Talagala [aut, cre] , Rob J Hyndman [ths] , Kate Smith-Miles [ths] |
Maintainer: | Priyanga Dilini Talagala <[email protected]> |
License: | GPL-2 |
Version: | 0.1.1 |
Built: | 2024-11-14 02:58:19 UTC |
Source: | https://github.com/pridiltal/stray |
A bivariate dataset with an outlier
data_a
A data frame with 1001 rows and 3 variables:
numerical variable
numerical variable
Type of data point: Typical or Outlier
A bivariate dataset with two typical classes and a micro cluster
data_b
A data frame with 2003 rows and 3 variables:
numerical variable
numerical variable
Type of data point: Typical or Outlier
A bivariate dataset with local anomalies and two micro clusters
data_c
A data frame with 1009 rows and 3 variables:
numerical variable
numerical variable
Type of data point: Typical or Outlier
A bivariate dataset with two inliers. The inliers are very close to one another
data_d
A data frame with 1002 rows and 3 variables:
numerical variable
numerical variable
Type of data point: Typical or Outlier
A bimodal dataset with an inlier. One typical class is a very dense cluster.
data_e
A data frame with 2001 rows and 3 variables:
numerical variable
numerical variable
Type of data point: Typical or Outlier
A dataset with an outlier. The typical class is a very dense cluster.
data_f
A data frame with 2001 rows and 3 variables:
numerical variable
numerical variable
Type of data point: Typical or Outlier
Provide a 2D scatterplot of data for visual exploration. For data with more than two dimensions, a two-dimensional scatterplot is produced using the first two principal components.
display_HDoutliers(data, out)
data |
A vector, matrix, or data frame consisting of numerical variables. |
out |
A list containing output values produced by find_HDoutliers. |
A ggplot object of data space with detected outliers (if any).
data <- c(rnorm(100), 7, 7.5, rnorm(100, 20), 45)
output <- find_HDoutliers(data, knnsearchtype = "kd_tree")
display_HDoutliers(data, out = output)

data <- rbind(matrix(rnorm(96), ncol = 2), c(10, 12), c(3, 7))
output <- find_HDoutliers(data, knnsearchtype = "brute")
display_HDoutliers(data, out = output)

data <- rbind(matrix(rnorm(144), ncol = 3), c(10, 12, 10), c(3, 7, 10))
output <- find_HDoutliers(data, knnsearchtype = "brute")
display_HDoutliers(data, out = output)
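The projection step described for data with more than two dimensions can be illustrated in base R. This is a minimal sketch, not the package's plotting code: it assumes the projection is an ordinary principal component analysis (here via `prcomp`), it uses base graphics rather than ggplot2, and the flagged row indexes in `flagged` are hypothetical values chosen for illustration.

```r
# Sketch: project 3-D data onto its first two principal components
# and highlight a set of flagged rows (indexes chosen by hand here).
set.seed(1)
dat <- rbind(matrix(rnorm(144), ncol = 3), c(10, 12, 10), c(3, 7, 10))

# Centre the columns and compute principal components.
pc <- prcomp(dat, center = TRUE, scale. = FALSE)

# Keep the scores on the first two components for plotting.
proj <- pc$x[, 1:2]

# Hypothetical outlier indexes: the two rows appended above.
flagged <- c(49, 50)
is_out <- seq_len(nrow(dat)) %in% flagged

plot(proj,
     col = ifelse(is_out, "red", "black"), pch = 19,
     xlab = "PC1", ylab = "PC2")
```

In practice the flagged indexes would come from the detection step rather than being set by hand.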
Detect anomalies in high dimensional data. This is a modification of HDoutliers.
find_HDoutliers( data, alpha = 0.01, k = 10, knnsearchtype = "brute", normalize = "unitize", p = 0.5, tn = 50 )
data |
A vector, matrix, or data frame consisting of numerical variables. |
alpha |
Threshold for determining the cutoff for outliers. Observations are considered
outliers if they fall in the |
k |
Number of neighbours considered. |
knnsearchtype |
A character vector indicating the search type for k-nearest-neighbors. |
normalize |
Method to normalize the columns of the data. This prevents variables with large variances from having a disproportionate influence on Euclidean distances. Two options are available: "standardize" or "unitize". Default is set to "unitize". |
p |
Proportion of possible candidates for outliers. This defines the starting point for the bottom up searching algorithm. Default is set to 0.5. |
tn |
Sample size to calculate an empirical threshold. Default is set to 50. |
The indexes of the observations determined to be outliers.
Wilkinson, L. (2018), 'Visualizing big data outliers through distributed aggregation', IEEE transactions on visualization and computer graphics 24(1), 256-266.
require(ggplot2)
set.seed(1234)
data <- c(rnorm(1000, mean = -6), 0, rnorm(1000, mean = 6))
outliers <- find_HDoutliers(data, knnsearchtype = "kd_tree")

set.seed(1234)
n <- 1000 # number of observations
nout <- 10 # number of outliers
typical_data <- matrix(rnorm(2 * n), ncol = 2, byrow = TRUE)
out <- matrix(5 * runif(2 * nout, min = -5, max = 5), ncol = 2, byrow = TRUE)
data <- rbind(out, typical_data)
outliers <- find_HDoutliers(data, knnsearchtype = "brute")
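The normalize argument rescales each column before distances are computed. The sketch below illustrates the two options in base R. It assumes that "unitize" means min-max rescaling to [0, 1] and that "standardize" means centering by the median and scaling by the interquartile range; the package's internal definitions may differ. The helper names `unitize_sketch` and `standardize_sketch` are hypothetical.

```r
# Assumed meaning of "unitize": rescale a column to the range [0, 1].
unitize_sketch <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(x)))  # constant column
  (x - rng[1]) / (rng[2] - rng[1])
}

# Assumed meaning of "standardize": robust centring and scaling.
standardize_sketch <- function(x) {
  (x - median(x, na.rm = TRUE)) / IQR(x, na.rm = TRUE)
}

# Two columns on very different scales.
dat <- cbind(a = c(1, 2, 3, 4, 100), b = c(0.1, 0.2, 0.3, 0.4, 0.5))

unitized <- apply(dat, 2, unitize_sketch)
standardized <- apply(dat, 2, standardize_sketch)

# After unitizing, both columns span the same range, so neither
# dominates a Euclidean distance purely because of its units.
apply(unitized, 2, range)
```

Without rescaling, distances between rows of `dat` would be driven almost entirely by column `a`.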
Find Outlier Threshold
find_threshold( outlier_score, alpha = 0.01, outtail = c("max", "min"), p = 0.5, tn = 50 )
outlier_score |
A vector of outlier scores. Can be a named vector or a vector with no names. |
alpha |
Threshold for determining the cutoff for outliers. Observations are considered
outliers if they fall in the |
outtail |
Direction of the outlier tail. |
p |
Proportion of possible candidates for outliers. This defines the starting point for the bottom up searching algorithm. |
tn |
Sample size to calculate an empirical threshold. |
The indexes (or names, if the input is a named vector) of the observations determined to be outliers.
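The idea of a bottom-up search over sorted scores can be sketched in base R. This is a simplified illustration and not the package's extreme-value computation: it assumes that p marks the upper fraction of sorted scores treated as candidates, that tn caps the number of preceding gaps used as a reference, and that a gap is "large" when it exceeds log(1/alpha) times the mean of those preceding gaps. The function name `gap_threshold_sketch` and that decision rule are choices made for this sketch only.

```r
# Simplified bottom-up gap search over sorted outlier scores.
# Returns the indexes of scores at or above the first large gap.
gap_threshold_sketch <- function(score, alpha = 0.01, p = 0.5, tn = 50) {
  n <- length(score)
  ord <- order(score)
  gaps <- c(0, diff(score[ord]))        # gaps[i] = sorted[i] - sorted[i - 1]

  m <- max(min(tn, floor(n / 4)), 2)    # number of preceding gaps to average
  start <- max(floor(n * (1 - p)), m + 1) + 1
  if (start > n) return(integer(0))

  for (i in start:n) {
    ref <- mean(gaps[(i - m):(i - 1)])  # typical spacing just below position i
    if (gaps[i] > log(1 / alpha) * ref) {
      return(sort(ord[i:n]))            # everything above the gap is flagged
    }
  }
  integer(0)                            # no large gap found
}

# Evenly spaced scores with one score far above the rest.
scores <- c(seq(0.01, 1, length.out = 100), 5)
flagged <- gap_threshold_sketch(scores)
```

Searching upward from the bulk of the scores means that a tight group of several large scores is flagged together once the first large gap is found.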
A dataset with hourly pedestrian counts at 43 locations in the city of Melbourne, Australia, from 1 December 2018 to 1 January 2019.
ped_data
A data frame with 33024 rows and 5 variables:
Sensor location
Time and date
Date
Time
Pedestrian count
This package is a modification of the HDoutliers package. HDoutliers is a powerful algorithm for the
detection of anomalous observations in a dataset, which has (among other advantages) the ability to detect
clusters of outliers in multi-dimensional data without requiring a model of the typical behavior of the system.
However, it suffers from some limitations that affect its accuracy. In this package, we propose solutions to
the limitations of HDoutliers, and propose an extension of the algorithm to deal with data streams that exhibit
non-stationary behavior. The results show that our proposed algorithm improves the accuracy, and enables the
trade-off between false positives and negatives to be better balanced.
The name stray comes from Search and TRace AnomalY.
Talagala, P. D., Hyndman, R. J., & Smith-Miles, K. (2019). Anomaly Detection in High Dimensional Data. https://www.monash.edu/business/ebs/research/publications/ebs/wp20-2019.pdf
Wilkinson, L. (2017). Visualizing big data outliers through distributed aggregation. IEEE transactions on visualization and computer graphics, 24(1), 256-266. https://www.cs.uic.edu/~wilkinson/Publications/outliers.pdf
The core functions in this package: find_HDoutliers, display_HDoutliers
Full documentation and demos:
Find outliers using kNN distance with maximum gap
use_KNN(data, alpha, k, knnsearchtype, p, tn)
data |
A vector, matrix, or data frame consisting of numeric and/or categorical variables. |
alpha |
Threshold for determining the cutoff for outliers. Observations are considered
outliers if they fall in the |
k |
Number of neighbours considered. |
knnsearchtype |
A character vector indicating the search type for k-nearest-neighbors. |
p |
Proportion of possible candidates for outliers. This defines the starting point for the bottom up searching algorithm. |
tn |
Sample size to calculate an empirical threshold. Default is set to 50. |
The indexes of the observations determined to be outliers and the outlying scores.
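A kNN distance with maximum gap can be computed as follows. This is a minimal base R sketch of the scoring idea only, assuming the score of a point is the nearest-neighbour distance that sits just above the largest jump among its k sorted neighbour distances. The function name `knn_max_gap_score` is hypothetical, the full distance matrix is used instead of a kd-tree search, and no thresholding is applied.

```r
# Score each row by the neighbour distance just above the largest
# jump among its k sorted nearest-neighbour distances.
knn_max_gap_score <- function(x, k = 10) {
  x <- as.matrix(x)
  stopifnot(k >= 2, nrow(x) > k)

  d <- as.matrix(dist(x))   # full pairwise Euclidean distances
  diag(d) <- Inf            # a point is not its own neighbour

  score <- apply(d, 1, function(row) {
    nn <- sort(row)[seq_len(k)]   # k nearest-neighbour distances
    gaps <- diff(nn)              # successive differences
    nn[which.max(gaps) + 1]       # distance just above the largest gap
  })
  unname(score)
}

set.seed(1)
typical <- matrix(rnorm(100), ncol = 2)                    # 50 typical points
micro <- cbind(c(10, 10.1, 10.2), c(10, 10.1, 9.9))        # 3-point micro cluster
scores <- knn_max_gap_score(rbind(typical, micro), k = 10)
```

Points in the micro cluster have close neighbours within the cluster, so their first nearest-neighbour distance is small. Looking for the largest gap among the k distances lets the score jump to the distance between the micro cluster and the main cloud.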
A bivariate dataset with an inlier and an outlier
wheel1
A data frame with 1002 rows and 3 variables:
numerical variable
numerical variable
Type of data point: Typical or Outlier