Title: | Leave One Out Kernel Density Estimates for Outlier Detection |
---|---|
Description: | Outlier detection using leave-one-out kernel density estimates and extreme value theory. The bandwidth for kernel density estimates is computed using persistent homology, a technique in topological data analysis. Using peak-over-threshold method, a generalized Pareto distribution is fitted to the log of leave-one-out kde values to identify outliers. |
Authors: | Sevvandi Kandanaarachchi [aut, cre] , Rob Hyndman [aut] , Chris Fraley [ctb] |
Maintainer: | Sevvandi Kandanaarachchi <[email protected]> |
License: | GPL-3 |
Version: | 0.1.5 |
Built: | 2024-11-12 03:40:10 UTC |
Source: | https://github.com/sevvandi/lookout |
Scatterplot of two columns from the data set with outliers highlighted.
## S3 method for class 'lookoutliers'
autoplot(object, columns = 1:2, ...)
object |
The output of the function 'lookout'. |
columns |
Which columns of the original data to plot (specified as either numbers or strings) |
... |
Other arguments currently ignored. |
A ggplot object.
X <- rbind(
  data.frame(x = rnorm(500), y = rnorm(500)),
  data.frame(x = rnorm(5, mean = 10, sd = 0.2), y = rnorm(5, mean = 10, sd = 0.2))
)
lo <- lookout(X)
autoplot(lo)
This function plots outlier persistence for a range of significance levels using the algorithm lookout, an outlier detection method that uses leave-one-out kernel density estimates and generalized Pareto distributions to find outliers.
## S3 method for class 'persistingoutliers'
autoplot(object, alpha = object$alpha, ...)
object |
The output of the function 'persisting_outliers'. |
alpha |
The significance levels to plot. |
... |
Other arguments currently ignored. |
A ggplot object.
X <- rbind(
  data.frame(x = rnorm(500), y = rnorm(500)),
  data.frame(x = rnorm(5, mean = 10, sd = 0.2), y = rnorm(5, mean = 10, sd = 0.2))
)
plot(X, pch = 19)
outliers <- persisting_outliers(X, unitize = FALSE)
autoplot(outliers)
This function identifies the bandwidth used to compute the kernel density estimates. It uses topological data analysis (TDA) to find the bandwidth.
find_tda_bw(X, fast)
X |
The input data in a dataframe, matrix or tibble format. |
fast |
If set to |
The bandwidth
X <- rbind(
  data.frame(x = rnorm(500), y = rnorm(500)),
  data.frame(x = rnorm(5, mean = 10, sd = 0.2), y = rnorm(5, mean = 10, sd = 0.2))
)
find_tda_bw(X, fast = TRUE)
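To give a feel for how persistent homology can suggest a bandwidth, here is a minimal base R sketch. It relies on the fact that single-linkage merge heights coincide, up to a scaling convention, with the death values of connected components in a Vietoris-Rips filtration. Choosing the death value just below the largest gap is an assumption made for illustration and may differ from what find_tda_bw actually does; bw_sketch is a hypothetical name.

```r
set.seed(1)

# Bulk of the data plus a small, distant cluster (mirrors the example above).
X <- rbind(
  matrix(rnorm(200), ncol = 2),
  matrix(rnorm(10, mean = 10, sd = 0.2), ncol = 2)
)

# Single-linkage merge heights: the distances at which connected
# components merge, i.e. 0-dimensional death values (up to scaling).
death <- sort(hclust(dist(X), method = "single")$height)

# Assumption: take the death value just before the largest jump,
# so that the bandwidth reflects the scale of the dense structure.
gaps <- diff(death)
bw_sketch <- death[which.max(gaps)]
bw_sketch
```

The value returned by this sketch is on the scale of pairwise distances; the package may rescale or post-process its result differently.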
This function identifies outliers using the algorithm lookout, an outlier detection method that uses leave-one-out kernel density estimates and generalized Pareto distributions to find outliers.
lookout(X, alpha = 0.05, unitize = TRUE, bw = NULL, gpd = NULL, fast = TRUE)
X |
The input data in a dataframe, matrix or tibble format. |
alpha |
The level of significance. Default is 0.05. |
unitize |
An option to normalize the data. Default is TRUE. |
bw |
Bandwidth parameter. Default is NULL. |
gpd |
Generalized Pareto distribution parameters. If 'NULL' (the default), these are estimated from the data. |
fast |
If set to |
A list with the following components:
outliers |
The set of outliers. |
outlier_probability |
The GPD probability of the data. |
outlier_scores |
The outlier scores of the data. |
bandwidth |
The bandwidth selected using persistent homology. |
kde |
The kernel density estimate values. |
lookde |
The leave-one-out kde values. |
gpd |
The fitted GPD parameters. |
X <- rbind(
  data.frame(x = rnorm(500), y = rnorm(500)),
  data.frame(x = rnorm(5, mean = 10, sd = 0.2), y = rnorm(5, mean = 10, sd = 0.2))
)
lo <- lookout(X)
lo
autoplot(lo)
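The difference between the kde and lookde components can be illustrated with a small base R sketch. It is not the package's implementation: the Gaussian kernel and the fixed bandwidth h = 0.5 are assumptions chosen for simplicity. The point is that an isolated observation supports its own ordinary density estimate, whereas its leave-one-out estimate collapses towards zero.

```r
set.seed(1)

# 100 standard normal points plus one isolated point at (10, 10).
X <- rbind(matrix(rnorm(200), ncol = 2), c(10, 10))
n <- nrow(X)
d <- ncol(X)
h <- 0.5  # hypothetical bandwidth, fixed by hand for this sketch

# Gaussian kernel evaluated on all pairwise squared distances.
D2 <- as.matrix(dist(X))^2
K <- exp(-D2 / (2 * h^2)) / ((2 * pi)^(d / 2) * h^d)

# Ordinary KDE: every point contributes, including the point itself.
kde <- rowSums(K) / n

# Leave-one-out KDE: drop each point's own contribution.
diag(K) <- 0
lookde <- rowSums(K) / (n - 1)

# The isolated point has a visible kde value but a near-zero lookde value.
c(kde = kde[n], lookde = lookde[n])
```

Because the isolated point has no neighbours within a few bandwidths, its leave-one-out density is by far the smallest in the sample, which is what makes these values useful as an outlier signal.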
This is the time series implementation of lookout.
lookout_ts(x, alpha = 0.05)
x |
The input univariate time series. |
alpha |
The level of significance. Default is 0.05. |
A lookout object.
set.seed(1)
x <- arima.sim(list(order = c(1, 1, 0), ar = 0.8), n = 200)
x[50] <- x[50] + 10
plot(x)
lo <- lookout_ts(x)
lo
This function computes outlier persistence for a range of significance values, using the algorithm lookout, an outlier detection method that uses leave-one-out kernel density estimates and generalized Pareto distributions to find outliers.
persisting_outliers(
  X,
  alpha = seq(0.01, 0.1, by = 0.01),
  st_qq = 0.9,
  unitize = TRUE,
  num_steps = 20
)
X |
The input data in a matrix, data.frame, or tibble format. All columns should be numeric. |
alpha |
Grid of significance levels. |
st_qq |
The starting quantile for death radii sequence. This will be used to compute the starting bandwidth value. |
unitize |
An option to normalize the data. Default is TRUE. |
num_steps |
The length of the bandwidth sequence. |
A list with the following components:
out |
A 3D array of |
bw |
The set of bandwidth values. |
gpdparas |
The GPD parameters used. |
lookoutbw |
The bandwidth chosen by the algorithm. |
X <- rbind(
  data.frame(x = rnorm(500), y = rnorm(500)),
  data.frame(x = rnorm(5, mean = 10, sd = 0.2), y = rnorm(5, mean = 10, sd = 0.2))
)
plot(X, pch = 19)
outliers <- persisting_outliers(X, unitize = FALSE)
outliers
autoplot(outliers)
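The role of the significance grid can be sketched with a peaks-over-threshold fit in base R. This is a rough illustration under stated assumptions, not the package's code: the scores are simulated, the 90% threshold is arbitrary, the generalized Pareto distribution is fitted by a hand-written maximum likelihood routine, and flagging points whose unconditional tail probability falls below alpha is an assumed decision rule. All names (scores, gpd_nll, gpd_sf, n_flagged) are hypothetical.

```r
set.seed(42)

# Hypothetical anomaly scores: larger means more unusual
# (for example, the negative log of a density estimate).
scores <- c(rexp(500), 12, 14)

# Peaks over threshold: keep only the excesses above a high quantile.
threshold <- unname(quantile(scores, 0.90))
is_exceed <- scores > threshold
exceed <- scores[is_exceed] - threshold

# Negative log-likelihood of the GPD (scale sigma > 0, shape xi).
gpd_nll <- function(par, y) {
  sigma <- exp(par[1])  # log-parameterised to keep sigma positive
  xi <- par[2]
  if (abs(xi) < 1e-8) return(length(y) * log(sigma) + sum(y) / sigma)
  z <- 1 + xi * y / sigma
  if (any(z <= 0)) return(1e10)  # outside the support
  length(y) * log(sigma) + (1 + 1 / xi) * sum(log(z))
}

fit <- optim(c(log(mean(exceed)), 0.1), gpd_nll, y = exceed)
sigma <- exp(fit$par[1])
xi <- fit$par[2]

# GPD survival function P(Y > y) for excesses y >= 0.
gpd_sf <- function(y, sigma, xi) {
  if (abs(xi) < 1e-8) return(exp(-y / sigma))
  pmax(1 + xi * y / sigma, 0)^(-1 / xi)
}

# Unconditional tail probability; points below the threshold get 1.
p <- rep(1, length(scores))
p[is_exceed] <- mean(is_exceed) * gpd_sf(exceed, sigma, xi)

# Number of flagged points for each significance level in the grid.
alpha <- seq(0.01, 0.1, by = 0.01)
n_flagged <- sapply(alpha, function(a) sum(p < a))
data.frame(alpha = alpha, n_flagged = n_flagged)
```

Points that remain flagged across most of the alpha grid are the ones that would be described as persistent under this simplified rule; persisting_outliers additionally varies the bandwidth, which this sketch does not attempt.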