| Title: | Leave One Out Kernel Density Estimates for Outlier Detection |
|---|---|
| Description: | Outlier detection using leave-one-out kernel density estimates and extreme value theory. The bandwidth for kernel density estimates is computed using persistent homology, a technique in topological data analysis. Using peak-over-threshold method, a generalized Pareto distribution is fitted to the log of leave-one-out kde values to identify outliers. |
| Authors: | Sevvandi Kandanaarachchi [aut, cre] (ORCID: <https://orcid.org/0000-0002-0337-0395>), Rob Hyndman [aut] (ORCID: <https://orcid.org/0000-0002-2140-5352>), Chris Fraley [ctb] |
| Maintainer: | Sevvandi Kandanaarachchi <[email protected]> |
| License: | GPL-3 |
| Version: | 2.0.1.00 |
| Built: | 2026-06-02 09:18:03 UTC |
| Source: | https://github.com/sevvandi/lookout |
Scatterplot of two columns from the data set with outliers highlighted.
## S3 method for class 'lookoutliers'
autoplot(object, columns = 1:2, ...)
object |
The output of the function |
columns |
Which columns of the original data to plot (specified as either numbers or strings) |
... |
Other arguments currently ignored. |
A ggplot object.
X <- rbind(
  data.frame(x = rnorm(500), y = rnorm(500)),
  data.frame(x = rnorm(5, mean = 10, sd = 0.2), y = rnorm(5, mean = 10, sd = 0.2))
)
lo <- lookout(X)
autoplot(lo)
This function plots outlier persistence for a range of significance levels using the algorithm lookout, an outlier detection method that uses leave-one-out kernel density estimates and generalized Pareto distributions to find outliers.
## S3 method for class 'persistingoutliers'
autoplot(object, alpha = object$alpha, ...)
object |
The output of the function |
alpha |
The significance levels to plot. |
... |
Other arguments currently ignored. |
A ggplot object.
X <- rbind(
  data.frame(x = rnorm(500), y = rnorm(500)),
  data.frame(x = rnorm(5, mean = 10, sd = 0.2), y = rnorm(5, mean = 10, sd = 0.2))
)
plot(X, pch = 19)
outliers <- persisting_outliers(X, scale = FALSE)
autoplot(outliers)
This function identifies the bandwidth used to compute the kernel density estimates. The bandwidth is found using topological data analysis (TDA).
find_tda_bw(X, fast = TRUE, gamma = 0.98, use_differences = FALSE)
X |
The numerical input data in a data.frame, matrix or tibble format. |
fast |
If |
gamma |
Parameter for bandwidth calculation giving the quantile of the
Rips death radii to use for the bandwidth. Default is |
use_differences |
If TRUE, the bandwidth is set to the lower point of the maximum Rips death radii differences. If FALSE, the gamma quantile of the Rips death radii is used. Default is FALSE. |
The bandwidth
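The quantile rule described for `gamma` can be sketched in base R. This is only an illustration under stated assumptions, not the code inside `find_tda_bw()`: it assumes the death radii are those of the dimension-zero (connected component) features of the Rips filtration, which coincide with single-linkage merge heights, and it measures them as pairwise distances rather than half-distances. The name `bw_sketch` is hypothetical.

```r
# Illustrative sketch only; the package may use a different scale convention,
# homology dimension, or approximation (for example when fast = TRUE).
set.seed(1)
X <- rbind(
  cbind(rnorm(200), rnorm(200)),
  cbind(rnorm(5, mean = 10, sd = 0.2), rnorm(5, mean = 10, sd = 0.2))
)

# Single-linkage merge heights are the distances at which connected
# components of the Rips filtration merge (dimension-zero death values)
death <- hclust(dist(X), method = "single")$height

# Take the gamma quantile of the death values as the bandwidth
gamma <- 0.98
bw_sketch <- unname(quantile(death, probs = gamma))
bw_sketch
```

A high quantile such as 0.98 gives a bandwidth large enough to connect most of the bulk of the data while ignoring the few largest merges, which tend to be the ones that attach isolated points or small clusters.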
X <- rbind(
  data.frame(x = rnorm(500), y = rnorm(500)),
  data.frame(x = rnorm(5, mean = 10, sd = 0.2), y = rnorm(5, mean = 10, sd = 0.2))
)
find_tda_bw(X, fast = TRUE)
This function identifies outliers using the algorithm lookout, an outlier detection method that uses leave-one-out kernel density estimates and generalized Pareto distributions to find outliers.
lookout(
  X,
  alpha = 0.01,
  beta = 0.9,
  gamma = 0.98,
  bw = NULL,
  gpd = NULL,
  scale = TRUE,
  fast = NROW(X) > 1e+05,
  old_version = FALSE
)
X |
The numerical input data in a data.frame, matrix or tibble format. |
alpha |
The level of significance. Default is |
beta |
The quantile threshold used in the GPD estimation. Default is |
gamma |
Parameter for bandwidth calculation giving the quantile of the
Rips death radii to use for the bandwidth. Default is |
bw |
Bandwidth parameter. If |
gpd |
Generalized Pareto distribution parameters. If |
scale |
If |
fast |
If |
old_version |
Logical indicator of which version of the algorithm to use. Default is FALSE, meaning the newer version is used. |
A list with the following components:
outliers |
The set of outliers. |
outlier_probability |
The GPD probability of the data. |
outlier_scores |
The outlier scores of the data. |
bandwidth |
The bandwidth selected using persistent homology. |
kde |
The kernel density estimate values. |
lookde |
The leave-one-out kde values. |
gpd |
The fitted GPD parameters. |
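To make the difference between the `kde` and `lookde` components concrete, here is a minimal base R sketch of an ordinary and a leave-one-out kernel density estimate. It is not the package's implementation: the Gaussian kernel, the hand-picked bandwidth `h`, and the single isolated point at (10, 10) are all assumptions made for illustration.

```r
# Illustrative sketch only: Gaussian kernel with a fixed, hand-picked bandwidth.
set.seed(1)
inliers <- cbind(rnorm(200), rnorm(200))
X <- rbind(inliers, c(10, 10))   # last row is one isolated point
n <- nrow(X)
d <- ncol(X)
h <- 1                           # hypothetical bandwidth

# K[i, j] is the Gaussian kernel evaluated at the distance between rows i and j
D2 <- as.matrix(dist(X))^2
K <- exp(-D2 / (2 * h^2)) / ((2 * pi * h^2)^(d / 2))

# Ordinary KDE at each observation includes that observation's own kernel
kde <- rowSums(K) / n

# Leave-one-out KDE removes the diagonal term and averages over n - 1 points
lookde <- (rowSums(K) - diag(K)) / (n - 1)

# The isolated point keeps some density in kde (from its own kernel)
# but almost none in lookde, so it has the smallest leave-one-out value
unname(which.min(lookde))
```

Because an isolated point contributes most of its own density, removing it makes the drop far more pronounced than for points in the bulk, which is why the leave-one-out values are the ones passed on to the extreme value step.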
Kandanaarachchi, S, and Hyndman, RJ (2022) Leave-one-out kernel density estimates for outlier detection, J Computational & Graphical Statistics, 31(2), 586-599. https://robjhyndman.com/publications/lookout/.
Hyndman, RJ, Kandanaarachchi, S, and Turner, K (2026) When lookout meets crackle: Anomaly detection using kernel density estimation, in preparation. https://robjhyndman.com/publications/lookout2.html
X <- rbind(
  data.frame(x = rnorm(500), y = rnorm(500)),
  data.frame(x = rnorm(5, mean = 10, sd = 0.2), y = rnorm(5, mean = 10, sd = 0.2))
)
lo <- lookout(X)
lo
autoplot(lo)
This is the time series implementation of lookout, which identifies outliers in the double-differenced time series.
lookout_ts(x, scale = FALSE, ...)
x |
The input univariate time series. |
scale |
If |
... |
Other arguments are passed to |
A lookout object.
set.seed(1)
x <- arima.sim(list(order = c(1, 1, 0), ar = 0.8), n = 200)
x[50] <- x[50] + 10
plot(x)
lo <- lookout_ts(x)
lo
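The effect of double differencing, which `lookout_ts()` applies before looking for outliers, can be seen with base R alone. The sketch below uses a hypothetical smooth series with one additive spike; it shows only the differencing step, not the outlier detection itself.

```r
# Illustrative sketch only: a smooth quadratic trend with a spike at t = 50.
x <- (1:100)^2 / 100
x[50] <- x[50] + 10

# Second differences: d2[t] = x[t + 2] - 2 * x[t + 1] + x[t]
d2 <- diff(x, differences = 2)

# The trend is reduced to a small constant, while the spike appears in three
# consecutive differences with weights +1, -2, +1 (positions 48, 49 and 50)
round(d2[47:51], 2)
which.max(abs(d2))
```

Double differencing removes smooth trends, so a spike that is hard to separate from the level of the original series stands out clearly in the differenced one. Positions in the differenced series are shifted relative to the original time index.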
A multivariate version of base::scale() that takes account
of the covariance matrix of the data, and uses robust estimates
of center, scale and covariance by default. By default, each variable is centered
using its median and scaled using the robust Qn estimator (robustbase::s_Qn), and the
covariance matrix is estimated using a robust OGK estimate (robustbase::covOGK).
The data are then transformed using the Cholesky decomposition of
the inverse covariance, and the scaled data are returned.
mvscale(
  object,
  center = stats::median,
  scale = robustbase::s_Qn,
  cov = robustbase::covOGK,
  warning = TRUE
)
object |
A vector, matrix, or data frame containing some numerical data. |
center |
A function to compute the center of each numerical variable. Set to NULL if no centering is required. |
scale |
A function to scale each numerical variable. When
|
cov |
A function to compute the covariance matrix. Set to NULL if no rotation required. |
warning |
Should a warning be issued if non-numeric columns are ignored? |
Optionally, the centering and scaling can be done for each variable
separately, so that there is no rotation of the data, by setting cov = NULL.
Non-robust methods can also be used by specifying center = mean,
scale = stats::sd, and cov = stats::cov. Any non-numeric columns are retained
with a warning.
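One way to read "scaled using the Cholesky decomposition of the inverse covariance" is sketched below in base R, using the non-robust mean and sample covariance so that the result is easy to check. This is an assumption about the idea rather than a copy of the `mvscale()` internals, and the names `Xc`, `S`, `U` and `Z` are hypothetical.

```r
# Illustrative sketch only: non-robust centering and rotation in base R.
X <- as.matrix(faithful)

# Center each column at its mean
Xc <- sweep(X, 2, colMeans(X))

# Sample covariance and the Cholesky factor of its inverse:
# U is upper triangular with t(U) %*% U equal to solve(S)
S <- stats::cov(Xc)
U <- chol(solve(S))

# Rotate and scale the centered data
Z <- Xc %*% t(U)

# The transformed variables are uncorrelated with unit variance
round(stats::cov(Z), 6)
```

After this transformation, Euclidean distances between rows of `Z` equal Mahalanobis distances between rows of `X`, which is the reason for rotating before computing kernel density estimates with a single bandwidth.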
A vector, matrix or data frame of the same size and class as object,
but with numerical variables replaced by scaled versions.
Rob J Hyndman
base::scale(), stats::sd(), stats::cov(), robustbase::covOGK(), robustbase::s_Qn()
# Univariate z-scores (no rotation)
z <- mvscale(faithful, center = mean, scale = sd, cov = NULL, warning = FALSE)
# Non-robust scaling with rotation
z <- mvscale(faithful, center = mean, cov = stats::cov, warning = FALSE)
# Robust scaling and rotation
z <- mvscale(faithful, warning = FALSE)
This function computes outlier persistence for a range of significance values, using the algorithm lookout, an outlier detection method that uses leave-one-out kernel density estimates and generalized Pareto distributions to find outliers.
persisting_outliers(
  X,
  alpha = seq(0.01, 0.1, by = 0.01),
  st_qq = 0.9,
  scale = TRUE,
  num_steps = 20,
  old_version = FALSE
)
X |
The input data in a matrix, data.frame, or tibble format. All columns should be numeric. |
alpha |
Grid of significance levels. |
st_qq |
The starting quantile for death radii sequence. This will be used to compute the starting bandwidth value. |
scale |
If |
num_steps |
The length of the bandwidth sequence. |
old_version |
Logical indicator of which version of the algorithm to use. |
A list with the following components:
out |
A 3D array of |
bw |
The set of bandwidth values. |
gpdparas |
The GPD parameters used. |
lookoutbw |
The bandwidth chosen by the algorithm |
X <- rbind(
  data.frame(x = rnorm(500), y = rnorm(500)),
  data.frame(x = rnorm(5, mean = 10, sd = 0.2), y = rnorm(5, mean = 10, sd = 0.2))
)
plot(X, pch = 19)
outliers <- persisting_outliers(X, scale = FALSE)
outliers
autoplot(outliers)