Title: Functions and Data Sets for "That's Weird: Anomaly Detection Using R" by Rob J Hyndman
Description: All functions and data sets required for the examples in the book Hyndman (2024) "That's Weird: Anomaly Detection Using R" <https://OTexts.com/weird/>. All packages needed to run the examples are also loaded.
Authors: Rob Hyndman [aut, cre, cph], RStudio [cph]
Maintainer: Rob Hyndman <[email protected]>
License: GPL-3
Version: 1.0.2.9000
Built: 2024-11-13 07:15:48 UTC
Source: https://github.com/robjhyndman/weird-package
A data set containing career batting statistics for all international test players (men and women) up to 6 October 2021.
cricket_batting
A data frame with 3754 rows and 15 variables:
Player: Player name in the form "initials surname"
Country: Country played for
Start: First year of test playing career
End: Last year of test playing career
Matches: Number of matches played
Innings: Number of innings batted
NotOuts: Number of times not out
Runs: Total runs scored
HighScore: Highest score in an innings
HighScoreNotOut: Was the highest score not out?
Average: Batting average at end of career
Hundreds: Total number of 100s scored
Fifties: Total number of 50s scored
Ducks: Total number of 0s scored
Gender: "Men" or "Women"
Data frame
cricket_batting |>
  filter(Innings > 20) |>
  select(Player, Country, Matches, Runs, Average, Hundreds, Fifties, Ducks) |>
  arrange(desc(Average))
Creates a distributional object using a density specified as a pair of vectors giving (x, f(x)). The density is assumed to be piecewise linear between the points provided, and 0 outside the range of x.
dist_density(x, density)
x: Numerical vector of ordinates, or a list of such vectors.
density: Numerical vector of density values, or a list of such vectors.
dist_density(seq(-4, 4, by = 0.01), dnorm(seq(-4, 4, by = 0.01)))
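Any valid (x, f(x)) grid can be supplied, not only one built from dnorm(). As an illustrative sketch (not from the package documentation), a triangular density on (0, 2):

x <- seq(0, 2, by = 0.01)
dist_density(x, pmax(0, 1 - abs(x - 1)))  # triangular density with peak 1 at x = 1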
Creates a distributional object using a kernel density estimate with a Gaussian kernel obtained from the kde() function. The bandwidth can be specified; otherwise the kde_bandwidth() function is used. The cdf, quantiles and moments are all consistent with the kde. Generating random values from the kde is equivalent to a smoothed bootstrap.
dist_kde(y, h = NULL, H = NULL, multiplier = 1, ...)
y: Numerical vector or matrix of data, or a list of such objects. If a list is provided, then all objects should be of the same dimension: e.g., all vectors, or all matrices with the same number of columns.
h: Bandwidth for univariate distributions. If NULL, the kde_bandwidth() function is used.
H: Bandwidth matrix for multivariate distributions. If NULL, the kde_bandwidth() function is used.
multiplier: Multiplier for the bandwidth passed to kde_bandwidth().
...: Other arguments are passed to kde().
dist_kde(c(rnorm(200), rnorm(100, 5)), multiplier = 2)
dist_kde(cbind(rnorm(200), rnorm(200, 5)))
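Since the result is a distributional object, the usual generics apply to it. A brief sketch (assuming the distributional package is attached, as it is when weird is loaded):

d <- dist_kde(c(rnorm(200), rnorm(100, 5)))
mean(d)            # moments are consistent with the kde
quantile(d, 0.95)  # quantiles from the kde cdf
cdf(d, 3)
generate(d, 5)     # equivalent to a smoothed bootstrap sample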
A data set containing reviews of wines from 44 countries, taken from Wine Enthusiast Magazine during the week of 15 June 2017. The data are downloaded and returned.
fetch_wine_reviews()
A data frame with 110,203 rows and 8 columns:
country: Country of origin
state: State or province of origin
region: Region of origin
winery: Name of vineyard that made the wine
variety: Variety of grape
points: Points allocated by WineEnthusiast reviewer on a scale of 0-100
price: Price of a bottle of wine in $US
year: Year of wine extracted from title
Data frame
## Not run:
wine_reviews <- fetch_wine_reviews()
wine_reviews |>
  ggplot(aes(x = points, y = price)) +
  geom_jitter(height = 0, width = 0.2, alpha = 0.1) +
  scale_y_log10()
## End(Not run)
A data set containing French mortality rates between the years 1816 and 1999, by age and sex.
fr_mortality
A data frame with 31,648 rows and 4 columns.
Data frame
Human Mortality Database https://www.mortality.org
fr_mortality
Produces a bivariate bagplot. A bagplot is analogous to a univariate boxplot, except that it is in two dimensions. Like a boxplot, it shows the median, a region containing 50% of the observations, a larger region containing the remaining observations other than outliers, and any outliers.
gg_bagplot(data, var1, var2, color = "#00659e", scatterplot = FALSE, ...)
data: A data frame or matrix containing the data.
var1: The name of the first variable to plot (a bare expression).
var2: The name of the second variable to plot (a bare expression).
color: The base color to use for the median. Other colors are generated as a mixture of color with white.
scatterplot: A logical argument indicating if a regular bagplot is required (FALSE, the default), or a scatterplot in the same colors (TRUE).
...: Other arguments are passed to the underlying plotting functions.
A ggplot object showing a bagplot or scatterplot of the data.
Rob J Hyndman
Rousseeuw, P. J., Ruts, I., & Tukey, J. W. (1999). The bagplot: A bivariate boxplot. The American Statistician, 53(4), 382–387.
gg_bagplot(n01, v1, v2)
gg_bagplot(n01, v1, v2, scatterplot = TRUE)
Produce a ggplot of densities from distributional objects in one or two dimensions.
gg_density(
  object,
  prob = seq(9) / 10,
  hdr = NULL,
  show_points = FALSE,
  show_mode = FALSE,
  show_anomalies = FALSE,
  colors = c("#0072b2", "#D55E00", "#009E73", "#CC79A7", "#E69F00", "#56B4E9", "#F0E442", "#333333"),
  alpha = NULL,
  jitter = FALSE,
  ngrid = 501
)
object: A distribution object from the distributional package, such as that returned by dist_kde().
prob: Probability of the HDRs to be drawn.
hdr: Character string describing how the HDRs are to be shown. Options are "none", "fill", "points" and "contours" (the latter only for bivariate plots). If NULL, a default is chosen depending on the type of distribution.
show_points: If TRUE, the observations are shown as points.
show_mode: If TRUE, the mode of the distribution is shown as a point.
show_anomalies: If TRUE, surprisal anomalies are shown in black.
colors: Color palette to use. If there are more than length(colors) distributions, the colors are recycled.
alpha: Transparency of points. Ignored if show_points is FALSE.
jitter: For univariate distributions, should the points be vertically jittered to reduce overplotting when show_points = TRUE?
ngrid: Number of grid points to use for the density function.
This function produces a ggplot of a density from a distributional object. For univariate densities, it produces a line plot of the density function, with an optional ribbon showing some highest density regions (HDRs) and/or the observations. For bivariate densities, it produces an HDR contour plot of the density function, with the observations optionally shown as points. The mode can also be drawn as a point. The combination of hdr = "fill", show_points = TRUE, show_mode = TRUE, and prob = c(0.5, 0.99) is equivalent to showing HDR boxplots.
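For instance, the HDR-boxplot-style display just described can be requested directly; a short sketch using only the arguments documented here:

kde <- dist_kde(c(rnorm(500), rnorm(500, 4, 0.5)))
gg_density(kde, hdr = "fill", show_points = TRUE, show_mode = TRUE, prob = c(0.5, 0.99))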
A ggplot object.
Rob J Hyndman
# Univariate densities
kde <- dist_kde(c(rnorm(500), rnorm(500, 4, .5)))
gg_density(kde,
  hdr = "fill", prob = c(0.5, 0.95), color = "#c14b14",
  show_mode = TRUE, show_points = TRUE, jitter = TRUE
)
c(dist_normal(), kde) |>
  gg_density(hdr = "fill", prob = c(0.5, 0.95))
# Bivariate density
tibble(y1 = rnorm(5000), y2 = y1 + rnorm(5000)) |>
  dist_kde() |>
  gg_density(show_points = TRUE, alpha = 0.1, hdr = "fill")
Add a ggplot layer of densities from distributional objects in one or two dimensions.
gg_density_layer(object, scale = 1, ngrid = 501, ...)
object: A distribution object from the distributional package, such as that returned by dist_kde().
scale: Scaling factor for the density function.
ngrid: Number of grid points to use for the density function.
...: Additional arguments are passed to the underlying ggplot2 layer (e.g., the line geom for univariate densities).
This function adds a ggplot layer of a density from a distributional object. For univariate densities, it adds a line plot of the density function. For bivariate densities, it adds a contour plot of the density function.
A ggplot layer
Rob J Hyndman
dist_mixture(
  dist_normal(-2, 1),
  dist_normal(2, 1),
  weights = c(1 / 3, 2 / 3)
) |>
  gg_density() +
  gg_density_layer(dist_normal(-2, 1), linetype = "dashed", scale = 1 / 3) +
  gg_density_layer(dist_normal(2, 1), linetype = "dashed", scale = 2 / 3)
Produces a 1d or 2d boxplot of highest density regions (HDRs). The darker regions contain observations with higher probability, while the lighter regions contain points with lower probability. Observations outside the largest HDR are shown as individual points. Anomalies with leave-one-out surprisal probabilities less than 0.005 are optionally shown in black.
gg_hdrboxplot(
  data,
  var1,
  var2 = NULL,
  prob = c(0.5, 0.99),
  color = "#0072b2",
  show_points = FALSE,
  show_anomalies = TRUE,
  scatterplot = show_points,
  alpha = NULL,
  jitter = TRUE,
  ngrid = 501,
  ...
)
data: A data frame or matrix containing the data.
var1: The name of the first variable to plot (a bare expression).
var2: Optionally, the name of the second variable to plot (a bare expression).
prob: A numeric vector specifying the coverage probabilities for the HDRs.
color: The base color to use for the mode. Colors for the HDRs are generated by whitening this color.
show_points: A logical argument indicating if a regular HDR plot is required (FALSE, the default), or a scatterplot in the same colors (TRUE).
show_anomalies: A logical argument indicating if the surprisal anomalies should be shown (in black). These are points with leave-one-out surprisal probability values less than 0.005, and which lie outside the 99% HDR region.
scatterplot: Equivalent to show_points.
alpha: Transparency of points. Ignored if show_points is FALSE.
jitter: A logical value indicating if the points should be vertically jittered for the 1d boxplots to reduce overplotting.
ngrid: Number of grid points to use for the density function.
...: Other arguments passed to dist_kde().
The original HDR boxplot proposed by Hyndman (1996) can be produced with show_anomalies = FALSE, jitter = FALSE, alpha = 1, and all other arguments set to their defaults.
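For example, the following sketch reproduces the 1996-style plot for the df object used in the examples below:

df <- data.frame(x = c(rnorm(1000), rnorm(1000, 5, 1), 10))
gg_hdrboxplot(df, x, show_anomalies = FALSE, jitter = FALSE, alpha = 1)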
A ggplot object showing an HDR plot or scatterplot of the data.
Rob J Hyndman
Hyndman, R J (1996) Computing and Graphing Highest Density Regions, The American Statistician, 50(2), 120–126. https://robjhyndman.com/publications/hdr/
df <- data.frame(x = c(rnorm(1000), rnorm(1000, 5, 1), 10))
gg_hdrboxplot(df, x, show_anomalies = TRUE)
cricket_batting |>
  filter(Innings > 20) |>
  gg_hdrboxplot(Average)
oldfaithful |>
  filter(duration < 7000, waiting < 7000) |>
  gg_hdrboxplot(duration, waiting, show_points = TRUE)
Compute the Global-Local Outlier Score from Hierarchies (GLOSH). This is based on hierarchical clustering, where the minimum cluster size is k. The resulting outlier score is a measure of how anomalous each observation is. The function uses dbscan::hdbscan() to do the calculation.
glosh_scores(y, k = 10, ...)
y: Numerical matrix or vector of data.
k: Minimum cluster size. Default: 10.
...: Additional arguments are passed to dbscan::hdbscan().
Numerical vector containing GLOSH values
Rob J Hyndman
See also: dbscan::glosh()
y <- c(rnorm(49), 5)
glosh_scores(y)
Grubbs' test (proposed in 1950) identifies possible anomalies in univariate data using z-scores assuming the data come from a normal distribution. Dixon's test (also from 1950) compares the difference in the largest two values to the range of the data. Critical values for Dixon's test have been computed using simulation with interpolation using a quadratic model on logit(alpha) and log(log(n)).
grubbs_anomalies(y, alpha = 0.05)
dixon_anomalies(y, alpha = 0.05, two_sided = TRUE)
y: Numerical vector of observations.
alpha: Size of the test.
two_sided: If TRUE, both the minimum and the maximum are considered as potential outliers; otherwise only the maximum.
Grubbs' test is based on z-scores, and a point is identified as an anomaly when the associated absolute z-score is greater than a threshold value. A vector of logical values is returned, where TRUE indicates an anomaly. This version of Grubbs' test looks for outliers anywhere in the sample. Grubbs' original test came in several variations, which looked for one outlier, two outliers in one tail, or two outliers on opposite tails. These variations are implemented in the outliers::grubbs.test() function. Dixon's test only considers the maximum (and possibly the minimum) as potential outliers.
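To make the z-score idea concrete, here is a minimal sketch; the threshold of 3 is a placeholder for illustration, not the alpha- and n-dependent critical value that grubbs_anomalies() computes:

y <- c(rnorm(100), 6)
z <- abs(y - mean(y)) / sd(y)  # absolute z-scores
which(z > 3)                   # flag points with extreme absolute z-scores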
A logical vector
Rob J Hyndman
Grubbs, F. E. (1950). Sample criteria for testing outlying observations. Annals of Mathematical Statistics, 21(1), 27–58.
Dixon, W. J. (1950). Analysis of extreme values. Annals of Mathematical Statistics, 21(4), 488–506.
x <- c(rnorm(1000), 5:10)
tibble(x = x) |> filter(grubbs_anomalies(x))
tibble(x = x) |> filter(dixon_anomalies(x))
y <- c(rnorm(1000), 5)
tibble(y = y) |> filter(grubbs_anomalies(y))
tibble(y = y) |> filter(dixon_anomalies(y))
The Hampel filter is designed to find anomalies in time series data using median absolute deviations in the vicinity of each observation.
hampel_anomalies(y, bandwidth, k = 3)
y: Numeric vector containing a time series.
bandwidth: Integer width of the window around each observation.
k: Number of standard deviations beyond which an observation is declared an anomaly.
First, a moving median is calculated using windows of size 2 * bandwidth + 1. Then the median absolute deviations (MADs) from this moving median are calculated in the same moving windows. A point is declared an anomaly if its deviation from the moving median is more than k standard deviations, where the MAD is converted to a standard deviation using MAD * 1.482602 (which holds for normally distributed data). The first bandwidth and last bandwidth observations cannot be declared anomalies.
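A minimal base-R sketch of this logic, as an illustration of the description above rather than the package's implementation:

hampel_sketch <- function(y, bandwidth, k = 3) {
  n <- length(y)
  out <- rep(FALSE, n)
  for (i in (bandwidth + 1):(n - bandwidth)) {
    window <- y[(i - bandwidth):(i + bandwidth)]  # width 2 * bandwidth + 1
    m <- median(window)
    s <- 1.482602 * median(abs(window - m))  # MAD rescaled to a standard deviation
    out[i] <- abs(y[i] - m) > k * s
  }
  out  # first and last `bandwidth` observations remain FALSE
}
which(hampel_sketch(c(rnorm(20), 5, rnorm(20)), bandwidth = 3, k = 4))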
A logical vector identifying which observations are anomalies.
Rob J Hyndman
set.seed(1)
df <- tibble(
  time = seq(41),
  y = c(rnorm(20), 5, rnorm(20))
) |>
  mutate(hampel = hampel_anomalies(y, bandwidth = 3, k = 4))
df |>
  ggplot(aes(x = time, y = y)) +
  geom_line() +
  geom_point(data = df |> filter(hampel), col = "red")
Compute a table of highest density regions (HDRs) for a distributional object. The HDRs are returned as a tibble with one row per interval and columns: prob (the probability coverage) and density (the value of the density at the boundary of the HDR). For one-dimensional density functions, the tibble also has columns lower (the lower ends of the intervals) and upper (the upper ends of the intervals).
hdr_table(object, prob)
object: Distributional object such as that returned by dist_kde().
prob: Vector of probabilities giving the HDR coverage (between 0 and 1).
A tibble
Rob J Hyndman
# Univariate HDRs
c(dist_normal(), dist_kde(c(rnorm(100), rnorm(100, 3, 1)))) |>
  hdr_table(c(0.5, 0.95))
dist_kde(oldfaithful$duration) |> hdr_table(0.95)
# Bivariate HDRs
dist_kde(oldfaithful[, c("duration", "waiting")]) |> hdr_table(0.90)
Robust bandwidth estimation for kernel density estimation
kde_bandwidth(data, multiplier = 1)
data: A numeric matrix or data frame.
multiplier: Bandwidths are chosen using a robust version of the normal reference rule, multiplied by this constant. The default is 1.
A matrix of bandwidths (or scalar in the case of univariate data).
Rob J Hyndman
# Univariate bandwidth calculation
kde_bandwidth(oldfaithful$duration)
# Bivariate bandwidth calculation
kde_bandwidth(oldfaithful[, 2:3])
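For comparison, here is a sketch of the classical (non-robust) univariate normal reference rule; the constant 1.06 is the textbook Silverman value, and the robust details of kde_bandwidth() will differ:

nrr <- function(y) 1.06 * sd(y) * length(y)^(-1 / 5)  # classical normal reference rule
nrr(oldfaithful$duration)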
Compute local outlier factors using k nearest neighbours. A local outlier factor is a measure of how anomalous each observation is, based on the density of neighbouring points. The function uses dbscan::lof() to do the calculation.
lof_scores(y, k = 10, ...)
y: Numerical matrix or vector of data.
k: Number of neighbours to include. Default: 10.
...: Additional arguments are passed to dbscan::lof().
Numerical vector containing LOF values
Rob J Hyndman
See also: dbscan::lof()
y <- c(rnorm(49), 5)
lof_scores(y)
The lookout algorithm (Kandanaarachchi & Hyndman, 2022) computes leave-one-out surprisal probabilities from a kernel density estimate using a Generalized Pareto distribution. The kernel density estimate uses a bandwidth based on topological data analysis and a quadratic kernel, so it is similar but not identical to using surprisals() with loo = TRUE and approximation = "gpd". A low probability indicates a likely anomaly.
lookout_prob(object, ...)
object: A numerical data set.
...: Other arguments are passed to lookout::lookout().
A numerical vector containing the lookout probabilities
Rob J Hyndman
Sevvandi Kandanaarachchi & Rob J Hyndman (2022). Leave-one-out kernel density estimates for outlier detection. Journal of Computational and Graphical Statistics, 31(2), 586–599. https://robjhyndman.com/publications/lookout/
# Univariate data
tibble(
  y = c(5, rnorm(49)),
  lookout = lookout_prob(y)
)
# Bivariate data
tibble(
  x = rnorm(50),
  y = c(5, rnorm(49)),
  lookout = lookout_prob(cbind(x, y))
)
# Using a regression model
of <- oldfaithful |> filter(duration < 7200, waiting < 7200)
fit_of <- lm(waiting ~ duration, data = of)
broom::augment(fit_of) |>
  mutate(lookout = lookout_prob(.std.resid)) |>
  arrange(lookout)
A multivariate version of base::scale() that takes account of the covariance matrix of the data, and uses robust estimates of center, scale and covariance by default. The centers are removed using medians, the scales are computed using the robust Qn estimate, and the covariance matrix is estimated using a robust OGK estimate. The data are scaled using the Cholesky decomposition of the inverse covariance, and the scaled data are returned. This is useful for computing pairwise Mahalanobis distances.
mvscale(
  object,
  center = stats::median,
  scale = robustbase::s_Qn,
  cov = robustbase::covOGK,
  warning = TRUE
)
object: A vector, matrix, or data frame containing some numerical data.
center: A function to compute the center of each numerical variable. Set to NULL if no centering is required.
scale: A function to scale each numerical variable. When cov = NULL, it is used to scale each variable separately.
cov: A function to compute the covariance matrix. Set to NULL if no rotation is required.
warning: Should a warning be issued if non-numeric columns are ignored?
Optionally, the centering and scaling can be done for each variable separately, so there is no rotation of the data, by setting cov = NULL. Also optionally, non-robust methods can be used by specifying center = mean, scale = stats::sd, and cov = stats::cov. Any non-numeric columns are retained with a warning.
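A small non-robust sketch of why the Cholesky step is useful: after scaling by the Cholesky factor of the inverse covariance, Euclidean distances equal Mahalanobis distances. This illustration uses stats::cov and the mean rather than the robust defaults:

X <- matrix(rnorm(200), ncol = 2)
X[, 2] <- X[, 1] + 0.5 * X[, 2]  # induce correlation
Xc <- sweep(X, 2, colMeans(X))   # center each column
U <- chol(solve(cov(X)))         # Cholesky factor of the inverse covariance
Z <- Xc %*% t(U)                 # rotated and scaled data
# Squared Euclidean norms of the scaled rows equal squared Mahalanobis distances:
all.equal(rowSums(Z^2), mahalanobis(X, colMeans(X), cov(X)))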
A vector, matrix or data frame of the same size and class as object, but with numerical variables replaced by scaled versions.
Rob J Hyndman
# Univariate z-scores (no rotation)
mvscale(oldfaithful, center = mean, scale = sd, cov = NULL, warning = FALSE)
# Non-robust scaling with rotation
mvscale(oldfaithful, center = mean, cov = stats::cov, warning = FALSE)
mvscale(oldfaithful, warning = FALSE)
# Robust Mahalanobis distances
oldfaithful |>
  select(-time) |>
  mvscale() |>
  head(5) |>
  dist()
A synthetic data set containing 1000 observations on 10 variables generated from independent standard normal distributions.
n01
A data frame with 1000 rows and 10 columns.
Data frame
n01
A data set containing data on recorded eruptions of the Old Faithful Geyser in Yellowstone National Park, Wyoming, USA, from 1 January 2015 to 1 October 2021. Recordings are incomplete, especially during the winter months when observers may not be present.
oldfaithful
A data frame with 2261 rows and 3 columns:
time: Time eruption started
duration: Duration of eruption in seconds
waiting: Time to the following eruption
Data frame
oldfaithful |>
  filter(duration < 7000, waiting < 7000) |>
  ggplot(aes(x = duration, y = waiting)) +
  geom_point()
Peirce's criterion and Chauvenet's criterion were both proposed in the 1800s as ways of determining which observations in a univariate sample should be rejected.
peirce_anomalies(y)
chauvenet_anomalies(y)
y: Numerical vector of observations.
These functions take a univariate sample y and return a logical vector indicating which observations should be considered anomalies according to either Peirce's criterion or Chauvenet's criterion.
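As an illustration of Chauvenet's criterion (assuming normality): a point is rejected when the expected number of observations at least as extreme, n * P(|Z| >= |z|), is below 0.5. A minimal sketch, not the package's implementation:

chauvenet_sketch <- function(y) {
  z <- abs(y - mean(y)) / sd(y)
  length(y) * 2 * pnorm(-z) < 0.5  # expected count of equally extreme points
}
y <- c(rnorm(100), 6)
which(chauvenet_sketch(y))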
A logical vector
Rob J Hyndman
Peirce, B. (1852). Criterion for the rejection of doubtful observations. The Astronomical Journal, 2(21), 161–163.
Chauvenet, W. (1863). Method of least squares. Appendix to Manual of Spherical and Practical Astronomy, Vol. 2, Lippincott, Philadelphia, pp. 469–566.
y <- rnorm(1000)
tibble(y = y) |> filter(peirce_anomalies(y))
tibble(y = y) |> filter(chauvenet_anomalies(y))
Test if observations are anomalies according to the stray algorithm.
stray_anomalies(y, ...)
y: A vector, matrix, or data frame consisting of numerical variables.
...: Other arguments are passed to stray::find_HDoutliers().
A logical vector indicating whether each observation is identified as an anomaly by the stray algorithm.
Rob J Hyndman
# Univariate data
y <- c(6, rnorm(49))
stray_anomalies(y)
# Bivariate data
y <- cbind(rnorm(50), c(5, rnorm(49)))
stray_anomalies(y)
Compute stray scores indicating how anomalous each observation is.
stray_scores(y, ...)
y: A vector, matrix, or data frame consisting of numerical variables.
...: Other arguments are passed to stray::find_HDoutliers().
Numerical vector containing stray scores.
Rob J Hyndman
# Univariate data
y <- c(6, rnorm(49))
scores <- stray_scores(y)
threshold <- stray::find_threshold(scores, alpha = 0.01, outtail = "max", p = 0.5, tn = 50)
which(scores > threshold)
Compute surprisals or surprisal probabilities from a model or a data set. A surprisal is given by s = -log f(y), where f is the density or probability mass function of the estimated or assumed distribution, and y is an observation. A surprisal probability is the probability of obtaining a surprisal at least as extreme as s.
The surprisal probabilities may be computed in three different ways:
1. Using the same distribution that was used to compute the surprisal values (approximation = "none"). Under this option, surprisal probabilities are equal to 1 minus the coverage probability of the largest HDR that contains each value. Surprisal probabilities smaller than 1e-6 are returned as 1e-6.
2. Using a Generalized Pareto distribution (GPD) fitted to the most extreme surprisal values (those with probability less than threshold_probability). This option is used if approximation = "gpd". For surprisal probabilities greater than threshold_probability, the value of threshold_probability is returned. Under this option, the distribution is used for computing the surprisal values but not for determining their probabilities. Due to extreme value theory, the resulting probabilities should be relatively insensitive to the distribution used in computing the surprisal values.
3. Empirically, as the proportion of observations with greater surprisal values. This option is used when approximation = "empirical". It is also insensitive to the distribution used in computing the surprisal values.
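To make the definition concrete, here is a minimal sketch of surprisals and their empirical probabilities under an assumed N(0, 1) distribution; this illustrates the formula, not the package's code:

y <- c(rnorm(99), 4)
s <- -log(dnorm(y))  # surprisal values under N(0, 1)
# Empirical surprisal probabilities: proportion at least as extreme
p_emp <- sapply(s, function(si) mean(s >= si))
head(cbind(y, s, p_emp)[order(-s), ])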
surprisals(
  object,
  probability = TRUE,
  approximation = c("none", "gpd", "empirical"),
  threshold_probability = 0.1,
  ...
)
## Default S3 method:
surprisals(
  object,
  probability = TRUE,
  approximation = c("none", "gpd", "empirical"),
  threshold_probability = 0.1,
  distribution = dist_kde(object, multiplier = 2, ...),
  loo = FALSE,
  ...
)
object: A model or numerical data set.
probability: Should surprisal probabilities be computed, or the surprisal values?
approximation: Character string specifying the approximation to use in computing the surprisal probabilities. Ignored if probability = FALSE.
threshold_probability: Probability threshold when computing the GPD approximation. This is the probability below which the GPD is fitted. Only used if approximation = "gpd".
...: Other arguments are passed to the appropriate method.
distribution: A distribution object. If not provided, a kernel density estimate is computed from the data.
loo: Should leave-one-out surprisals be computed?
If no distribution is provided, a kernel density estimate is computed. The leave-one-out (LOO) surprisals are obtained by computing the kernel density estimate at each observation using all other observations.
A numerical vector containing the surprisals or surprisal probabilities.
Rob J Hyndman
# Surprisals computed from a bivariate data set
oldfaithful |>
  filter(duration < 7000, waiting < 7000) |>
  mutate(
    loo_fscores = surprisals(cbind(duration, waiting), loo = TRUE)
  )
# Univariate data
tibble(
  y = c(5, rnorm(49)),
  p_kde = surprisals(y, loo = TRUE),
  p_normal = surprisals(y, distribution = dist_normal()),
  p_zscore = 2 * (1 - pnorm(abs(y)))
)
tibble(
  y = n01$v1,
  prob1 = surprisals(y, loo = TRUE),
  prob2 = surprisals(y, approximation = "gpd"),
  prob3 = surprisals(y, distribution = dist_normal()),
  prob4 = surprisals(y, distribution = dist_normal(), approximation = "gpd")
) |>
  arrange(prob1)
# Bivariate data
tibble(
  x = rnorm(50),
  y = c(5, rnorm(49)),
  prob = surprisals(cbind(x, y)),
  lookout = lookout_prob(cbind(x, y))
)
Surprisals computed from a model
## S3 method for class 'lm':
surprisals(
  object,
  probability = TRUE,
  approximation = c("none", "gpd", "empirical"),
  threshold_probability = 0.1,
  loo = FALSE,
  ...
)
## S3 method for class 'gam':
surprisals(
  object,
  probability = TRUE,
  approximation = c("none", "gpd", "empirical"),
  threshold_probability = 0.1,
  loo = FALSE,
  ...
)
object: A model object, such as one returned by lm() or mgcv::gam().
probability: Should surprisal probabilities be computed, or the surprisal values?
approximation: Character string specifying the approximation to use in computing the surprisal probabilities. Ignored if probability = FALSE.
threshold_probability: Probability threshold when computing the GPD approximation. This is the probability below which the GPD is fitted. Only used if approximation = "gpd".
loo: Should leave-one-out surprisals be computed?
...: Other arguments are ignored.
# Surprisals computed from a linear model
of <- oldfaithful |> filter(duration < 7200, waiting < 7200)
lm_of <- lm(waiting ~ duration, data = of)
of |>
  mutate(
    fscore = surprisals(lm_of),
    loo_fscore = surprisals(lm_of, loo = TRUE),
    # lookout_prob = lookout(surprisals = fscore, loo_scores = loo_fscore)
  ) |>
  ggplot(aes(
    x = duration, y = waiting,
    color = loo_fscore > quantile(loo_fscore, 0.99)
  )) +
  geom_point()
# Surprisals computed from a GAM
of <- oldfaithful |> filter(duration > 1, duration < 7200, waiting < 7200)
gam_of <- mgcv::gam(waiting ~ s(duration), data = of)
of |> mutate(fscore = surprisals(gam_of))
This function lists all the conflicts between packages in the weird collection and other packages that you have loaded.
weird_conflicts()
Some conflicts are deliberately ignored: intersect, union, setequal, and setdiff from dplyr; and intersect, union, setdiff, and as.difftime from lubridate. These functions make the base equivalents generic, so they shouldn't negatively affect any existing code.
A list object of class weird_conflicts.
weird_conflicts()
List all packages loaded by weird
weird_packages(include_self = FALSE)
include_self: Include weird in the list?
A character vector of package names.
weird_packages()