| Title: | Functions and Data Sets for "That's Weird: Anomaly Detection Using R" by Rob J Hyndman |
|---|---|
| Description: | All functions and data sets required for the examples in the book Hyndman (2026) "That's Weird: Anomaly Detection Using R" <https://OTexts.com/weird/>. All packages needed to run the examples are also loaded. |
| Authors: | Rob Hyndman [aut, cre, cph] (ORCID: <https://orcid.org/0000-0002-2140-5352>), RStudio [cph] |
| Maintainer: | Rob Hyndman <[email protected]> |
| License: | GPL-3 |
| Version: | 2.1.0.9000 |
| Built: | 2026-06-02 10:17:41 UTC |
| Source: | https://github.com/robjhyndman/weird |
A dataset containing career batting statistics for all international test players (men and women) up to 6 October 2025.
cricket_battingcricket_batting
A data frame with 3968 rows and 15 variables:
Player name in form of "initials surname"
Country played for
First year of test playing career
Last year of test playing career
Number of matches played
Number of innings batted
Number of times not out
Total runs scored
Highest score in an innings
Was highest score not out?
Batting average at end of career
Total number of 100s scored
Total number of 50s scored
Total number of 0s scored
"Men" or "Women"
Data frame
Hyndman, R J (2026) "That's weird: Anomaly detection using R", Section 1.4, https://OTexts.com/weird/.
cricket_batting |> filter(Innings > 20) |> select(Player, Country, Matches, Runs, Average, Hundreds, Fifties, Ducks) |> arrange(desc(Average))cricket_batting |> filter(Innings > 20) |> select(Player, Country, Matches, Runs, Average, Hundreds, Fifties, Ducks) |> arrange(desc(Average))
Make a long-form data frame containing densities from a distributional object on a regular grid for plotting.
density_df(object, ngrid = NULL)density_df(object, ngrid = NULL)
object |
A distributional object |
ngrid |
Number of grid points in each dimension. Defaults to 501 for univariate distributions and 101 for bivariate distributions. |
Data frame with columns x, y (if bivariate), density, and distribution.
dist_kde(oldfaithful$duration) |> density_df()dist_kde(oldfaithful$duration) |> density_df()
Creates a distributional object using a density specified as pair of vectors giving (x, f(x)). The density is assumed to be piecewise linear between the points provided, and 0 outside the range of x.
dist_density(x, density)dist_density(x, density)
x |
Numerical vector of ordinates, or a list of such vectors. |
density |
Numerical vector of density values, or a list of such vectors. |
A distributional object of class dist_density.
dist_density(seq(-4, 4, by = 0.01), dnorm(seq(-4, 4, by = 0.01)))dist_density(seq(-4, 4, by = 0.01), dnorm(seq(-4, 4, by = 0.01)))
Creates a distributional object using a kernel density estimate with a
Gaussian kernel obtained from the kde() function. The bandwidth
can be specified; otherwise the kde_bandwidth() function is used.
The cdf, quantiles and moments are consistent with the kde. Generating
random values from the kde is equivalent to a smoothed bootstrap.
dist_kde( y, h = NULL, H = NULL, method = c("robust", "normal", "plugin", "lookout"), ... )dist_kde( y, h = NULL, H = NULL, method = c("robust", "normal", "plugin", "lookout"), ... )
y |
Numerical vector or matrix of data, or a list of such objects. If a list is provided, then all objects should be of the same dimension. e.g., all vectors, or all matrices with the same number of columns. |
h |
Bandwidth for univariate distribution. Ignored if |
H |
Bandwidth matrix for multivariate distribution. If |
method |
The method of bandwidth estimation to use. See |
... |
Other arguments are passed to |
A distributional object of class dist_kde.
Hyndman, R J (2026) "That's weird: Anomaly detection using R", Section 2.7 and 3.9, https://OTexts.com/weird/.
dist_kde(c(rnorm(200), rnorm(100, 5))) dist_kde(cbind(rnorm(200), rnorm(200, 5)))dist_kde(c(rnorm(200), rnorm(100, 5))) dist_kde(cbind(rnorm(200), rnorm(200, 5)))
Convert Gaussian mixture model to a distributional object
dist_mclust(object)dist_mclust(object)
object |
Object of class |
An object of class distributional
library(mclust) # Univariate mixture density gmm <- Mclust(oldfaithful$duration) |> dist_mclust() gg_density(gmm) + geom_jitter(data = oldfaithful, aes(x=duration, y = -0.0002), width=0, height=0.0002, alpha = 0.1) + labs(x = "Eruption duration", y = "")library(mclust) # Univariate mixture density gmm <- Mclust(oldfaithful$duration) |> dist_mclust() gg_density(gmm) + geom_jitter(data = oldfaithful, aes(x=duration, y = -0.0002), width=0, height=0.0002, alpha = 0.1) + labs(x = "Eruption duration", y = "")
Hourly air quality measurements from 12 monitoring stations across Beijing, China, from 1 March 2013 to 28 February 2017. The data are downloaded and returned.
fetch_air_quality()fetch_air_quality()
A data frame with 420,768 rows and 17 columns:
Name of the monitoring station
Year of measurement
Month of measurement
Day of measurement
Hour of measurement (0–23)
Particulate matter with diameter less than 2.5 micrometers (micrograms per cubic meter)
Particulate matter with diameter less than 10 micrometers (micrograms per cubic meter)
Sulfur dioxide concentration (micrograms per cubic meter)
Nitrogen dioxide concentration (micrograms per cubic meter)
Carbon monoxide concentration (micrograms per cubic meter)
Ozone concentration (micrograms per cubic meter)
Temperature (degrees Celsius)
Atmospheric pressure (hPa)
Dew point temperature (degrees Celsius)
Rainfall (millimeters)
Wind direction
Wind speed (meters per second)
Data frame
Chen, S. (2017). Beijing Multi-Site Air Quality Dataset. UCI Machine Learning Repository. https://doi.org/10.24432/C5RK5G
Hyndman, R J (2026) That's weird: Anomaly detection using R, https://OTexts.com/weird/.
## Not run: air_quality <- fetch_air_quality() air_quality |> filter(station == "Aotizhongxin") |> ggplot(aes(x = temperature, y = pm2_5)) + geom_point(alpha = 0.1) ## End(Not run)## Not run: air_quality <- fetch_air_quality() air_quality |> filter(station == "Aotizhongxin") |> ggplot(aes(x = temperature, y = pm2_5)) + geom_point(alpha = 0.1) ## End(Not run)
A data set containing data on wines from 44 countries, taken from Wine Enthusiast Magazine during the week of 15 June 2017. The data are downloaded and returned.
fetch_wine_reviews()fetch_wine_reviews()
A data frame with 110,203 rows and 8 columns:
Country of origin
State or province of origin
Region of origin
Name of vineyard that made the wine
Variety of grape
Year of wine
Points allocated by WineEnthusiast reviewer on a scale of 0-100
Price of a bottle of wine in $US
Data frame
https://www.kaggle.com/datasets/zynicide/wine-reviews
Hyndman, R J (2026) "That's weird: Anomaly detection using R", Section 1.4, https://OTexts.com/weird/.
## Not run: wine_reviews <- fetch_wine_reviews() wine_reviews |> ggplot(aes(x = points, y = price)) + geom_jitter(height = 0, width = 0.2, alpha = 0.1) + scale_y_log10() ## End(Not run)## Not run: wine_reviews <- fetch_wine_reviews() wine_reviews |> ggplot(aes(x = points, y = price)) + geom_jitter(height = 0, width = 0.2, alpha = 0.1) + scale_y_log10() ## End(Not run)
A data set containing French mortality rates between the years 1816 and 1999, by age and sex.
fr_mortalityfr_mortality
A data frame with 31648 rows and 4 columns.
Data frame
Human Mortality Database https://www.mortality.org
Hyndman, R J (2026) "That's weird: Anomaly detection using R", Section 1.4, https://OTexts.com/weird/.
fr_mortalityfr_mortality
Produces a bivariate bagplot. A bagplot is analogous to a univariate boxplot, except it is in two dimensions. Like a boxplot, it shows the median, a region containing 50% of the observations, a region showing the remaining observations other than outliers, and any outliers.
gg_bagplot(data, var1, var2, color = "#00659e", show_points = FALSE, ...)gg_bagplot(data, var1, var2, color = "#00659e", show_points = FALSE, ...)
data |
A data frame or matrix containing the data. |
var1 |
The name of the first variable to plot (a bare expression). |
var2 |
The name of the second variable to plot (a bare expression). |
color |
The base color to use for the median. Other colors are generated
as a mixture of |
show_points |
A logical argument indicating if a regular bagplot is required
( |
... |
Other arguments are passed to the |
A ggplot object showing a bagplot or scatterplot of the data.
Rob J Hyndman
Rousseeuw, P J, Ruts, I, & Tukey, J W (1999). The bagplot: A bivariate boxplot. The American Statistician, 52(4), 382–387.
Hyndman, R J (2026) "That's weird: Anomaly detection using R", Section 5.6, https://OTexts.com/weird/.
gg_bagplot(n01, v1, v2) gg_bagplot(n01, v1, v2, show_points = TRUE)gg_bagplot(n01, v1, v2) gg_bagplot(n01, v1, v2, show_points = TRUE)
Produce ggplot of densities from distributional objects in 1 or 2 dimensions
gg_density( object, prob = seq(9)/10, hdr = NULL, show_points = FALSE, show_mode = FALSE, show_anomalies = FALSE, colors = c("#0072b2", "#D55E00", "#009E73", "#CC79A7", "#E69F00", "#56B4E9", "#F0E442", "#333333"), alpha = NULL, jitter = FALSE, ngrid = NULL )gg_density( object, prob = seq(9)/10, hdr = NULL, show_points = FALSE, show_mode = FALSE, show_anomalies = FALSE, colors = c("#0072b2", "#D55E00", "#009E73", "#CC79A7", "#E69F00", "#56B4E9", "#F0E442", "#333333"), alpha = NULL, jitter = FALSE, ngrid = NULL )
object |
distribution object from the distributional package or
|
prob |
Probability of the HDRs to be drawn. |
hdr |
Character string describing how the HDRs are to be shown. Options
are "none", "fill", "points" and "contours" (the latter only for bivariate plots).
If |
show_points |
If |
show_mode |
If |
show_anomalies |
If |
colors |
Color palette to use. If there are more than
|
alpha |
Transparency of points. Ignored if |
jitter |
For univariate distributions, when |
ngrid |
Number of grid points in each dimension, passed to
|
This function produces a ggplot of a density from a distributional object.
For univariate densities, it produces a line plot of the density function, with
an optional ribbon showing some highest density regions (HDRs) and/or the observations.
For bivariate densities, it produces an HDR contour plot of the density function, with
the observations optionally shown as points.
The mode can also be drawn as a point.
The combination of hdr = "fill", show_points = TRUE,
show_mode = TRUE, and prob = c(0.5, 0.99) is equivalent to showing
HDR boxplots.
A ggplot object.
Rob J Hyndman
# Univariate densities kde <- dist_kde(c(rnorm(500), rnorm(500, 4, 0.5))) gg_density(kde, hdr = "fill", prob = c(0.5, 0.95), color = "#c14b14", show_mode = TRUE, show_points = TRUE, jitter = TRUE ) c(dist_normal(), kde) |> gg_density(hdr = "fill", prob = c(0.5, 0.95)) # Bivariate density tibble(y1 = rnorm(5000), y2 = y1 + rnorm(5000)) |> dist_kde() |> gg_density(show_points = TRUE, alpha = 0.1, hdr = "fill")# Univariate densities kde <- dist_kde(c(rnorm(500), rnorm(500, 4, 0.5))) gg_density(kde, hdr = "fill", prob = c(0.5, 0.95), color = "#c14b14", show_mode = TRUE, show_points = TRUE, jitter = TRUE ) c(dist_normal(), kde) |> gg_density(hdr = "fill", prob = c(0.5, 0.95)) # Bivariate density tibble(y1 = rnorm(5000), y2 = y1 + rnorm(5000)) |> dist_kde() |> gg_density(show_points = TRUE, alpha = 0.1, hdr = "fill")
Produces a 1d or 2d box plot of HDR regions. The darker regions contain observations with higher probability, while the lighter regions contain points with lower probability. Observations outside the largest HDR are shown as individual points. Anomalies with leave-one-out surprisal probabilities less than 0.005 are optionally shown in black.
gg_hdrboxplot( data, var1, var2 = NULL, prob = c(0.5, 0.99), color = "#0072b2", show_points = FALSE, show_anomalies = TRUE, alpha = NULL, jitter = TRUE, ... )gg_hdrboxplot( data, var1, var2 = NULL, prob = c(0.5, 0.99), color = "#0072b2", show_points = FALSE, show_anomalies = TRUE, alpha = NULL, jitter = TRUE, ... )
data |
A data frame or matrix containing the data. |
var1 |
The name of the first variable to plot (a bare expression). |
var2 |
Optionally, the name of the second variable to plot (a bare expression). |
prob |
A numeric vector specifying the coverage probabilities for the HDRs. |
color |
The base color to use for the mode. Colors for the HDRs are generated by whitening this color. |
show_points |
A logical argument indicating if a regular HDR plot is required
( |
show_anomalies |
A logical argument indicating if the surprisal anomalies should be shown (in black). These are points with leave-one-out surprisal probability values less than 0.005 (using a GPD approximation), and which lie outside the 99% HDR region. |
alpha |
Transparency of points. Ignored if |
jitter |
A logical value indicating if the points should be vertically jittered for the 1d box plots to reduce overplotting. |
... |
Other arguments passed to |
The original HDR boxplot proposed by Hyndman (1996), can be produced
with show_anomalies = FALSE, jitter = FALSE, alpha = 1, and all other
arguments set to their defaults.
A ggplot object showing an HDR plot or scatterplot of the data.
Rob J Hyndman
Hyndman, R J (1996) Computing and Graphing Highest Density Regions, The American Statistician, 50(2), 120–126. https://robjhyndman.com/publications/hdr/
Hyndman, R J (2026) "That's weird: Anomaly detection using R", Section 5.7, https://OTexts.com/weird/.
df <- data.frame(x = c(rnorm(1000), rnorm(1000, 5, 1), 10)) gg_hdrboxplot(df, x, show_anomalies = TRUE) cricket_batting |> filter(Innings > 20) |> gg_hdrboxplot(Average) oldfaithful |> gg_hdrboxplot(duration, waiting, show_points = TRUE)df <- data.frame(x = c(rnorm(1000), rnorm(1000, 5, 1), 10)) gg_hdrboxplot(df, x, show_anomalies = TRUE) cricket_batting |> filter(Innings > 20) |> gg_hdrboxplot(Average) oldfaithful |> gg_hdrboxplot(duration, waiting, show_points = TRUE)
Compute Global-Local Outlier Score from Hierarchies. This is based
on hierarchical clustering where the minimum cluster size is k. The resulting
outlier score is a measure of how anomalous each observation is.
The function uses dbscan::hdbscan to do the calculation.
glosh_scores(y, k = 10, ...)glosh_scores(y, k = 10, ...)
y |
Numerical matrix or vector of data |
k |
Minimum cluster size. Default: 5. |
... |
Additional arguments passed to |
Numerical vector containing GLOSH values
Rob J Hyndman
dbscan::glosh
y <- c(rnorm(49), 5) glosh_scores(y)y <- c(rnorm(49), 5) glosh_scores(y)
Grubbs' test (proposed in 1950) identifies possible anomalies in univariate data using z-scores assuming the data come from a normal distribution. Dixon's test (also from 1950) compares the difference in the largest two values to the range of the data. Critical values for Dixon's test have been computed using simulation with interpolation using a quadratic model on logit(alpha) and log(log(n)).
grubbs_anomalies(y, alpha = 0.05) dixon_anomalies(y, alpha = 0.05, two_sided = TRUE)grubbs_anomalies(y, alpha = 0.05) dixon_anomalies(y, alpha = 0.05, two_sided = TRUE)
y |
numerical vector of observations |
alpha |
size of the test. |
two_sided |
If |
Grubbs' test is based on z-scores, and a point is identified as an
anomaly when the associated absolute z-score is greater than a threshold value.
A vector of logical values is returned, where TRUE indicates an anomaly.
This version of Grubbs' test looks for outliers anywhere in the sample.
Grubbs' original test came in several variations which looked for one outlier,
or two outliers in one tail, or two outliers on opposite tails. These variations
are implemented in the grubbs.test function.
Dixon's test only considers the maximum (and possibly the minimum) as potential outliers.
A logical vector
Rob J Hyndman
Grubbs, F E (1950). Sample criteria for testing outlying observations. Annals of Mathematical Statistics, 21(1), 27–58.
Dixon, W J (1950). Analysis of extreme values. Annals of Mathematical Statistics, 21(4), 488–506.
Hyndman, R J (2026) "That's weird: Anomaly detection using R", Section 4.4-4.5, https://OTexts.com/weird/.
x <- c(rnorm(1000), 5:10) tibble(x = x) |> filter(grubbs_anomalies(x)) tibble(x = x) |> filter(dixon_anomalies(x)) y <- c(rnorm(1000), 5) tibble(y = y) |> filter(grubbs_anomalies(y)) tibble(y = y) |> filter(dixon_anomalies(y))x <- c(rnorm(1000), 5:10) tibble(x = x) |> filter(grubbs_anomalies(x)) tibble(x = x) |> filter(dixon_anomalies(x)) y <- c(rnorm(1000), 5) tibble(y = y) |> filter(grubbs_anomalies(y)) tibble(y = y) |> filter(dixon_anomalies(y))
A data set containing gun ownership rates and homicide rates for 2017 for various countries around the world. The gun ownership rates are the number of guns owned by civilians per 100 people. The homicide rates are the number of homicides per 100,000 people where the weapon was a firearm.
gun_deathsgun_deaths
A data frame with 77 rows and 4 columns:
Country name
World region according to Our World in Data
Gun ownership rate (number of guns owned by civilians per 100 people)
Homicide rate (number of homicides per 100,000 people where the weapon was a firearm)
Data frame
World Population Review https://worldpopulationreview.com/country-rankings/gun-ownership-by-country and https://ourworldindata.org/grapher/homicide-rates-from-firearms
gun_deathsgun_deaths
The Hampel filter is designed to find anomalies in time series data using mean absolute deviations in the vicinity of each observation.
hampel_anomalies(y, bandwidth, k = 3)hampel_anomalies(y, bandwidth, k = 3)
y |
numeric vector containing time series |
bandwidth |
integer width of the window around each observation |
k |
numeric number of standard deviations to declare an outlier |
First, a moving median is calculated using windows of size
2 * bandwidth + 1. Then the median absolute deviations from
this moving median are calculated in the same moving windows.
A point is declared an anomaly if its MAD value is more than k standard
deviations. The MAD is converted to a standard deviation using MAD * 1.4826,
which holds for normally distributed data.
The first bandwidth and last bandwidth observations cannot
be declared anomalies.
logical vector identifying which observations are anomalies.
Rob J Hyndman
Hyndman, R J (2026) "That's weird: Anomaly detection using R", Section 9.2, https://OTexts.com/weird/.
set.seed(1) df <- tibble( time = seq(41), y = c(rnorm(20), 5, rnorm(20)) ) |> mutate(hampel = hampel_anomalies(y, bandwidth = 3, k = 4)) df |> ggplot(aes(x = time, y = y)) + geom_line() + geom_point(data = df |> filter(hampel), col = "red")set.seed(1) df <- tibble( time = seq(41), y = c(rnorm(20), 5, rnorm(20)) ) |> mutate(hampel = hampel_anomalies(y, bandwidth = 3, k = 4)) df |> ggplot(aes(x = time, y = y)) + geom_line() + geom_point(data = df |> filter(hampel), col = "red")
Compute a table of highest density regions (HDR) for a distributional object.
The HDRs are returned as a tibble with one row per interval and columns:
prob (giving the probability coverage),
density (the value of the density at the boundary of the HDR),
For one dimensional density functions, the tibble also has columns
lower (the lower ends of the intervals), and
upper (the upper ends of the intervals).
hdr_table(object, prob)hdr_table(object, prob)
object |
Distributional object such as that returned by |
prob |
Vector of probabilities giving the HDR coverage (between 0 and 1) |
A tibble
Rob J Hyndman
Hyndman, R J (1996) "Computing and Graphing Highest Density Regions", The American Statistician, 50(2), 120–126. https://robjhyndman.com/publications/hdr/
Hyndman, R J (2026) "That's weird: Anomaly detection using R", Section 2.5, 3.4. https://OTexts.com/weird/.
# Univariate HDRs c(dist_normal(), dist_kde(c(rnorm(100), rnorm(100, 3, 1)))) |> hdr_table(c(0.5, 0.95)) dist_kde(oldfaithful$duration) |> hdr_table(0.95) # Bivariate HDRs dist_kde(oldfaithful[, c("duration", "waiting")]) |> hdr_table(0.90)# Univariate HDRs c(dist_normal(), dist_kde(c(rnorm(100), rnorm(100, 3, 1)))) |> hdr_table(c(0.5, 0.95)) dist_kde(oldfaithful$duration) |> hdr_table(0.95) # Bivariate HDRs dist_kde(oldfaithful[, c("duration", "waiting")]) |> hdr_table(0.90)
Bandwidth matrices are estimated in several ways including a normal reference rule, a robust version of the normal reference rule (default), a plugin estimator, or using the approach of Hyndman, Kandanaarachchi & Turner (2026). Details of each method are given in Hyndman (2026).
kde_bandwidth(data, method = c("robust", "normal", "plugin", "lookout"), ...)kde_bandwidth(data, method = c("robust", "normal", "plugin", "lookout"), ...)
data |
A numeric matrix or data frame. |
method |
A character string giving the method to use. Possibilities are:
|
... |
Additional arguments are ignored. |
A matrix of bandwidths (or a scalar in the case of univariate data).
Rob J Hyndman
Hyndman, R J, Kandanaarachchi, S & Turner, K (2026) "When lookout sees crackle: Anomaly detection via kernel density estimation", unpublished. https://robjhyndman.com/publications/lookout2.html
Hyndman, R J (2026) "That's weird: Anomaly detection using R", Section 2.7 and 3.9, https://OTexts.com/weird/.
# Univariate bandwidth calculation kde_bandwidth(oldfaithful$duration) # Bivariate bandwidth calculation kde_bandwidth(oldfaithful[, c("duration", "waiting")])# Univariate bandwidth calculation kde_bandwidth(oldfaithful$duration) # Bivariate bandwidth calculation kde_bandwidth(oldfaithful[, c("duration", "waiting")])
Compute local outlier factors using k nearest neighbours. A local
outlier factor is a measure of how anomalous each observation is based on
the density of neighbouring points.
The function uses dbscan::lof to do the calculation.
lof_scores(y, k = 10, ...)lof_scores(y, k = 10, ...)
y |
Numerical matrix or vector of data |
k |
Number of neighbours to include. Default: 5. |
... |
Additional arguments passed to |
Numerical vector containing LOF values
Rob J Hyndman
Hyndman, R J (2026) "That's weird: Anomaly detection using R", Section 7.3, https://OTexts.com/weird/.
dbscan::lof
y <- c(rnorm(49), 5) lof_scores(y)y <- c(rnorm(49), 5) lof_scores(y)
A multivariate version of base::scale(), that takes account
of the covariance matrix of the data. By default, robust estimates are used:
the centers are removed using medians,
the scale function for univariate data is s_Qn,
and the covariance matrix for multivariate data is estimated using a robust MCD estimate.
The data are scaled using the Cholesky decomposition of
the inverse (co)variance. Then the scaled data are returned.
Details of the methods are provided by Hyndman (2026).
mvscale( object, center = stats::median, scale = robustbase::s_Qn, cov = robustbase::covMcd, alpha = 0.9, warning = TRUE, ... )mvscale( object, center = stats::median, scale = robustbase::s_Qn, cov = robustbase::covMcd, alpha = 0.9, warning = TRUE, ... )
object |
A vector, matrix, or data frame containing some numerical data. |
center |
A function to compute the center of each numerical variable. Set to NULL if no centering is required. |
scale |
A function to scale each numerical variable. When
|
cov |
A function to compute the covariance matrix. Set to NULL if no rotation required. |
alpha |
When |
warning |
Should a warning be issued if non-numeric columns are ignored? |
... |
Other arguments are passed to |
Optionally, the centering and scaling can be done for each variable
separately, by setting cov = NULL, so there is no rotation of the data,
Also optionally, non-robust methods can be used by specifying center = mean,
scale = stats::sd(), and cov = stats::cov(). Any non-numeric columns are retained
with a warning.
Missing values are removed before the centers, scale and cov are estimated.
A vector, matrix or data frame of the same size and class as object,
but with numerical variables replaced by scaled versions (renamed if they have been rotated).
Rob J Hyndman
Hyndman, R J (2026) "That's weird: Anomaly detection using R", Section 2.6, 3.6 and 3.7.
base::scale(), stats::sd(), stats::cov(), robustbase::covOGK(), robustbase::s_Qn()
# Univariate z-scores z <- mvscale(oldfaithful$duration, center = mean, scale = sd) # Non-robust scaling with no rotation oldfaithful |> mvscale(center = mean, scale = sd, cov = NULL, warning = FALSE) # Non-robust scaling with rotation oldfaithful |> mvscale(center = mean, scale = sd, cov = stats::cov, warning = FALSE) # Robust scaling and rotation oldfaithful |> mvscale(warning = FALSE)# Univariate z-scores z <- mvscale(oldfaithful$duration, center = mean, scale = sd) # Non-robust scaling with no rotation oldfaithful |> mvscale(center = mean, scale = sd, cov = NULL, warning = FALSE) # Non-robust scaling with rotation oldfaithful |> mvscale(center = mean, scale = sd, cov = stats::cov, warning = FALSE) # Robust scaling and rotation oldfaithful |> mvscale(warning = FALSE)
A synthetic data set containing 1000 observations on 10 variables generated from independent standard normal distributions.
n01n01
A data frame with 1000 rows and 10 columns.
Data frame
Hyndman, R J (2026) "That's weird: Anomaly detection using R", Section 1.4, https://OTexts.com/weird/.
n01n01
A data set containing data on recorded eruptions of the Old Faithful Geyser in Yellowstone National Park, Wyoming, USA, from 14 January 2017 to 29 December 2023. Recordings are incomplete, especially during the winter months when observers may not be present.
oldfaithfuloldfaithful
A data frame with 2097 rows and 4 columns:
Time eruption started
Duration of eruption as recorded
Duration of eruption in seconds
Time to the following eruption in seconds
Data frame
Hyndman, R J (2026) "That's weird: Anomaly detection using R", Section 1.4, https://OTexts.com/weird/.
oldfaithful |> ggplot(aes(x = duration, y = waiting)) + geom_point()oldfaithful |> ggplot(aes(x = duration, y = waiting)) + geom_point()
Peirce's criterion and Chauvenet's criterion were both proposed in the 1800s as a way of determining what observations should be rejected in a univariate sample.
peirce_anomalies(y) chauvenet_anomalies(y)peirce_anomalies(y) chauvenet_anomalies(y)
y |
numerical vector of observations |
These functions take a univariate sample y and return a logical
vector indicating which observations should be considered anomalies according
to either Peirce's criterion or Chauvenet's criterion.
A logical vector
Rob J Hyndman
Peirce, B (1852). Criterion for the rejection of doubtful observations. The Astronomical Journal, 2(21), 161–163.
Chauvenet, W (1863). 'Method of least squares'. Appendix to Manual of Spherical and Practical Astronomy, Vol.2, Lippincott, Philadelphia, pp.469-566.
Hyndman, R J (2026) "That's weird: Anomaly detection using R", Section 4.3, https://OTexts.com/weird/.
y <- rnorm(1000) tibble(y = y) |> filter(peirce_anomalies(y)) tibble(y = y) |> filter(chauvenet_anomalies(y))y <- rnorm(1000) tibble(y = y) |> filter(peirce_anomalies(y)) tibble(y = y) |> filter(chauvenet_anomalies(y))
Test if observations are anomalies according to the stray algorithm.
stray_anomalies(y, ...)stray_anomalies(y, ...)
y |
A vector, matrix, or data frame consisting of numerical variables. |
... |
Other arguments are passed to |
Numerical vector containing logical values indicating if the observation is identified as an anomaly using the stray algorithm.
Rob J Hyndman
Talagala, P D, Hyndman, R J, and Smith-Miles, K (2021) Anomaly detection in high-dimensional data, Journal of Computational and Graphical Statistics, 30(2), 360-374.
# Univariate data y <- c(6, rnorm(49)) stray_anomalies(y) # Bivariate data y <- cbind(rnorm(50), c(5, rnorm(49))) stray_anomalies(y)# Univariate data y <- c(6, rnorm(49)) stray_anomalies(y) # Bivariate data y <- cbind(rnorm(50), c(5, rnorm(49))) stray_anomalies(y)
Compute stray scores indicating how anomalous each observation is.
stray_scores(y, ...)stray_scores(y, ...)
y |
A vector, matrix, or data frame consisting of numerical variables. |
... |
Other arguments are passed to |
Numerical vector containing stray scores.
Rob J Hyndman
Talagala, P D, Hyndman, R J, and Smith-Miles, K (2021) Anomaly detection in high-dimensional data, Journal of Computational and Graphical Statistics, 30(2), 360-374.
# Univariate data y <- c(6, rnorm(49)) scores <- stray_scores(y) threshold <- stray::find_threshold(scores, alpha = 0.01, outtail = "max", p = 0.5, tn = 50) which(scores > threshold)# Univariate data y <- c(6, rnorm(49)) scores <- stray_scores(y) threshold <- stray::find_threshold(scores, alpha = 0.01, outtail = "max", p = 0.5, tn = 50) which(scores > threshold)
A surprisal is given by where is the
density or probability mass function of the estimated or assumed distribution,
and is an observation. This is returned by surprisals().
A surprisal probability is the probability of a surprisal at least as extreme
as . This is returned by surprisals_prob()
surprisals(object, ...) surprisals_prob( object, approximation = c("none", "gpd", "empirical", "rank"), threshold_probability = 0.1, ... )surprisals(object, ...) surprisals_prob( object, approximation = c("none", "gpd", "empirical", "rank"), threshold_probability = 0.1, ... )
object |
A model or numerical data set |
... |
Other arguments are passed to the appropriate method. |
approximation |
Character string specifying the method to use in computing the surprisal probabilities. See Details below. |
threshold_probability |
Probability threshold when computing the GPD
approximation. This is the probability below which the GPD is fitted. Only
used if |
The surprisal probabilities may be computed in three different ways.
When approximation = "none" (the default), the surprisal probabilities are computed
using the same distribution that was used to compute the surprisal values.
Under this option, surprisal probabilities are equal to 1 minus the
coverage probability of the largest HDR that contains each value. Surprisal
probabilities smaller than 1e-6 are returned as 1e-6.
When approximation = "gdp", the surprisal probabilities are
computed using a Generalized Pareto Distribution fitted to the most extreme
surprisal values (those with probability less than threshold_probability).
For surprisal probabilities greater than threshold_probability, the value of
threshold_probability is returned. Under this option, the distribution is
used for computing the surprisal values but not for determining their
probabilities. Due to extreme value theory, the resulting probabilities should
be relatively insensitive to the distribution used in computing the surprisal
values.
When approximation = "empirical" (or "rank"), the surprisal probability of each
observation is estimated using the proportion of observations with
greater or equal surprisal values. This is a nonparametric approach that is also
insensitive to the distribution used in computing the surprisal values.
A numerical vector containing the surprisals or surprisal probabilities.
Rob J Hyndman
Hyndman, R J (2026) "That's weird: Anomaly detection using R", Chapter 6, https://OTexts.com/weird/.
Hyndman, R J & Frazier, D T (2026) "Anomaly detection using surprisals", https://robjhyndman.com/publications/surprisals.html.
For specific methods, see surprisals.numeric() and surprisals.lm(),
A surprisal is given by where is the
density or probability mass function of the estimated or assumed distribution,
and is an observation. This is returned by surprisals().
A surprisal probability is the probability of a surprisal at least as extreme
as . This is returned by surprisals_prob()
## S3 method for class 'lm' surprisals(object, loo = FALSE, ...) ## S3 method for class 'glm' surprisals(object, ...) ## S3 method for class 'gam' surprisals(object, ...) ## S3 method for class 'lm' surprisals_prob( object, approximation = c("none", "gpd", "empirical", "rank"), threshold_probability = 0.1, loo = FALSE, ... ) ## S3 method for class 'glm' surprisals_prob( object, approximation = c("none", "gpd", "empirical", "rank"), threshold_probability = 0.1, ... ) ## S3 method for class 'gam' surprisals_prob( object, approximation = c("none", "gpd", "empirical", "rank"), threshold_probability = 0.1, ... )## S3 method for class 'lm' surprisals(object, loo = FALSE, ...) ## S3 method for class 'glm' surprisals(object, ...) ## S3 method for class 'gam' surprisals(object, ...) ## S3 method for class 'lm' surprisals_prob( object, approximation = c("none", "gpd", "empirical", "rank"), threshold_probability = 0.1, loo = FALSE, ... ) ## S3 method for class 'glm' surprisals_prob( object, approximation = c("none", "gpd", "empirical", "rank"), threshold_probability = 0.1, ... ) ## S3 method for class 'gam' surprisals_prob( object, approximation = c("none", "gpd", "empirical", "rank"), threshold_probability = 0.1, ... )
object |
A model object such as returned by |
loo |
Should leave-one-out surprisals be computed? For computational
reasons, this is only available for |
... |
Other arguments are ignored. |
approximation |
Character string specifying the method to use in computing the surprisal probabilities. See Details below. |
threshold_probability |
Probability threshold when computing the GPD
approximation. This is the probability below which the GPD is fitted. Only
used if |
The surprisal probabilities may be computed in three different ways.
When approximation = "none" (the default), the surprisal probabilities are computed
using the same distribution that was used to compute the surprisal values.
Under this option, surprisal probabilities are equal to 1 minus the
coverage probability of the largest HDR that contains each value. Surprisal
probabilities smaller than 1e-6 are returned as 1e-6.
When approximation = "gdp", the surprisal probabilities are
computed using a Generalized Pareto Distribution fitted to the most extreme
surprisal values (those with probability less than threshold_probability).
For surprisal probabilities greater than threshold_probability, the value of
threshold_probability is returned. Under this option, the distribution is
used for computing the surprisal values but not for determining their
probabilities. Due to extreme value theory, the resulting probabilities should
be relatively insensitive to the distribution used in computing the surprisal
values.
When approximation = "empirical" (or "rank"), the surprisal probability of each
observation is estimated using the proportion of observations with
greater or equal surprisal values. This is a nonparametric approach that is also
insensitive to the distribution used in computing the surprisal values.
A numerical vector containing the surprisals or surprisal probabilities.
Rob J Hyndman
Hyndman, R J (2026) "That's weird: Anomaly detection using R", Chapter 6, https://OTexts.com/weird/.
Hyndman, R J & Frazier, D T (2026) "Anomaly detection using surprisals", https://robjhyndman.com/publications/surprisals.html.
For specific methods, see surprisals.numeric() and surprisals.lm(),
# A linear model (i.e., a conditional Gaussian distribution) lm_of <- lm(waiting ~ duration, data = oldfaithful) oldfaithful |> mutate( fscore = surprisals_prob(lm_of), prob = surprisals_prob(lm_of, loo = TRUE), ) |> ggplot(aes( x = duration, y = waiting, color = prob < 0.01 )) + geom_point() # A Poisson GLM glm_breaks <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson) warpbreaks |> mutate(prob = surprisals_prob(glm_breaks)) |> filter(prob < 0.05)# A linear model (i.e., a conditional Gaussian distribution) lm_of <- lm(waiting ~ duration, data = oldfaithful) oldfaithful |> mutate( fscore = surprisals_prob(lm_of), prob = surprisals_prob(lm_of, loo = TRUE), ) |> ggplot(aes( x = duration, y = waiting, color = prob < 0.01 )) + geom_point() # A Poisson GLM glm_breaks <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson) warpbreaks |> mutate(prob = surprisals_prob(glm_breaks)) |> filter(prob < 0.05)
A surprisal is given by where is the
density or probability mass function of the estimated or assumed distribution,
and is an observation. This is returned by surprisals().
A surprisal probability is the probability of a surprisal at least as extreme
as . This is returned by surprisals_prob()
## S3 method for class 'numeric' surprisals(object, distribution = dist_kde(object, ...), loo = FALSE, ...) ## S3 method for class 'matrix' surprisals(object, distribution = dist_kde(object, ...), loo = FALSE, ...) ## S3 method for class 'data.frame' surprisals(object, distribution = dist_kde(object, ...), loo = FALSE, ...) ## S3 method for class 'numeric' surprisals_prob( object, approximation = c("none", "gpd", "empirical", "rank"), threshold_probability = 0.1, distribution = dist_kde(object, ...), loo = FALSE, ... ) ## S3 method for class 'matrix' surprisals_prob( object, approximation = c("none", "gpd", "empirical", "rank"), threshold_probability = 0.1, distribution = dist_kde(object, ...), loo = FALSE, ... ) ## S3 method for class 'data.frame' surprisals_prob( object, approximation = c("none", "gpd", "empirical", "rank"), threshold_probability = 0.1, distribution = dist_kde(object, ...), loo = FALSE, ... )## S3 method for class 'numeric' surprisals(object, distribution = dist_kde(object, ...), loo = FALSE, ...) ## S3 method for class 'matrix' surprisals(object, distribution = dist_kde(object, ...), loo = FALSE, ...) ## S3 method for class 'data.frame' surprisals(object, distribution = dist_kde(object, ...), loo = FALSE, ...) ## S3 method for class 'numeric' surprisals_prob( object, approximation = c("none", "gpd", "empirical", "rank"), threshold_probability = 0.1, distribution = dist_kde(object, ...), loo = FALSE, ... ) ## S3 method for class 'matrix' surprisals_prob( object, approximation = c("none", "gpd", "empirical", "rank"), threshold_probability = 0.1, distribution = dist_kde(object, ...), loo = FALSE, ... ) ## S3 method for class 'data.frame' surprisals_prob( object, approximation = c("none", "gpd", "empirical", "rank"), threshold_probability = 0.1, distribution = dist_kde(object, ...), loo = FALSE, ... )
object |
A numerical data set (either a vector, matrix, or a data.frame containing only numerical columns). |
distribution |
A distribution object. By default, a kernel density
estimate is computed from the data |
loo |
Should leave-one-out surprisals be computed? |
... |
Other arguments are passed to the appropriate method. |
approximation |
Character string specifying the method to use in computing the surprisal probabilities. See Details below. For a multivariate data set, it needs to be set to either "gpd" or "empirical". |
threshold_probability |
Probability threshold when computing the GPD
approximation. This is the probability below which the GPD is fitted. Only
used if |
The surprisal probabilities may be computed in three different ways.
When approximation = "none" (the default), the surprisal probabilities are computed
using the same distribution that was used to compute the surprisal values.
Under this option, surprisal probabilities are equal to 1 minus the
coverage probability of the largest HDR that contains each value. Surprisal
probabilities smaller than 1e-6 are returned as 1e-6.
When approximation = "gdp", the surprisal probabilities are
computed using a Generalized Pareto Distribution fitted to the most extreme
surprisal values (those with probability less than threshold_probability).
For surprisal probabilities greater than threshold_probability, the value of
threshold_probability is returned. Under this option, the distribution is
used for computing the surprisal values but not for determining their
probabilities. Due to extreme value theory, the resulting probabilities should
be relatively insensitive to the distribution used in computing the surprisal
values.
When approximation = "empirical" (or "rank"), the surprisal probability of each
observation is estimated using the proportion of observations with
greater or equal surprisal values. This is a nonparametric approach that is also
insensitive to the distribution used in computing the surprisal values.
A numerical vector containing the surprisals or surprisal probabilities.
Rob J Hyndman
Hyndman, R J (2026) "That's weird: Anomaly detection using R", Chapter 6, https://OTexts.com/weird/.
Hyndman, R J & Frazier, D T (2026) "Anomaly detection using surprisals", https://robjhyndman.com/publications/surprisals.html.
# Univariate data tibble( y = c(5, rnorm(49)), p_kde = surprisals_prob(y, loo = TRUE), p_normal = surprisals_prob(y, distribution = dist_normal()), p_zscore = 2 * (1 - pnorm(abs(y))) ) tibble( y = n01$v1, prob1 = surprisals_prob(y), prob2 = surprisals_prob(y, loo = TRUE), prob3 = surprisals_prob(y, distribution = dist_normal()), prob4 = surprisals_prob(y, distribution = dist_normal(), approximation = "gpd") ) |> arrange(prob1) # Bivariate data tibble( x = rnorm(50), y = c(5, rnorm(49)), prob = surprisals_prob(cbind(x, y), approximation = "gpd") ) oldfaithful |> mutate( s = surprisals(cbind(duration, waiting), loo = TRUE), p = surprisals_prob(cbind(duration, waiting), loo = TRUE, approximation = "gpd") ) |> arrange(p)# Univariate data tibble( y = c(5, rnorm(49)), p_kde = surprisals_prob(y, loo = TRUE), p_normal = surprisals_prob(y, distribution = dist_normal()), p_zscore = 2 * (1 - pnorm(abs(y))) ) tibble( y = n01$v1, prob1 = surprisals_prob(y), prob2 = surprisals_prob(y, loo = TRUE), prob3 = surprisals_prob(y, distribution = dist_normal()), prob4 = surprisals_prob(y, distribution = dist_normal(), approximation = "gpd") ) |> arrange(prob1) # Bivariate data tibble( x = rnorm(50), y = c(5, rnorm(49)), prob = surprisals_prob(cbind(x, y), approximation = "gpd") ) oldfaithful |> mutate( s = surprisals(cbind(duration, waiting), loo = TRUE), p = surprisals_prob(cbind(duration, waiting), loo = TRUE, approximation = "gpd") ) |> arrange(p)