Title: Functions and Data Sets for "That's Weird: Anomaly Detection Using R" by Rob J Hyndman
Description: All functions and data sets required for the examples in the book Hyndman (2024) "That's Weird: Anomaly Detection Using R" <https://OTexts.com/weird/>. All packages needed to run the examples are also loaded.
Authors: Rob Hyndman [aut, cre, cph], RStudio [cph]
Maintainer: Rob Hyndman <[email protected]>
License: GPL-3
Version: 1.0.2.9000
Built: 2024-11-13 07:15:48 UTC
Source: https://github.com/robjhyndman/weird-package
A data set containing career batting statistics for all international test players (men and women) up to 6 October 2021.
cricket_batting
A data frame with 3754 rows and 15 variables:
Player: Player name in the form "initials surname"
Country: Country played for
Start: First year of test playing career
End: Last year of test playing career
Matches: Number of matches played
Innings: Number of innings batted
NotOuts: Number of times not out
Runs: Total runs scored
HighScore: Highest score in an innings
HighScoreNotOut: Was the highest score not out?
Average: Batting average at end of career
Hundreds: Total number of 100s scored
Fifties: Total number of 50s scored
Ducks: Total number of 0s scored
Gender: "Men" or "Women"
Data frame
cricket_batting |>
  filter(Innings > 20) |>
  select(Player, Country, Matches, Runs, Average, Hundreds, Fifties, Ducks) |>
  arrange(desc(Average))
Creates a distributional object using a density specified as a pair of vectors giving (x, f(x)). The density is assumed to be piecewise linear between the points provided, and 0 outside the range of x.
dist_density(x, density)
x: Numerical vector of ordinates, or a list of such vectors.
density: Numerical vector of density values, or a list of such vectors.
dist_density(seq(-4, 4, by = 0.01), dnorm(seq(-4, 4, by = 0.01)))
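Any valid (x, f(x)) grid can be supplied, not only one built from dnorm(). As an illustrative sketch (not from the package documentation), a triangular density on (0, 2):

x <- seq(0, 2, by = 0.01)
dist_density(x, pmax(0, 1 - abs(x - 1)))  # triangular density with peak 1 at x = 1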
Creates a distributional object using a kernel density estimate with a Gaussian kernel obtained from the kde() function. The bandwidth can be specified; otherwise the kde_bandwidth() function is used. The cdf, quantiles and moments are all consistent with the kde. Generating random values from the kde is equivalent to a smoothed bootstrap.
dist_kde(y, h = NULL, H = NULL, multiplier = 1, ...)
y: Numerical vector or matrix of data, or a list of such objects. If a list is provided, then all objects should be of the same dimension: e.g., all vectors, or all matrices with the same number of columns.
h: Bandwidth for univariate distributions. If NULL, the kde_bandwidth() function is used.
H: Bandwidth matrix for multivariate distributions. If NULL, the kde_bandwidth() function is used.
multiplier: Multiplier for the bandwidth passed to kde_bandwidth().
...: Other arguments are passed to kde().
dist_kde(c(rnorm(200), rnorm(100, 5)), multiplier = 2)
dist_kde(cbind(rnorm(200), rnorm(200, 5)))
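Since the result is a distributional object, the usual generics apply to it. A brief sketch (assuming the distributional package is attached, as it is when weird is loaded):

d <- dist_kde(c(rnorm(200), rnorm(100, 5)))
mean(d)            # moments are consistent with the kde
quantile(d, 0.95)  # quantiles from the kde cdf
cdf(d, 3)
generate(d, 5)     # equivalent to a smoothed bootstrap sample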
A data set containing reviews of wines from 44 countries, taken from Wine Enthusiast Magazine during the week of 15 June 2017. The data are downloaded and returned.
fetch_wine_reviews()
A data frame with 110,203 rows and 8 columns:
country: Country of origin
state: State or province of origin
region: Region of origin
winery: Name of vineyard that made the wine
variety: Variety of grape
points: Points allocated by WineEnthusiast reviewer on a scale of 0-100
price: Price of a bottle of wine in $US
year: Year of wine extracted from title
Data frame
## Not run:
wine_reviews <- fetch_wine_reviews()
wine_reviews |>
  ggplot(aes(x = points, y = price)) +
  geom_jitter(height = 0, width = 0.2, alpha = 0.1) +
  scale_y_log10()
## End(Not run)
A data set containing French mortality rates between the years 1816 and 1999, by age and sex.
fr_mortality
A data frame with 31,648 rows and 4 columns.
Data frame
Human Mortality Database https://www.mortality.org
fr_mortality
Produces a bivariate bagplot. A bagplot is analogous to a univariate boxplot, except that it is in two dimensions. Like a boxplot, it shows the median, a region containing 50% of the observations, a larger region containing the remaining observations other than outliers, and any outliers.
gg_bagplot(data, var1, var2, color = "#00659e", scatterplot = FALSE, ...)
data: A data frame or matrix containing the data.
var1: The name of the first variable to plot (a bare expression).
var2: The name of the second variable to plot (a bare expression).
color: The base color to use for the median. Other colors are generated as a mixture of color with white.
scatterplot: A logical argument indicating if a regular bagplot is required (FALSE, the default), or a scatterplot in the same colors (TRUE).
...: Other arguments are passed to the underlying plotting functions.
A ggplot object showing a bagplot or scatterplot of the data.
Rob J Hyndman
Rousseeuw, P. J., Ruts, I., & Tukey, J. W. (1999). The bagplot: A bivariate boxplot. The American Statistician, 53(4), 382–387.
gg_bagplot(n01, v1, v2)
gg_bagplot(n01, v1, v2, scatterplot = TRUE)
Produce a ggplot of densities from distributional objects in one or two dimensions.
gg_density(
  object,
  prob = seq(9) / 10,
  hdr = NULL,
  show_points = FALSE,
  show_mode = FALSE,
  show_anomalies = FALSE,
  colors = c("#0072b2", "#D55E00", "#009E73", "#CC79A7", "#E69F00", "#56B4E9", "#F0E442", "#333333"),
  alpha = NULL,
  jitter = FALSE,
  ngrid = 501
)
object: A distribution object from the distributional package, such as that returned by dist_kde().
prob: Probability of the HDRs to be drawn.
hdr: Character string describing how the HDRs are to be shown. Options are "none", "fill", "points" and "contours" (the latter only for bivariate plots). If NULL, a default is chosen depending on the type of distribution.
show_points: If TRUE, the observations are shown as points.
show_mode: If TRUE, the mode of the distribution is shown as a point.
show_anomalies: If TRUE, surprisal anomalies are shown in black.
colors: Color palette to use. If there are more than length(colors) distributions, the colors are recycled.
alpha: Transparency of points. Ignored if show_points is FALSE.
jitter: For univariate distributions, should the points be vertically jittered to reduce overplotting when show_points = TRUE?
ngrid: Number of grid points to use for the density function.
This function produces a ggplot of a density from a distributional object. For univariate densities, it produces a line plot of the density function, with an optional ribbon showing some highest density regions (HDRs) and/or the observations. For bivariate densities, it produces an HDR contour plot of the density function, with the observations optionally shown as points. The mode can also be drawn as a point. The combination of hdr = "fill", show_points = TRUE, show_mode = TRUE, and prob = c(0.5, 0.99) is equivalent to showing HDR boxplots.
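For instance, the HDR-boxplot-style display just described can be requested directly; a short sketch using only the arguments documented here:

kde <- dist_kde(c(rnorm(500), rnorm(500, 4, 0.5)))
gg_density(kde, hdr = "fill", show_points = TRUE, show_mode = TRUE, prob = c(0.5, 0.99))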
A ggplot object.
Rob J Hyndman
# Univariate densities
kde <- dist_kde(c(rnorm(500), rnorm(500, 4, .5)))
gg_density(kde,
  hdr = "fill", prob = c(0.5, 0.95), color = "#c14b14",
  show_mode = TRUE, show_points = TRUE, jitter = TRUE
)
c(dist_normal(), kde) |>
  gg_density(hdr = "fill", prob = c(0.5, 0.95))
# Bivariate density
tibble(y1 = rnorm(5000), y2 = y1 + rnorm(5000)) |>
  dist_kde() |>
  gg_density(show_points = TRUE, alpha = 0.1, hdr = "fill")
Add a ggplot layer of densities from distributional objects in one or two dimensions.
gg_density_layer(object, scale = 1, ngrid = 501, ...)
object: A distribution object from the distributional package, such as that returned by dist_kde().
scale: Scaling factor for the density function.
ngrid: Number of grid points to use for the density function.
...: Additional arguments are passed to the underlying ggplot2 layer (e.g., the line geom for univariate densities).
This function adds a ggplot layer of a density from a distributional object. For univariate densities, it adds a line plot of the density function. For bivariate densities, it adds a contour plot of the density function.
A ggplot layer
Rob J Hyndman
dist_mixture(
  dist_normal(-2, 1),
  dist_normal(2, 1),
  weights = c(1 / 3, 2 / 3)
) |>
  gg_density() +
  gg_density_layer(dist_normal(-2, 1), linetype = "dashed", scale = 1 / 3) +
  gg_density_layer(dist_normal(2, 1), linetype = "dashed", scale = 2 / 3)
Produces a 1d or 2d boxplot of highest density regions (HDRs). The darker regions contain observations with higher probability, while the lighter regions contain points with lower probability. Observations outside the largest HDR are shown as individual points. Anomalies with leave-one-out surprisal probabilities less than 0.005 are optionally shown in black.
gg_hdrboxplot(
  data,
  var1,
  var2 = NULL,
  prob = c(0.5, 0.99),
  color = "#0072b2",
  show_points = FALSE,
  show_anomalies = TRUE,
  scatterplot = show_points,
  alpha = NULL,
  jitter = TRUE,
  ngrid = 501,
  ...
)
data: A data frame or matrix containing the data.
var1: The name of the first variable to plot (a bare expression).
var2: Optionally, the name of the second variable to plot (a bare expression).
prob: A numeric vector specifying the coverage probabilities for the HDRs.
color: The base color to use for the mode. Colors for the HDRs are generated by whitening this color.
show_points: A logical argument indicating if a regular HDR plot is required (FALSE, the default), or a scatterplot in the same colors (TRUE).
show_anomalies: A logical argument indicating if the surprisal anomalies should be shown (in black). These are points with leave-one-out surprisal probability values less than 0.005, and which lie outside the 99% HDR region.
scatterplot: Equivalent to show_points.
alpha: Transparency of points. Ignored if show_points is FALSE.
jitter: A logical value indicating if the points should be vertically jittered for the 1d boxplots to reduce overplotting.
ngrid: Number of grid points to use for the density function.
...: Other arguments passed to dist_kde().
The original HDR boxplot proposed by Hyndman (1996) can be produced with show_anomalies = FALSE, jitter = FALSE, alpha = 1, and all other arguments set to their defaults.
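For example, the following sketch reproduces the 1996-style plot for the df object used in the examples below:

df <- data.frame(x = c(rnorm(1000), rnorm(1000, 5, 1), 10))
gg_hdrboxplot(df, x, show_anomalies = FALSE, jitter = FALSE, alpha = 1)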
A ggplot object showing an HDR plot or scatterplot of the data.
Rob J Hyndman
Hyndman, R J (1996) Computing and Graphing Highest Density Regions, The American Statistician, 50(2), 120–126. https://robjhyndman.com/publications/hdr/
df <- data.frame(x = c(rnorm(1000), rnorm(1000, 5, 1), 10))
gg_hdrboxplot(df, x, show_anomalies = TRUE)
cricket_batting |>
  filter(Innings > 20) |>
  gg_hdrboxplot(Average)
oldfaithful |>
  filter(duration < 7000, waiting < 7000) |>
  gg_hdrboxplot(duration, waiting, show_points = TRUE)
Compute the Global-Local Outlier Score from Hierarchies (GLOSH). This is based on hierarchical clustering, where the minimum cluster size is k. The resulting outlier score is a measure of how anomalous each observation is. The function uses dbscan::hdbscan() to do the calculation.
glosh_scores(y, k = 10, ...)
y: Numerical matrix or vector of data.
k: Minimum cluster size. Default: 10.
...: Additional arguments are passed to dbscan::hdbscan().
Numerical vector containing GLOSH values
Rob J Hyndman
See also: dbscan::glosh()
y <- c(rnorm(49), 5)
glosh_scores(y)
Grubbs' test (proposed in 1950) identifies possible anomalies in univariate data using z-scores assuming the data come from a normal distribution. Dixon's test (also from 1950) compares the difference in the largest two values to the range of the data. Critical values for Dixon's test have been computed using simulation with interpolation using a quadratic model on logit(alpha) and log(log(n)).
grubbs_anomalies(y, alpha = 0.05)
dixon_anomalies(y, alpha = 0.05, two_sided = TRUE)
y: Numerical vector of observations.
alpha: Size of the test.
two_sided: If TRUE, both the minimum and the maximum are considered as potential outliers; otherwise only the maximum.
Grubbs' test is based on z-scores, and a point is identified as an anomaly when the associated absolute z-score is greater than a threshold value. A vector of logical values is returned, where TRUE indicates an anomaly. This version of Grubbs' test looks for outliers anywhere in the sample. Grubbs' original test came in several variations, which looked for one outlier, two outliers in one tail, or two outliers on opposite tails. These variations are implemented in the outliers::grubbs.test() function. Dixon's test only considers the maximum (and possibly the minimum) as potential outliers.
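To make the z-score idea concrete, here is a minimal sketch; the threshold of 3 is a placeholder for illustration, not the alpha- and n-dependent critical value that grubbs_anomalies() computes:

y <- c(rnorm(100), 6)
z <- abs(y - mean(y)) / sd(y)  # absolute z-scores
which(z > 3)                   # flag points with extreme absolute z-scores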
A logical vector
Rob J Hyndman
Grubbs, F. E. (1950). Sample criteria for testing outlying observations. Annals of Mathematical Statistics, 21(1), 27–58.
Dixon, W. J. (1950). Analysis of extreme values. Annals of Mathematical Statistics, 21(4), 488–506.
x <- c(rnorm(1000), 5:10)
tibble(x = x) |> filter(grubbs_anomalies(x))
tibble(x = x) |> filter(dixon_anomalies(x))
y <- c(rnorm(1000), 5)
tibble(y = y) |> filter(grubbs_anomalies(y))
tibble(y = y) |> filter(dixon_anomalies(y))
The Hampel filter is designed to find anomalies in time series data using median absolute deviations in the vicinity of each observation.
hampel_anomalies(y, bandwidth, k = 3)
y: Numeric vector containing a time series.
bandwidth: Integer width of the window around each observation.
k: Number of standard deviations beyond which an observation is declared an anomaly.
First, a moving median is calculated using windows of size 2 * bandwidth + 1. Then the median absolute deviations (MADs) from this moving median are calculated in the same moving windows. A point is declared an anomaly if its deviation from the moving median is more than k standard deviations, where the MAD is converted to a standard deviation using MAD * 1.482602 (which holds for normally distributed data). The first bandwidth and last bandwidth observations cannot be declared anomalies.
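A minimal base-R sketch of this logic, as an illustration of the description above rather than the package's implementation:

hampel_sketch <- function(y, bandwidth, k = 3) {
  n <- length(y)
  out <- rep(FALSE, n)
  for (i in (bandwidth + 1):(n - bandwidth)) {
    window <- y[(i - bandwidth):(i + bandwidth)]  # width 2 * bandwidth + 1
    m <- median(window)
    s <- 1.482602 * median(abs(window - m))  # MAD rescaled to a standard deviation
    out[i] <- abs(y[i] - m) > k * s
  }
  out  # first and last `bandwidth` observations remain FALSE
}
which(hampel_sketch(c(rnorm(20), 5, rnorm(20)), bandwidth = 3, k = 4))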
A logical vector identifying which observations are anomalies.
Rob J Hyndman
set.seed(1)
df <- tibble(
  time = seq(41),
  y = c(rnorm(20), 5, rnorm(20))
) |>
  mutate(hampel = hampel_anomalies(y, bandwidth = 3, k = 4))
df |>
  ggplot(aes(x = time, y = y)) +
  geom_line() +
  geom_point(data = df |> filter(hampel), col = "red")
Compute a table of highest density regions (HDRs) for a distributional object. The HDRs are returned as a tibble with one row per interval and columns: prob (the probability coverage) and density (the value of the density at the boundary of the HDR). For one-dimensional density functions, the tibble also has columns lower (the lower ends of the intervals) and upper (the upper ends of the intervals).
hdr_table(object, prob)
object: Distributional object such as that returned by dist_kde().
prob: Vector of probabilities giving the HDR coverage (between 0 and 1).
A tibble
Rob J Hyndman
# Univariate HDRs
c(dist_normal(), dist_kde(c(rnorm(100), rnorm(100, 3, 1)))) |>
  hdr_table(c(0.5, 0.95))
dist_kde(oldfaithful$duration) |> hdr_table(0.95)
# Bivariate HDRs
dist_kde(oldfaithful[, c("duration", "waiting")]) |> hdr_table(0.90)
Robust bandwidth estimation for kernel density estimation
kde_bandwidth(data, multiplier = 1)
data: A numeric matrix or data frame.
multiplier: Bandwidths are chosen using a robust version of the normal reference rule, multiplied by this constant. The default is 1.
A matrix of bandwidths (or scalar in the case of univariate data).
Rob J Hyndman
# Univariate bandwidth calculation
kde_bandwidth(oldfaithful$duration)
# Bivariate bandwidth calculation
kde_bandwidth(oldfaithful[, 2:3])
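For comparison, here is a sketch of the classical (non-robust) univariate normal reference rule; the constant 1.06 is the textbook Silverman value, and the robust details of kde_bandwidth() will differ:

nrr <- function(y) 1.06 * sd(y) * length(y)^(-1 / 5)  # classical normal reference rule
nrr(oldfaithful$duration)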
Compute local outlier factors using k nearest neighbours. A local outlier factor is a measure of how anomalous each observation is, based on the density of neighbouring points. The function uses dbscan::lof() to do the calculation.
lof_scores(y, k = 10, ...)
y: Numerical matrix or vector of data.
k: Number of neighbours to include. Default: 10.
...: Additional arguments are passed to dbscan::lof().
Numerical vector containing LOF values
Rob J Hyndman
See also: dbscan::lof()
y <- c(rnorm(49), 5)
lof_scores(y)
The lookout algorithm (Kandanaarachchi & Hyndman, 2022) computes leave-one-out surprisal probabilities from a kernel density estimate using a Generalized Pareto distribution. The kernel density estimate uses a bandwidth based on topological data analysis and a quadratic kernel, so it is similar but not identical to using surprisals() with loo = TRUE and approximation = "gpd". A low probability indicates a likely anomaly.
lookout_prob(object, ...)
object: A numerical data set.
...: Other arguments are passed to lookout::lookout().
A numerical vector containing the lookout probabilities
Rob J Hyndman
Sevvandi Kandanaarachchi & Rob J Hyndman (2022). Leave-one-out kernel density estimates for outlier detection. Journal of Computational and Graphical Statistics, 31(2), 586–599. https://robjhyndman.com/publications/lookout/
# Univariate data
tibble(
  y = c(5, rnorm(49)),
  lookout = lookout_prob(y)
)
# Bivariate data
tibble(
  x = rnorm(50),
  y = c(5, rnorm(49)),
  lookout = lookout_prob(cbind(x, y))
)
# Using a regression model
of <- oldfaithful |> filter(duration < 7200, waiting < 7200)
fit_of <- lm(waiting ~ duration, data = of)
broom::augment(fit_of) |>
  mutate(lookout = lookout_prob(.std.resid)) |>
  arrange(lookout)
A multivariate version of base::scale() that takes account of the covariance matrix of the data, and uses robust estimates of center, scale and covariance by default. The centers are removed using medians, the scales are computed using the robust Qn estimate, and the covariance matrix is estimated using a robust OGK estimate. The data are scaled using the Cholesky decomposition of the inverse covariance, and the scaled data are returned. This is useful for computing pairwise Mahalanobis distances.
mvscale(
  object,
  center = stats::median,
  scale = robustbase::s_Qn,
  cov = robustbase::covOGK,
  warning = TRUE
)
object: A vector, matrix, or data frame containing some numerical data.
center: A function to compute the center of each numerical variable. Set to NULL if no centering is required.
scale: A function to scale each numerical variable. When cov = NULL, it is used to scale each variable separately.
cov: A function to compute the covariance matrix. Set to NULL if no rotation is required.
warning: Should a warning be issued if non-numeric columns are ignored?
Optionally, the centering and scaling can be done for each variable separately, so there is no rotation of the data, by setting cov = NULL. Also optionally, non-robust methods can be used by specifying center = mean, scale = stats::sd, and cov = stats::cov. Any non-numeric columns are retained with a warning.
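A small non-robust sketch of why the Cholesky step is useful: after scaling by the Cholesky factor of the inverse covariance, Euclidean distances equal Mahalanobis distances. This illustration uses stats::cov and the mean rather than the robust defaults:

X <- matrix(rnorm(200), ncol = 2)
X[, 2] <- X[, 1] + 0.5 * X[, 2]  # induce correlation
Xc <- sweep(X, 2, colMeans(X))   # center each column
U <- chol(solve(cov(X)))         # Cholesky factor of the inverse covariance
Z <- Xc %*% t(U)                 # rotated and scaled data
# Squared Euclidean norms of the scaled rows equal squared Mahalanobis distances:
all.equal(rowSums(Z^2), mahalanobis(X, colMeans(X), cov(X)))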
A vector, matrix or data frame of the same size and class as object, but with numerical variables replaced by scaled versions.
Rob J Hyndman
# Univariate z-scores (no rotation)
mvscale(oldfaithful, center = mean, scale = sd, cov = NULL, warning = FALSE)
# Non-robust scaling with rotation
mvscale(oldfaithful, center = mean, cov = stats::cov, warning = FALSE)
mvscale(oldfaithful, warning = FALSE)
# Robust Mahalanobis distances
oldfaithful |>
  select(-time) |>
  mvscale() |>
  head(5) |>
  dist()
A synthetic data set containing 1000 observations on 10 variables generated from independent standard normal distributions.
n01
A data frame with 1000 rows and 10 columns.
Data frame
n01
A data set containing data on recorded eruptions of the Old Faithful Geyser in Yellowstone National Park, Wyoming, USA, from 1 January 2015 to 1 October 2021. Recordings are incomplete, especially during the winter months when observers may not be present.
oldfaithful
A data frame with 2261 rows and 3 columns:
time: Time eruption started
duration: Duration of eruption in seconds
waiting: Time to the following eruption
Data frame
oldfaithful |>
  filter(duration < 7000, waiting < 7000) |>
  ggplot(aes(x = duration, y = waiting)) +
  geom_point()
Peirce's criterion and Chauvenet's criterion were both proposed in the 1800s as ways of determining which observations in a univariate sample should be rejected.
peirce_anomalies(y)
chauvenet_anomalies(y)
y: Numerical vector of observations.
These functions take a univariate sample y and return a logical vector indicating which observations should be considered anomalies according to either Peirce's criterion or Chauvenet's criterion.
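As an illustration of Chauvenet's criterion (assuming normality): a point is rejected when the expected number of observations at least as extreme, n * P(|Z| >= |z|), is below 0.5. A minimal sketch, not the package's implementation:

chauvenet_sketch <- function(y) {
  z <- abs(y - mean(y)) / sd(y)
  length(y) * 2 * pnorm(-z) < 0.5  # expected count of equally extreme points
}
y <- c(rnorm(100), 6)
which(chauvenet_sketch(y))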
A logical vector
Rob J Hyndman
Peirce, B. (1852). Criterion for the rejection of doubtful observations. The Astronomical Journal, 2(21), 161–163.
Chauvenet, W. (1863). Method of least squares. Appendix to Manual of Spherical and Practical Astronomy, Vol. 2, Lippincott, Philadelphia, pp. 469–566.
y <- rnorm(1000)
tibble(y = y) |> filter(peirce_anomalies(y))
tibble(y = y) |> filter(chauvenet_anomalies(y))
Test if observations are anomalies according to the stray algorithm.
stray_anomalies(y, ...)
y: A vector, matrix, or data frame consisting of numerical variables.
...: Other arguments are passed to stray::find_HDoutliers().
A logical vector indicating whether each observation is identified as an anomaly by the stray algorithm.
Rob J Hyndman
# Univariate data
y <- c(6, rnorm(49))
stray_anomalies(y)
# Bivariate data
y <- cbind(rnorm(50), c(5, rnorm(49)))
stray_anomalies(y)
Compute stray scores indicating how anomalous each observation is.
stray_scores(y, ...)
y: A vector, matrix, or data frame consisting of numerical variables.
...: Other arguments are passed to stray::find_HDoutliers().
Numerical vector containing stray scores.
Rob J Hyndman
# Univariate data
y <- c(6, rnorm(49))
scores <- stray_scores(y)
threshold <- stray::find_threshold(scores, alpha = 0.01, outtail = "max", p = 0.5, tn = 50)
which(scores > threshold)
Compute surprisals or surprisal probabilities from a model or a data set. A surprisal is given by s = -log f(y), where f is the density or probability mass function of the estimated or assumed distribution, and y is an observation. A surprisal probability is the probability of obtaining a surprisal at least as extreme as s.
The surprisal probabilities may be computed in three different ways:
1. Using the same distribution that was used to compute the surprisal values (approximation = "none"). Under this option, surprisal probabilities are equal to 1 minus the coverage probability of the largest HDR that contains each value. Surprisal probabilities smaller than 1e-6 are returned as 1e-6.
2. Using a Generalized Pareto distribution (GPD) fitted to the most extreme surprisal values (those with probability less than threshold_probability). This option is used if approximation = "gpd". For surprisal probabilities greater than threshold_probability, the value of threshold_probability is returned. Under this option, the distribution is used for computing the surprisal values but not for determining their probabilities. Due to extreme value theory, the resulting probabilities should be relatively insensitive to the distribution used in computing the surprisal values.
3. Empirically, as the proportion of observations with greater surprisal values. This option is used when approximation = "empirical". It is also insensitive to the distribution used in computing the surprisal values.
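To make the definition concrete, here is a minimal sketch of surprisals and their empirical probabilities under an assumed N(0, 1) distribution; this illustrates the formula, not the package's code:

y <- c(rnorm(99), 4)
s <- -log(dnorm(y))  # surprisal values under N(0, 1)
# Empirical surprisal probabilities: proportion at least as extreme
p_emp <- sapply(s, function(si) mean(s >= si))
head(cbind(y, s, p_emp)[order(-s), ])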
surprisals(
  object,
  probability = TRUE,
  approximation = c("none", "gpd", "empirical"),
  threshold_probability = 0.1,
  ...
)
## Default S3 method:
surprisals(
  object,
  probability = TRUE,
  approximation = c("none", "gpd", "empirical"),
  threshold_probability = 0.1,
  distribution = dist_kde(object, multiplier = 2, ...),
  loo = FALSE,
  ...
)
object: A model or numerical data set.
probability: Should surprisal probabilities be computed, or the surprisal values?
approximation: Character string specifying the approximation to use in computing the surprisal probabilities. Ignored if probability = FALSE.
threshold_probability: Probability threshold when computing the GPD approximation. This is the probability below which the GPD is fitted. Only used if approximation = "gpd".
...: Other arguments are passed to the appropriate method.
distribution: A distribution object. If not provided, a kernel density estimate is computed from the data.
loo: Should leave-one-out surprisals be computed?
If no distribution is provided, a kernel density estimate is computed. The leave-one-out (LOO) surprisals are obtained by computing the kernel density estimate at each observation using all other observations.
A numerical vector containing the surprisals or surprisal probabilities.
Rob J Hyndman
# Surprisals computed from a bivariate data set
oldfaithful |>
  filter(duration < 7000, waiting < 7000) |>
  mutate(
    loo_fscores = surprisals(cbind(duration, waiting), loo = TRUE)
  )
# Univariate data
tibble(
  y = c(5, rnorm(49)),
  p_kde = surprisals(y, loo = TRUE),
  p_normal = surprisals(y, distribution = dist_normal()),
  p_zscore = 2 * (1 - pnorm(abs(y)))
)
tibble(
  y = n01$v1,
  prob1 = surprisals(y, loo = TRUE),
  prob2 = surprisals(y, approximation = "gpd"),
  prob3 = surprisals(y, distribution = dist_normal()),
  prob4 = surprisals(y, distribution = dist_normal(), approximation = "gpd")
) |>
  arrange(prob1)
# Bivariate data
tibble(
  x = rnorm(50),
  y = c(5, rnorm(49)),
  prob = surprisals(cbind(x, y)),
  lookout = lookout_prob(cbind(x, y))
)
Surprisals computed from a model
## S3 method for class 'lm':
surprisals(
  object,
  probability = TRUE,
  approximation = c("none", "gpd", "empirical"),
  threshold_probability = 0.1,
  loo = FALSE,
  ...
)
## S3 method for class 'gam':
surprisals(
  object,
  probability = TRUE,
  approximation = c("none", "gpd", "empirical"),
  threshold_probability = 0.1,
  loo = FALSE,
  ...
)
object: A model object, such as one returned by lm() or mgcv::gam().
probability: Should surprisal probabilities be computed, or the surprisal values?
approximation: Character string specifying the approximation to use in computing the surprisal probabilities. Ignored if probability = FALSE.
threshold_probability: Probability threshold when computing the GPD approximation. This is the probability below which the GPD is fitted. Only used if approximation = "gpd".
loo: Should leave-one-out surprisals be computed?
...: Other arguments are ignored.
# Surprisals computed from a linear model
of <- oldfaithful |> filter(duration < 7200, waiting < 7200)
lm_of <- lm(waiting ~ duration, data = of)
of |>
  mutate(
    fscore = surprisals(lm_of),
    loo_fscore = surprisals(lm_of, loo = TRUE),
    # lookout_prob = lookout(surprisals = fscore, loo_scores = loo_fscore)
  ) |>
  ggplot(aes(
    x = duration, y = waiting,
    color = loo_fscore > quantile(loo_fscore, 0.99)
  )) +
  geom_point()
# Surprisals computed from a GAM
of <- oldfaithful |> filter(duration > 1, duration < 7200, waiting < 7200)
gam_of <- mgcv::gam(waiting ~ s(duration), data = of)
of |> mutate(fscore = surprisals(gam_of))
This function lists all the conflicts between packages in the weird collection and other packages that you have loaded.
weird_conflicts()
Some conflicts are deliberately ignored: intersect, union, setequal, and setdiff from dplyr; and intersect, union, setdiff, and as.difftime from lubridate. These functions make the base equivalents generic, so they shouldn't negatively affect any existing code.
A list object of class weird_conflicts.
weird_conflicts()
List all packages loaded by weird
weird_packages(include_self = FALSE)
include_self: Include weird in the list?
A character vector of package names.
weird_packages()