Title: | Time Series Clustering Utilities |
---|---|
Description: | A set of measures of dissimilarity between time series to perform time series clustering. Metrics based on raw data, on generating models and on the forecast behavior are implemented. Some additional utilities related to time series clustering are also provided, such as clustering algorithms and cluster evaluation metrics. |
Authors: | Pablo Montero Manso [cre], Jose Vilar Fernandez [aut] |
Maintainer: | Pablo Montero Manso <[email protected]> |
License: | GPL-2 |
Version: | 1.3.1 |
Built: | 2024-10-27 04:04:35 UTC |
Source: | https://github.com/pmontman/tsclust |
Computes the similarity between the true cluster solution and the one obtained with a method under evaluation.
cluster.evaluation(G, S)
G |
Integer vector with the labels of the true cluster solution. Each element of the vector specifies the cluster 'id' that the element belongs to. |
S |
Integer vector with the labels of the cluster solution to be evaluated. Each element of the vector specifies the cluster 'id' that the element belongs to. |
The measure of clustering evaluation is defined as

Sim(G, S) = (1/k) * sum_{i=1}^{k} max_{1 <= j <= k} Sim(G_i, S_j),

where

Sim(G_i, S_j) = 2 * |G_i intersect S_j| / ( |G_i| + |S_j| ),

with |.| denoting the cardinality of the elements in the set. This measure has been used for comparing different clusterings, e.g. in Kalpakis et al. (2001) and Pértega and Vilar (2010).
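A minimal sketch of this computation, assuming the per-cluster similarity Sim(G_i, S_j) = 2*|G_i intersect S_j| / (|G_i| + |S_j|) given above; cluster.evaluation is the reference implementation, and the helper sim.sketch below is only illustrative.

sim.sketch <- function(G, S) {
  #average, over the true clusters, of the best-matching candidate cluster
  mean(sapply(unique(G), function(g) {
    max(sapply(unique(S), function(s) {
      2 * sum(G == g & S == s) / (sum(G == g) + sum(S == s))
    }))
  }))
}
true_cluster <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
new_cluster  <- c(2, 1, 2, 3, 3, 2, 2, 1, 3, 3, 3, 3)
sim.sketch(true_cluster, new_cluster)
cluster.evaluation(true_cluster, new_cluster) #compare with the package function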
The computed index.
This index is not symmetric.
Pablo Montero Manso, José Antonio Vilar.
Larsen, B. and Aone, C. (1999) Fast and effective text mining using linear-time document clustering. Proc. KDD' 99.16–22.
Kalpakis, K., Gada D. and Puttagunta, V. (2001) Distance measures for effective clustering of arima time-series. Proceedings 2001 IEEE International Conference on Data Mining, 273–280.
Pértega S. and Vilar, J.A (2010) Comparing several parametric and nonparametric approaches to time series clustering: A simulation study. J. Classification, 27(3), 333-362.
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
cluster.stats, clValid, std.ext
#create a true cluster
#(first 4 elements belong to cluster '1', next 4 to cluster '2' and the last 4 to cluster '3')
true_cluster <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
#the cluster to be tested
new_cluster <- c(2, 1, 2, 3, 3, 2, 2, 1, 3, 3, 3, 3)
#get the index
cluster.evaluation(true_cluster, new_cluster)
#it can be seen that the index is not symmetric
cluster.evaluation(new_cluster, true_cluster)
Computes the dissimilarity matrix of the given numeric matrix, list, data.frame or mts object using the selected TSclust dissimilarity method.
diss(SERIES, METHOD, ...)
SERIES |
Numeric matrix, list, data.frame or mts object containing the time series (see Details). |
METHOD |
the dissimilarity measure to be used. This must be one of "ACF", "AR.LPC.CEPS", "AR.MAH", "AR.PIC", "CDM", "CID", "COR", "CORT", "DTWARP", "DWT", "EUCL", "FRECHET", "INT.PER", "NCD", "PACF", "PDC", "PER", "PRED", "MINDIST.SAX", "SPEC.LLR", "SPEC.GLK" or "SPEC.ISD". Any unambiguous substring can be given. See details for individual usage. |
... |
Additional arguments for the selected method. |
The SERIES argument can be a numeric matrix, with one row per series, a list object with one numeric vector per element, a data.frame, or an mts object.

Some methods can have additional arguments. See the individual help page for each dissimilarity method, detailed below.

Arguments that require one value per time series in SERIES must be provided as a vector, a matrix (in the case of a multivalued argument) or a list, as appropriate. In the case of a matrix, the values are conveyed row-wise. See the AR.LPC.CEPS example below.
"ACF" Autocorrelation-based method. See diss.ACF
.
"AR.LPC.CEPS" Linear Predictive Coding ARIMA method. This method has two value-per-series arguments, the ARIMA order, and the seasonality.See diss.AR.LPC.CEPS
.
"AR.MAH" Model-based ARMA method. See diss.AR.MAH
.
"AR.PIC" Model-based ARMA method. This method has a value-per-series argument, the ARIMA order. See diss.AR.PIC
.
"CDM" Compression-based dissimilarity method. See diss.CDM
.
"CID" Complexity-Invariant distance. See diss.CID
.
"COR" Correlation-based method. See diss.COR
.
"CORT" Temporal Correlation and Raw values method. See diss.CORT
.
"DTWARP" Dynamic Time Warping method. See diss.DTWARP
.
"DWT" Discrete wavelet transform method. See diss.DWT
.
"EUCL" Euclidean distance. See diss.EUCL
. For many more convetional distances, see link[stats]{dist}
, though you may need to transpose the dataset.
"FRECHET" Frechet distance. See diss.FRECHET
.
"INT.PER" Integrate Periodogram-based method. See diss.INT.PER
.
"NCD" Normalized Compression Distance. See diss.NCD
.
"PACF" Partial Autocorrelation-based method. See diss.PACF
.
"PDC" Permutation distribution divergence. Uses the pdc
package. See pdcDist
for
additional arguments and details. Note that series given by numeric matrices are interpreted row-wise and not column-wise, opposite as in pdcDist
.
"PER" Periodogram-based method. See diss.PER
.
"PRED" Prediction Density-based method. This method has two value-per-series agument, the logarithm and difference transform. See diss.PRED
.
"MINDIST.SAX" Distance that lower bounds the Euclidean, based on the Symbolic Aggregate approXimation measure. See diss.MINDIST.SAX
.
"SPEC.LLR" Spectral Density by Local-Linear Estimation method. See diss.SPEC.LLR
.
"SPEC.GLK" Log-Spectra Generalized Likelihood Ratio test method. See diss.SPEC.GLK
.
"SPEC.ISD" Intregated Squared Differences between Log-Spectras method. See diss.SPEC.ISD
.
dist |
A dist object with the computed dissimilarities between the series. |
Some methods produce additional output, see their respective documentation pages for more information.
Pablo Montero Manso, José Antonio Vilar.
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
data(electricity)
diss(electricity, METHOD="INT.PER", normalize=FALSE)

## Example of multivalued, one per series argument
## The AR.LPC.CEPS dissimilarity allows the specification of the ARIMA model for each series
## Create three sample time series and a mts object
x <- arima.sim(model=list(ar=c(0.4,-0.1)), n=100, n.start=100)
y <- arima.sim(model=list(ar=c(0.9)), n=100, n.start=100)
z <- arima.sim(model=list(ar=c(0.5, 0.2)), n=100, n.start=100)
seriests <- rbind(x,y,z)

## If we want to provide the ARIMA order for each series
## and use it with AR.LPC.CEPS, we create a matrix with the row-wise orders
orderx <- c(2,0,0)
ordery <- c(1,0,0)
orderz <- c(2,0,0)
orders = rbind(orderx, ordery, orderz)
diss( seriests, METHOD="AR.LPC.CEPS", k=30, order=orders )

##other examples
diss( seriests, METHOD="MINDIST.SAX", w=10, alpha=4 )
diss( seriests, METHOD="PDC" )
Computes the dissimilarity between two time series as the distance between their estimated simple (ACF) or partial (PACF) autocorrelation coefficients.
diss.ACF(x, y, p = NULL, omega=NULL, lag.max=50) diss.PACF(x, y, p = NULL, omega=NULL, lag.max=50)
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
p |
If not NULL, sets the weight for the geometric decaying of the autocorrelation coefficients, ranging from 0 to 1. |
lag.max |
Maximum number of simple or partial autocorrelation coefficients to be considered. |
omega |
If not NULL, completely specifies the weighting matrix for the autocorrelation coefficients. |
Performs the weighted Euclidean distance between the simple autocorrelation (diss.ACF) or partial autocorrelation (diss.PACF) coefficients.

If neither p nor omega is specified, uniform weighting is used. If p is specified, geometric weights decaying with the lag in the form w_i = p * (1 - p)^i are applied. If omega (a weighting matrix Omega) is specified, the distance is

d(x, y) = sqrt( (rho_x - rho_y)' Omega (rho_x - rho_y) ),

with rho_x and rho_y the respective (partial) autocorrelation coefficient vectors.
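A minimal sketch of the uniformly weighted case, assuming the coefficients are the first lag.max simple autocorrelations as returned by acf(); the internal estimator used by diss.ACF may differ slightly, so treat this as illustrative.

acf.dist.sketch <- function(x, y, lag.max = 50) {
  rho.x <- acf(x, lag.max = lag.max, plot = FALSE)$acf[-1] #drop the lag-0 coefficient
  rho.y <- acf(y, lag.max = lag.max, plot = FALSE)$acf[-1]
  sqrt(sum((rho.x - rho.y)^2)) #uniform weighting (Omega = identity)
}
x <- cumsum(rnorm(100))
y <- cumsum(rnorm(100))
acf.dist.sketch(x, y)
diss.ACF(x, y) #compare with the package function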
The computed distance.
Pablo Montero Manso, José Antonio Vilar.
Galeano, P. and Peña, D. (2000). Multivariate analysis in vector time series. Resenhas, 4 (4), 383–403.
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
## Create three sample time series
x <- cumsum(rnorm(100))
y <- cumsum(rnorm(100))
z <- sin(seq(0, pi, length.out=100))

## Compute the distance and check for coherent results
diss.PACF(x, y)
diss.ACF(x, z)
diss.PACF(y, z)

#create a dist object for its use with clustering functions like pam or hclust
diss( rbind(x,y,z), "ACF", p=0.05)
Computes the dissimilarity between two time series in terms of their Linear Predictive Coding (LPC) ARIMA processes.
diss.AR.LPC.CEPS(x, y, k = 50, order.x=NULL, order.y=NULL, seasonal.x=list(order=c(0, 0, 0), period=NA), seasonal.y=list(order=c(0, 0, 0), period=NA), permissive=TRUE)
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
k |
Number of cepstral coefficients to be considered. |
order.x |
Numeric matrix. Specifies the ARIMA models to be fitted for the series x. When using |
order.y |
Numeric matrix. Specifies the ARIMA model to be fitted for the series y. When using |
seasonal.x |
A list of |
seasonal.y |
A list of |
permissive |
Specifies whether to force an AR order of 1 if no order is found. Ignored if neither order.x nor order.y is NULL. |
If order.x or order.y is NULL, the corresponding series will be fitted automatically using an AR model.

order.x and order.y contain the three components of the ARIMA model: the AR order, the degree of differencing and the MA order, specified as in the function arima.

seasonal.x and seasonal.y are lists with two components, 'order' and 'period'; see the seasonal parameter of arima, except that specification using a numeric vector of length 3 is not allowed.

If using the diss function with the "AR.LPC.CEPS" method, the argument order must be used instead of order.x and order.y. order is a matrix with one row per series, specified as in arima. If order is NULL, automatic fitting imposing an AR model is performed. The argument seasonal is used instead of seasonal.x and seasonal.y. seasonal is a list with one element per series, in the same order that the series are input. Each element of seasonal must have the same format as the one in arima.
The computed distance.
Pablo Montero Manso, José Antonio Vilar.
Kalpakis, K., Gada D. and Puttagunta, V. (2001) Distance measures for effective clustering of arima time-series. Proceedings 2001 IEEE International Conference on Data Mining, 273–280.
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
diss.AR.PIC, diss.AR.MAH, diss
## Create three sample time series
x <- arima.sim(model=list(ar=c(0.4,-0.1)), n=100, n.start=100)
y <- arima.sim(model=list(ar=c(0.9)), n=100, n.start=100)
z <- arima.sim(model=list(ar=c(0.5, 0.2)), n=100, n.start=100)

## Compute the distance and check for coherent results
#impose an AR automatically selected for both series
diss.AR.LPC.CEPS(x, y, 25)
#impose an ARIMA(2,0,0) for series x and an AR automatically selected for z
diss.AR.LPC.CEPS(x, z, 25, order.x = c(2,0,0), order.y = NULL )
diss.AR.LPC.CEPS(y, z, 25)

#create a dist object for its use with clustering functions like pam or hclust
diss( rbind(x,y,z), METHOD="AR.LPC.CEPS", k=20,
      order=rbind(c(2,0,0), c(1,0,0), c(2,0,0)),
      seasonal=list( list(order=c(1,0,0), period=1),
                     list(order=c(2,0,0), period=3),
                     list(order=c(1,0,0), period=1)) )
Computes the dissimilarity between two time series by testing whether both series are or not generated by the same ARMA model.
diss.AR.MAH(x, y, dependence=FALSE, permissive=TRUE)
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
dependence |
Boolean for considering dependence between observations of the series at the same point in time. |
permissive |
Boolean for continuing with the method even if no valid order is selected by AIC. |
Assuming that the time series x and y belong to the class of invertible and stationary ARMA processes, this dissimilarity measure is based on checking the equality of their underlying ARMA models by following the testing procedures proposed by Maharaj (1996, 2000). The ARMA structures are approximated by truncated AR(infinity) models with a common order k = max(k_x, k_y), where k_x and k_y are determined by the AIC criterion. The AR coefficients are automatically fitted. The dissimilarity can be evaluated by using the value of the test statistic or, alternatively, the associated p-value. If dependence is FALSE, the dissimilarity measure is constructed by following the procedure introduced by Maharaj (1996), which is designed to compare independent time series. Otherwise, a more general testing procedure is used (Maharaj, 2000), which assumes that both series are correlated at the same points in time but uncorrelated across observations.
When the permissive argument is TRUE, if the automatic fitting of the AR order fails, the method shows a warning and then forces an AR of order 1. If permissive is FALSE, the method produces an error if no AR order is found by AIC.
statistic |
The statistic of the homogeneity test. |
p_value |
The p-value of the homogeneity test. |
Pablo Montero Manso, José Antonio Vilar.
Maharaj, E.A. (1996) A significance test for classifying ARMA models. J. Statist. Comput. Simulation, 54(4), 305–331.
Maharaj E.A. (2000) Clusters of time series. J. Classification, 17(2), 297–314.
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
## Create three sample time series
x <- arima.sim(model=list(ar=c(0.4,-0.1)), n=100, n.start=100)
y <- arima.sim(model=list(ar=c(0.9)), n=100, n.start=100)
z <- arima.sim(model=list(ar=c(0.5, 0.2)), n=100, n.start=100)

## Compute the distance and check for coherent results
diss.AR.MAH(x, y)
diss.AR.MAH(x, z)
diss.AR.MAH(y, z)

#create a dist object for its use with clustering functions like pam or hclust
diss( rbind(x,y,z), "AR.MAH")$statistic
Computes the distance between two time series as the Euclidean distance between the truncated AR operators approximating their ARMA structures.
diss.AR.PIC(x, y, order.x=NULL, order.y=NULL, permissive=TRUE)
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
order.x |
Specifies the ARIMA model to be fitted for the series x. When using |
order.y |
Specifies the ARIMA model to be fitted for the series y. When using |
permissive |
Specifies whether to force an AR order of 1 if no order is found. Ignored if neither order.x nor order.y is NULL. |
If order.x or order.y is NULL, the corresponding series will be fitted automatically using an AR model. If permissive is TRUE and no AR order is found automatically, an AR order of 1 will be imposed; if this also fails, no order can be found and the function produces an error.

order.x and order.y contain the three components of the ARIMA model: the AR order, the degree of differencing and the MA order, specified as in the function arima.

If using the diss function with the "AR.PIC" method, the argument order must be used instead of order.x and order.y. order is a matrix with one row per series, specified as in arima. If order is NULL, automatic fitting imposing an AR model is performed.
The computed distance.
Pablo Montero Manso, José Antonio Vilar.
Piccolo, D. (1990) A distance measure for classifying arima models. J. Time Series Anal., 11(2), 153–164.
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
diss.AR.MAH, diss.AR.LPC.CEPS, diss, arima, ar
## Create three sample time series
x <- arima.sim(model=list(ar=c(0.4,-0.1)), n=100, n.start=100)
y <- arima.sim(model=list(ar=c(0.9)), n=100, n.start=100)
z <- arima.sim(model=list(ar=c(0.5, 0.2)), n=100, n.start=100)

## Compute the distance and check for coherent results
#ARIMA(2,0,0) for x and ARIMA(1,0,0) for y
diss.AR.PIC( x, y, order.x = c(2,0,0), order.y = c(1,0,0) )
diss.AR.PIC( x, z, order.x = c(2,0,0), order.y = c(2,0,0) )
# AR for y (automatically selected) and ARIMA(2,0,0) for z
diss.AR.PIC( y, z, order.x=NULL, order.y=c(2,0,0) )

#create a dist object for its use with clustering functions like pam or hclust
diss( rbind(x,y,z), METHOD="AR.PIC", order=rbind(c(2,0,0), c(1,0,0), c(2,0,0)) )
Computes the dissimilarity based on the sizes of the compressed time series.
diss.CDM(x, y, type = "min")
diss.CDM(x, y, type = "min")
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
type |
Character string, the type of compression. May be abbreviated to a single letter, defaults to the first of the alternatives. |
The compression-based dissimilarity is calculated as

CDM(x, y) = C(xy) / ( C(x) + C(y) ),

where C(x) and C(y) are the sizes in bytes of the compressed series x and y, and C(xy) is the size in bytes of the series x and y concatenated. The algorithm used for compressing the series is chosen with type.

type can be "gzip", "bzip2" or "xz", see memCompress. "min" selects the best separately for x, y and the concatenation.

Since the compression methods are character-based, a symbolic representation can be used; see the examples for one using SAX as the symbolic representation.

The series are transformed to a text representation prior to compression using as.character, so small numeric differences may produce significantly different text representations.

While this dissimilarity is asymptotically symmetric, for short series the differences between diss.CDM(x,y) and diss.CDM(y,x) may be noticeable.
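A minimal sketch of the CDM formula above, computing the compressed sizes with memCompress on a text serialization of the series; the exact serialization used inside diss.CDM may differ, so the values are only indicative.

x <- rnorm(50)
y <- cumsum(rnorm(50))
Cx  <- length(memCompress(paste(as.character(x), collapse = " "), type = "gzip"))
Cy  <- length(memCompress(paste(as.character(y), collapse = " "), type = "gzip"))
Cxy <- length(memCompress(paste(as.character(c(x, y)), collapse = " "), type = "gzip"))
Cxy / (Cx + Cy)                #compression-based dissimilarity, by hand
diss.CDM(x, y, type = "gzip")  #compare with the package function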
The computed dissimilarity.
Pablo Montero Manso, José Antonio Vilar.
Keogh, E., Lonardi, S., & Ratanamahatana, C. A. (2004). Towards parameter-free data mining. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 206-215).
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
memCompress, diss, diss.NCD, PAA, convert.to.SAX.symbol
n = 50
#generate sample series, white noise and a wiener process
x <- rnorm(n)
y <- cumsum(rnorm(n))
diss.CDM(x, y)

z <- rnorm(n)
w <- cumsum(rnorm(n))
series = rbind(x, y, z, w)
diss(series, "CDM", type="bzip2")

################################################################
#####symbolic representation prior to compression, using SAX####
####simpler symbolization, such as round() could also be used###
################################################################
#normalization function, required for SAX
z.normalize = function(x) { (x - mean(x)) / sd(x) }
sx <- convert.to.SAX.symbol( z.normalize(x), alpha=4 )
sy <- convert.to.SAX.symbol( z.normalize(y), alpha=4 )
sz <- convert.to.SAX.symbol( z.normalize(z), alpha=4 )
sw <- convert.to.SAX.symbol( z.normalize(w), alpha=4 )
diss(rbind(sx, sy, sz, sw), "CDM", type="bzip2")
Computes the distance based on the Euclidean distance corrected by the complexity estimation of the series.
diss.CID(x, y)
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
This distance is defined as

CID(x, y) = ED(x, y) * CF(x, y),

where ED(x, y) is the Euclidean distance between x and y and CF(x, y) is a complexity correction factor defined as

CF(x, y) = max( CE(x), CE(y) ) / min( CE(x), CE(y) ),

with CE(x) a complexity estimate of a time series x. diss.CID therefore increases the distance between series with different complexities. If the series have the same complexity estimate, the distance degenerates to the Euclidean distance. The complexity estimate used in diss.CID is

CE(x) = sqrt( sum_{t=1}^{n-1} ( x_{t+1} - x_t )^2 ).
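A minimal sketch of this computation, assuming the sum-of-squared-differences complexity estimate CE given above; diss.CID is the reference implementation and the helper below is only for checking the idea.

CE <- function(x) sqrt(sum(diff(x)^2))          #complexity estimate
CID.sketch <- function(x, y) {
  ed <- sqrt(sum((x - y)^2))                    #Euclidean distance
  cf <- max(CE(x), CE(y)) / min(CE(x), CE(y))   #complexity correction factor
  ed * cf
}
x <- rnorm(100)
y <- cumsum(rnorm(100))
CID.sketch(x, y)
diss.CID(x, y) #compare with the package function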
The computed dissimilarity.
Pablo Montero Manso, José Antonio Vilar.
Batista, G. E., Wang, X., & Keogh, E. J. (2011). A Complexity-Invariant Distance Measure for Time Series. In SDM (Vol. 31, p. 32).
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
n = 100
#generate sample series, white noise and a wiener process
x <- rnorm(n)
y <- cumsum(rnorm(n))
diss.CID(x, y)

z <- rnorm(n)
w <- cumsum(rnorm(n))
series = rbind(x, y, z, w)
diss(series, "CID")
Computes dissimilarities based on the estimated Pearson's correlation of two given time series.
diss.COR(x, y, beta = NULL)
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
beta |
If not NULL, specifies the regulation of the convergence in the second method. |
Two different measures of dissimilarity between two time series based on the estimated Pearson's correlation can be computed.

If beta is not specified, the value

d(x, y) = sqrt( 2 * ( 1 - rho(x, y) ) )

is computed, where rho(x, y) denotes the Pearson's correlation between series x and y.

If beta is specified, the function

d(x, y) = sqrt( ( (1 - rho(x, y)) / (1 + rho(x, y)) )^beta )

is used, where beta is the beta argument.
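A minimal sketch of both variants, assuming the formulas given above; diss.COR is the reference implementation and the hand computations below only illustrate the idea.

x <- cumsum(rnorm(100))
y <- cumsum(rnorm(100))
rho <- cor(x, y)                   #Pearson's correlation
sqrt(2 * (1 - rho))                #variant without beta
diss.COR(x, y)                     #compare
beta <- 2
sqrt(((1 - rho) / (1 + rho))^beta) #variant with beta
diss.COR(x, y, beta = 2)           #compare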
The computed distance.
Pablo Montero Manso, José Antonio Vilar.
Golay, X., Kollias, S., Stoll, G., Meier, D., Valavanis, A., and Boesiger, P. (2005) A new correlation-based fuzzy logic clustering algorithm for FMRI. Magnetic Resonance in Medicine, 40.2, 249–260.
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
## Create three sample time series
x <- cumsum(rnorm(100))
y <- cumsum(rnorm(100))
z <- sin(seq(0, pi, length.out=100))

## Compute the distance and check for coherent results
diss.COR(x, y)
diss.COR(x, z)

#create a dist object for its use with clustering functions like pam or hclust
diss( rbind(x,y,z), "COR")
Computes an adaptive dissimilarity index between two time series that covers both dissimilarity on raw values and dissimilarity on temporal correlation behaviors.
diss.CORT(x, y, k = 2, deltamethod="Euclid")
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
k |
Parameter controlling the weight of the dissimilarity between dynamic behaviors (See Details). |
deltamethod |
Defines the method for the raw data discrepancy. Either |
The dissimilarity between time series x and y is given by

d(x, y) = Phi_k[ CORT(x, y) ] * delta(x, y),

where:

CORT(x, y) measures the proximity between the dynamic behaviors of x and y by means of the first order temporal correlation coefficient, defined by

CORT(x, y) = sum_t (x_{t+1} - x_t) * (y_{t+1} - y_t) / ( sqrt( sum_t (x_{t+1} - x_t)^2 ) * sqrt( sum_t (y_{t+1} - y_t)^2 ) ).

Phi_k(u) is an adaptive tuning function taking the form

Phi_k(u) = 2 / ( 1 + exp(k * u) ),

with k >= 0, so that both CORT(x, y) and k modulate the weight that CORT(x, y) has on d(x, y).

delta(x, y) denotes a dissimilarity measure between the raw values of series x and y, such as the Euclidean distance, the Frechet distance or the Dynamic Time Warping distance. Note that d(x, y) = delta(x, y) if k = 0.

More details of the procedure can be seen in Chouakria-Douzal and Nagabhushan (2007).

The raw-data discrepancy delta, selected with deltamethod, can be the Euclidean (deltamethod = "Euclid"), Frechet (deltamethod = "Frechet") or Dynamic Time Warping (deltamethod = "DTW") distance. When called from diss.CORT, DTW uses the Manhattan distance as the local distance.
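A minimal sketch of the building blocks above (CORT coefficient, tuning function, Euclidean delta), assuming the formulas as reconstructed; diss.CORT is the reference implementation.

CORT <- function(x, y) {
  dx <- diff(x); dy <- diff(y)
  sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2))) #first order temporal correlation
}
phi <- function(u, k) 2 / (1 + exp(k * u))           #adaptive tuning function
x <- cumsum(rnorm(100))
y <- cumsum(rnorm(100))
k <- 2
phi(CORT(x, y), k) * sqrt(sum((x - y)^2))            #Euclidean delta, by hand
diss.CORT(x, y, k = 2, deltamethod = "Euclid")       #compare with the package function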
The computed distance.
Pablo Montero Manso, José Antonio Vilar.
Chouakria-Douzal, A. and Nagabhushan P. N. (2007) Adaptive dissimilarity index for measuring time series proximity. Adv. Data Anal. Classif., 1(1), 5–21.
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
diss.COR, diss.DTWARP, diss.FRECHET, distFrechet, dtw
## Create three sample time series
x <- cumsum(rnorm(100))
y <- cumsum(rnorm(100))
z <- sin(seq(0, pi, length.out=100))

## Compute the distance and check for coherent results
diss.CORT(x, y, 2)
diss.CORT(x, z, 2)
diss.CORT(y, z, 2)

#create a dist object for its use with clustering functions like pam or hclust
diss( rbind(x,y,z), "CORT", k=3, deltamethod="DTW")
Computes Dynamic Time Warping distance between time series.
diss.DTWARP(x, y, ...)
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
... |
Additional parameters for the function. See |
This is basically a wrapper for the dtw function of the dtw package, intended for an easier discovery of the functionalities used in TSclust.
The computed distance.
diss, dtw
## Create two sample time series
x <- cumsum(rnorm(100))
y <- cumsum(rnorm(100))
diss.DTWARP(x, y)
Performs an unsupervised feature extraction using orthogonal wavelets on the series and returns the Euclidean distance between the wavelet approximations at an appropriate scale.
diss.DWT(series)
series |
Numeric matrix with the time series in rows, one series per row. |
This method differs from other dissimilarities in that the pairwise dissimilarities depend on the whole dataset given to diss.DWT; hence, there is no pairwise version of the function, and it only accepts whole datasets.
The set of original series is replaced by their wavelet approximation coefficients in an appropriate scale, and the dissimilarity between two series is computed as the Euclidean distance between these coefficients. The appropriate scale is automatically determined by using an algorithm addressed to obtain an efficient reduction of the dimensionality but preserving as much information from the original data as possible. The algorithm is introduced by Zhang, Ho, Zhang, and Lin (2006).
Returns an object of type dist with the pairwise distances.
Pablo Montero Manso, José Antonio Vilar.
Zhang, H., Ho, T. B., Zhang, Y., and Lin, M. (2006) Unsupervised feature extraction for time series clustering using orthogonal wavelet transform. INFORMATICA-LJUBLJANA-, 30(3), 305.
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
## Create three sample time series
x <- cumsum(rnorm(100))
y <- cumsum(rnorm(100))
z <- sin(seq(0, pi, length.out=100))

#compute the distance
diss.DWT(rbind(x, y, z))
Computes Euclidean distance between time series.
diss.EUCL(x, y)
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
The computed distance.
## Create two sample time series
x <- cumsum(rnorm(100))
y <- cumsum(rnorm(100))
diss.EUCL(x, y)
Computes the Frechet distance between time series.
diss.FRECHET(x, y, ...)
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
... |
Additional parameters for the function. See |
This is basically a wrapper for distFrechet of the longitudinalData package, intended for an easier discovery of the functionalities used in TSclust.
The computed distance.
diss, distFrechet
## Create two sample time series
x <- cumsum(rnorm(100))
y <- cumsum(rnorm(100))
diss.FRECHET(x, y)
Computes the dissimilarity between two time series in terms of the distance between their integrated periodograms.
diss.INT.PER(x, y, normalize=TRUE)
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
normalize |
If |
The distance is computed as

d(x, y) = integral of | F_x(lambda) - F_y(lambda) | d lambda,

where F_x(lambda_j) = sum_{i=1}^{j} I_x(lambda_i) / C_x and F_y(lambda_j) = sum_{i=1}^{j} I_y(lambda_i) / C_y, with C_x = sum_i I_x(lambda_i) and C_y = sum_i I_y(lambda_i) in the normalized version, and C_x = C_y = 1 in the non-normalized version. I_x and I_y denote the periodograms of x and y, respectively.
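A minimal sketch of the normalized version of this distance, assuming the periodogram is taken at the Fourier frequencies via spec.pgram(); diss.INT.PER may use a different periodogram estimator and integration grid, so the values are only indicative.

int.per.sketch <- function(x, y) {
  Ix <- spec.pgram(x, plot = FALSE)$spec
  Iy <- spec.pgram(y, plot = FALSE)$spec
  Fx <- cumsum(Ix) / sum(Ix) #normalized integrated periodogram of x
  Fy <- cumsum(Iy) / sum(Iy) #normalized integrated periodogram of y
  sum(abs(Fx - Fy))          #discrete approximation of the integral
}
x <- cumsum(rnorm(100))
y <- cumsum(rnorm(100))
int.per.sketch(x, y)
diss.INT.PER(x, y, normalize=TRUE) #compare with the package function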
The computed distance.
Pablo Montero Manso, José Antonio Vilar.
Casado de Lucas, D. (2010) Classification techniques for time series and functional data.
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
## Create three sample time series
x <- cumsum(rnorm(100))
y <- cumsum(rnorm(100))
z <- sin(seq(0, pi, length.out=100))

## Compute the distance and check for coherent results
diss.INT.PER(x, y, normalize=TRUE)
diss.INT.PER(x, z, normalize=TRUE)
diss.INT.PER(y, z, normalize=TRUE)

diss( rbind(x,y,z), "INT.PER", normalize=FALSE )
diss.MINDIST.SAX computes a dissimilarity that lower bounds the Euclidean distance on the discretized, dimensionality-reduced series. The function PAA produces the dimension reduction. The function convert.to.SAX.symbol produces the discretization.
diss.MINDIST.SAX(x, y, w, alpha=4, plot=FALSE) PAA(x, w) convert.to.SAX.symbol(x, alpha) MINDIST.SAX(x, y, alpha, n) SAX.plot(series, w, alpha, col.ser=rainbow(ncol(as.matrix(series))))
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
w |
The amount of equal sized frames that the series will be reduced to. |
alpha |
The size of the alphabet, the amount of symbols used to represents the values of the series. |
plot |
If |
n |
The original size of the series. |
series |
A |
col.ser |
Colors for the series. One per series. |
SAX is a symbolic representation of continuous time series.

w must be an integer, but it does not need to divide the length of the series. If w divides the length of the series, the diss.MINDIST.SAX plot uses this to show the size of the frames.

PAA performs the Piecewise Aggregate Approximation of the series, reducing it to w elements, called frames. Each frame is composed of consecutive observations of the original series, averaged. Observations are weighted when w does not divide n.
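A minimal sketch of PAA for the simple case in which w divides the series length: each frame is just the mean of length(x)/w consecutive observations. This simplifying assumption is only for illustration; PAA itself also handles the weighted case.

x <- rnorm(100)
w <- 20
frames <- matrix(x, nrow = length(x) / w) #one column per frame (column-major fill)
colMeans(frames)                          #frame averages, by hand
PAA(x, w)                                 #should match when w divides length(x)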
convert.to.SAX.symbol performs the SAX discretization: it discretizes the series x to an alphabet of size alpha; x should be z-normalized in this case. The distribution is divided in alpha equal probability parts; if an observation falls into the i-th part (starting from minus infinity), it is assigned the i-th symbol.

MINDIST.SAX calculates the MINDIST dissimilarity between symbolic representations.

diss.MINDIST.SAX combines the previous procedures to compute a dissimilarity between series. The series are z-normalized first. Then the dimensionality is reduced using PAA to produce series of length w. The series are discretized to an alphabet of size alpha using convert.to.SAX.symbol. Finally, the dissimilarity value is produced using MINDIST.SAX.

SAX.plot produces a plot of the SAX representation of the given series.
The computed dissimilarity.
Pablo Montero Manso, José Antonio Vilar.
Lin, J., Keogh, E., Lonardi, S. & Chiu, B. (2003) A Symbolic Representation of Time Series, with Implications for Streaming Algorithms. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.
Keogh, E., Chakrabarti, K., Pazzani, M., & Mehrotra, S. (2001). Dimensionality reduction for fast similarity search in large time series databases. Knowledge and information Systems, 3(3), 263-286.
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
set.seed(12349)
n = 100
#generate sample series, white noise and a wiener process
x <- rnorm(n)
y <- cumsum(rnorm(n))
w <- 20    #amount of equal-sized frames to divide the series, parameters for PAA
alpha <- 4 #size of the alphabet, parameter for SAX

#normalize
x <- (x - mean(x)) / sd(x)
y <- (y - mean(y)) / sd(y)

#generate PAA reductions
paax <- PAA(x, w)
paay <- PAA(y, w)

#plot an example of PAA reduction
plot(x, type="l", main="PAA reduction of series x")
p <- rep(paax, each=length(x)/length(paax)) #just for plotting the PAA
lines(p, col="red")
#repeat the example with y
plot(y, type="l", main="PAA reduction of series y")
py <- rep(paay, each=length(y)/length(paay))
lines(py, col="blue")

#convert to SAX representation
SAXx <- convert.to.SAX.symbol( paax, alpha)
SAXy <- convert.to.SAX.symbol( paay, alpha)

#CALC THE SAX DISTANCE
MINDIST.SAX(SAXx, SAXy, alpha, n)

#this whole process can be computed using diss.MINDIST.SAX
diss.MINDIST.SAX(x, y, w, alpha, plot=TRUE)

z <- rnorm(n)^2
diss(rbind(x,y,z), "MINDIST.SAX", w, alpha)
SAX.plot( as.ts(cbind(x,y,z)), w=w, alpha=alpha)
Computes the distance based on the sizes of the compressed time series.
diss.NCD(x, y, type = "min")
diss.NCD(x, y, type = "min")
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
type |
Character string, the type of compression. May be abbreviated to a single letter, defaults to the first of the alternatives. |
The normalized compression distance is calculated as

NCD(x, y) = ( C(xy) - min( C(x), C(y) ) ) / max( C(x), C(y) ),

where C(x) and C(y) are the sizes in bytes of the compressed series x and y, and C(xy) is the size in bytes of the series x and y concatenated. The algorithm used for compressing the series is chosen with type.

type can be "gzip", "bzip2" or "xz", see memCompress. "min" selects the best separately for x, y and the concatenation.

Since the compression methods are character-based, a symbolic representation can be used; see the examples for one using SAX as the symbolic representation.

The series are transformed to a text representation prior to compression using as.character, so small numeric differences may produce significantly different text representations.

While this dissimilarity is asymptotically symmetric, for short series the differences between diss.NCD(x,y) and diss.NCD(y,x) may be noticeable.
The computed distance.
Pablo Montero Manso, José Antonio Vilar.
Cilibrasi, R., & Vitányi, P. M. (2005). Clustering by compression. Information Theory, IEEE Transactions on, 51(4), 1523-1545.
Keogh, E., Lonardi, S., & Ratanamahatana, C. A. (2004). Towards parameter-free data mining. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 206-215).
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
n = 50
#generate sample series, white noise and a wiener process
x <- rnorm(n)
y <- cumsum(rnorm(n))
diss.NCD(x, y)

z <- rnorm(n)
w <- cumsum(rnorm(n))
series = rbind(x, y, z, w)
diss(series, "NCD", type="bzip2")

################################################################
#####symbolic representation prior to compression, using SAX####
####simpler symbolization, such as round() could also be used###
################################################################
#normalization function, required for SAX
z.normalize = function(x) { (x - mean(x)) / sd(x) }
sx <- convert.to.SAX.symbol( z.normalize(x), alpha=4 )
sy <- convert.to.SAX.symbol( z.normalize(y), alpha=4 )
sz <- convert.to.SAX.symbol( z.normalize(z), alpha=4 )
sw <- convert.to.SAX.symbol( z.normalize(w), alpha=4 )
diss(rbind(sx, sy, sz, sw), "NCD", type="bzip2")
Computes the Permutation Distribution distance between time series.
diss.PDC(x, y, ...)
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
... |
Additional parameters for the function. See |
This is basically a wrapper for pdcDist of the pdc package, intended for an easier discovery of the functionalities used in TSclust.
The computed distance.
diss, pdcDist
## Create two sample time series
x <- cumsum(rnorm(100))
y <- cumsum(rnorm(100))
diss.PDC(x, y)
Computes the distance between two time series based on their periodograms.
diss.PER(x, y, logarithm=FALSE, normalize=FALSE)
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
logarithm |
Boolean. If |
normalize |
Boolean. If |
Computes the Euclidean distance between the periodogram coefficients of the series x and y. Additional transformations can be performed on the coefficients depending on the values of logarithm and normalize.
The computed distance.
Pablo Montero Manso, José Antonio Vilar.
Caiado, J., Crato, N. and Peña, D. (2006) A periodogram-based metric for time series classification. Comput. Statist. Data Anal., 50(10), 2668–2684.
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
diss.INT.PER
## Create three sample time series
x <- cumsum(rnorm(100))
y <- cumsum(rnorm(100))
z <- sin(seq(0, pi, length.out=100))

## Compute the distance and check for coherent results
diss.PER(x, y)
diss.PER(x, z)
diss.PER(y, z)

diss.PER(x, y, TRUE, TRUE)
diss.PER(x, z, TRUE, TRUE)
diss.PER(y, z, TRUE, TRUE)

#create a dist object for its use with clustering functions like pam or hclust
diss( rbind(x,y,z), "PER", logarithm=TRUE, normalize=TRUE)
Computes the dissimilarity between two time series as the L1 distance between the kernel estimators of their forecast densities at a pre-specified horizon.
diss.PRED(x, y, h, B=500, logarithm.x=FALSE, logarithm.y=FALSE, differences.x=0, differences.y=0, plot=FALSE, models = NULL)
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
h |
The horizon of interest, i.e. the number of steps ahead where the prediction is evaluated. |
B |
The amount of bootstrap resamples. |
logarithm.x |
Boolean. Specifies whether to transform series x by taking logarithms or not. When using |
logarithm.y |
Boolean. Specifies whether to transform series y by taking logarithms or not. When using |
differences.x |
Specifies the amount of differences to apply to series x. When using |
differences.y |
Specifies the amount of differences to apply to series y. When using |
plot |
If |
models |
A list containing either |
The dissimilarity between the time series x and y is given by

d(x, y) = integral of | f_x(u) - f_y(u) | du,

where f_x and f_y are kernel density estimators of the forecast densities h steps ahead of x and y, respectively. The horizon of interest h is pre-specified by the user.
If models is specified, the given model for each series is used for obtaining the forecast densities. Currently, each element of the models list can be the string "ets", which will fit an ets model using the function ets in the forecast package. If the element of models is the string "arima", an ARIMA model fitted with auto.arima from the forecast package will be used. Finally, the elements of models can be a model fitted to the series with a method from the forecast package that can be simulated, see simulate.ets.
The kernel density estimators are based on B bootstrap replicates obtained by using a resampling procedure that mimics the generating processes, which are assumed to follow an arbitrary autoregressive structure (parametric or non-parametric). The procedure is completely detailed in Vilar et al. (2010). This function has high computational cost due to the bootstrapping procedure.
The procedure uses a bootstrap method that requires stationary time series. In order to support a wider range of time series, the method allows some transformations on the series before proceeding with the bootstrap resampling. These transformations are inverted before calculating the densities. The transformations allowed are the logarithm and differencing. The parameters logarithm.x, logarithm.y, differences.x and differences.y can be specified for this purpose.
If using the diss function with the "PRED" method, the argument logarithms must be used instead of logarithm.x and logarithm.y. logarithms is a boolean vector specifying whether the logarithm transform should be taken for each one of the series. The argument differences, a numeric vector specifying the amount of differences to apply to each series, is used instead of differences.x and differences.y. The plot is also different, showing all the densities in the same plot.
diss.PRED returns a list with the following components.
L1dist |
The computed distance. |
dens.x |
A 2-column matrix with the density of prediction of series x. |
dens.y |
A 2-column matrix with the density of prediction of series y. |
When used from the diss wrapper function, it returns a list with the following components.
dist |
A |
densities |
A list of 2-column matrices containing the densities of each series, in the same format as 'dens.x' or 'dens.y' of diss.PRED. |
José Antonio Vilar, Pablo Montero Manso.
Alonso, A.M., Berrendero, J.R., Hernandez, A. and Justel, A. (2006) Time series clustering based on forecast densities. Comput. Statist. Data Anal., 51,762–776.
Vilar, J.A., Alonso, A. M. and Vilar, J.M. (2010) Non-linear time series clustering based on non-parametric forecast densities. Comput. Statist. Data Anal., 54 (11), 2850–2865.
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
diss, auto.arima, ets, simulate.ets
x <- (rnorm(100))
x <- x + abs(min(x)) + 1 #shift to produce values greater than 0, for a correct logarithm transform
y <- (rnorm(100))
z <- sin(seq(0, pi, length.out=100))

## Compute the distance and check for coherent results
diss.PRED(x, y, h=6, logarithm.x=FALSE, logarithm.y=FALSE, differences.x=1, differences.y=0)

#create a dist object for its use with clustering functions like pam or hclust
diss( rbind(x,y,z), METHOD="PRED", h=3, B=200,
      logarithms=c(TRUE, FALSE, FALSE), differences=c(1,1,2) )

#test the forecast package predictions
diss.PRED(x, y, h=5, models = list("ets", "arima"))
The dissimilarity between two time series is computed by using an adaptation of the generalized likelihood ratio test to check the equality of two log-spectra.
diss.SPEC.GLK(x, y, plot=FALSE)
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
plot |
If |
The dissimilarity between two series x and y is measured in terms of the value of a test statistic for checking the equality of their log-spectra, m_x(lambda) and m_y(lambda) respectively. The test statistic is constructed by using the generalized likelihood ratio test criterion (Fan and Zhang, 2004). Specifically, the test statistic takes the form

d(x, y) = sum_k [ Z_k - mu_hat(lambda_k) - 2 * log( 1 + exp( Z_k - mu_hat(lambda_k) ) ) ] - sum_k [ Z_k - 2 * log( 1 + exp( Z_k ) ) ],

where I_x(lambda_k) and I_y(lambda_k) are the periodograms of x and y, Z_k = log( I_x(lambda_k) ) - log( I_y(lambda_k) ), and mu_hat(lambda_k) is the local maximum log-likelihood estimator of mu(lambda_k) = m_x(lambda_k) - m_y(lambda_k), computed by local linear fitting.
The computed distance.
Pablo Montero Manso, José Antonio Vilar.
Fan, J. and Zhang, W. (2004) Generalised likelihood ratio tests for spectral density. Biometrika, 195–209.
Pértega, S. and Vilar, J.A. (2010) Comparing several parametric and nonparametric approaches to time series clustering: A simulation study. J. Classification, 27(3), 333–362.
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
## Create three sample time series
x <- cumsum(rnorm(50))
y <- cumsum(rnorm(50))
z <- sin(seq(0, pi, length.out=50))

## Compute the distance and check for coherent results
diss.SPEC.GLK(x, y, plot=TRUE)

#create a dist object for its use with clustering functions like pam or hclust
diss( rbind(x,y,z), "SPEC.GLK" )
Computes the dissimilarity between two time series in terms of the integrated squared difference between non-parametric estimators of their log-spectra.
diss.SPEC.ISD(x, y, plot=FALSE, n=length(x))
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
plot |
If |
n |
The number of points to use for the linear interpolation. A value of n=0 uses numerical integration instead of linear interpolation. See details. |
The dissimilarity is computed as

d(x, y) = integral of ( m_hat_x(lambda) - m_hat_y(lambda) )^2 d lambda,

where m_hat_x and m_hat_y are local linear smoothers of the log-periodograms, obtained using the maximum local likelihood criterion.

By default, for performance reasons, the spectral densities are estimated by linear interpolation on n points. If n is 0, no linear interpolation is performed and integrate is used to calculate the integral, using as many points as integrate sees fit.
The computed distance.
Pablo Montero Manso, José Antonio Vilar.
Pértega, S. and Vilar, J.A. (2010) Comparing several parametric and nonparametric approaches to time series clustering: A simulation study. J. Classification, 27(3), 333–362.
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
## Create three sample time series
x <- cumsum(rnorm(50))
y <- cumsum(rnorm(50))
z <- sin(seq(0, pi, length.out=50))

## Compute the distance and check for coherent results
diss.SPEC.ISD(x, y, plot=TRUE)
diss.SPEC.ISD(x, y, plot=TRUE, n=0) #try integrate instead of interpolation

#create a dist object for its use with clustering functions like pam or hclust
diss( rbind(x,y,z), "SPEC.ISD" )
Computes a general dissimilarity measure based on the ratio of local linear spectral estimators.
diss.SPEC.LLR(x, y, alpha=0.5, method="DLS", plot=FALSE, n=length(x))
x |
Numeric vector containing the first of the two time series. |
y |
Numeric vector containing the second of the two time series. |
alpha |
Power for the ratio of densities in the Chernoff information measure. Between 0 and 1. |
method |
|
plot |
if |
n |
The number of points to use for the linear interpolation. A value of n=0 uses numerical integration instead of linear interpolation. See details. |
where:
and
are nonparametric approximations of spectral densities of
x
and y
respectively.
with
, so that
is a divergence function depending on
.
This dissimilarity measure corresponds to the limiting spectral approximation of the Chernoff information measure in the time domain (see Kakizawa et al., 1998). The spectral densities are approximated using local linear fitting by generalized least squares if method="DLS" or by maximum local likelihood if method="LK" (in this case, a higher computational cost is required).
By default, for performance reasons, the spectral densities are estimated by linear interpolation over n points. If n is 0, no linear interpolation is performed and integrate is used to calculate the integral, using as many points as integrate sees fit.
If the dissimilarity is to be calculated for more than two series, calling SPEC.LLR through the diss wrapper function is preferred, since it saves some computations.
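For intuition, the divergence W above can be written down directly. The sketch below is an illustration only: it applies W to the ratio of two crude Daniell-smoothed periodograms (not the local linear estimators the package uses) and integrates by a Riemann sum, so its value will differ from diss.SPEC.LLR.

#W(u) = log(alpha*u + (1 - alpha)) - alpha*log(u); W(1) = 0 and W(u) >= 0
W <- function(u, alpha = 0.5) log(alpha * u + (1 - alpha)) - alpha * log(u)

set.seed(1)
a <- arima.sim(list(ar = 0.6), n = 200)
b <- arima.sim(list(ar = -0.4), n = 200)

sa <- spec.pgram(a, spans = c(5, 5), plot = FALSE)   #crude spectral estimate of a
sb <- spec.pgram(b, spans = c(5, 5), plot = FALSE)   #crude spectral estimate of b

dlambda <- diff(sa$freq[1:2])                        #frequency grid step
sum(W(sa$spec / sb$spec, alpha = 0.5)) * dlambda     #approximate Chernoff-type divergence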
The computed distance.
Pablo Montero Manso, José Antonio Vilar.
Vilar, J.A. and Pértega, S. (2004) Discriminant and cluster analysis for gaussian stationary processes:
local linear fitting approach. J. Nonparametr. Stat., 16(3-4) 443–462.
Kakizawa, Y., Shumway, R. H. and Taniguchi M. (1998) Discrimination and clustering for multivariate time series. J. Amer. Statist. Assoc., 93(441), 328– 340.
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
diss.SPEC.GLK, diss.SPEC.ISD, integrate
## Create three sample time series
x <- cumsum(rnorm(50))
y <- cumsum(rnorm(50))
z <- sin(seq(0, pi, length.out=50))

## Compute the distance and check for coherent results
diss.SPEC.LLR(x, y, plot=TRUE)
diss.SPEC.LLR(x, z, n=0) #try integrate instead of interpolation
diss.SPEC.LLR(y, z, method="LK", n=0) #maximum likelihood with integration

#create a dist object for its use with clustering functions like pam or hclust
diss(rbind(x,y,z), METHOD="SPEC.LLR", method="DLS", alpha=0.5, n=50)
Partial realizations of time series of hourly electricity prices (Cent/kWh) in the Spanish market.
data(electricity)
A matrix with 365 observations of hourly electricity price readings.
The dataset consists of hourly electricity prices in the Spanish market during weekdays of the period December 31, 2007 - May 25, 2009.
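A minimal way to take a first look at the data is sketched below; the orientation assumed (one row per daily series, one column per hour) is an assumption for illustration and should be checked with dim() and dimnames().

data(electricity)
dim(electricity)   #check the orientation of the matrix

#plot a few daily price profiles, assuming one row per day and one column per
#hour (an assumption for illustration only)
matplot(t(electricity[1:5, ]), type = "l", lty = 1,
        xlab = "hour of the day", ylab = "price (Cent/kWh)")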
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
Partial realizations of time series of long-term interest rates (10-year bonds) for several countries.
data(interest.rates)
A ts object with 215 monthly observations of long-term interest rates (10-year bonds) for several countries, covering January 1995 to November 2012.
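As a minimal sketch of a typical preprocessing step before computing dissimilarities (the same log-difference transformation used in the package-level example), the monthly rates can be converted to log-differences column by column:

data(interest.rates)
#first differences of the logged series, one column per country
log.diff.rates <- diff(log(interest.rates))
dim(log.diff.rates)   #214 monthly log-differences per country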
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
Computes the leave-one-out one-nearest-neighbor cross-validation of an arbitrary distance matrix.
loo1nn.cv(d, G)
d |
A dist object with the pairwise dissimilarities between the series. |
G |
Integer vector with the labels of the true cluster solution. Each element of the vector specifies the cluster 'id' that the element belongs to. |
Computes the proportion of correctly classified series obtained by leave-one-out one-nearest-neighbor cross-validation on the given distance matrix. Distance ties are resolved by majority vote among the tied neighbors; a tie in the vote produces a warning and is resolved by selecting one of the candidate clusters at random.
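For intuition, the following is a minimal sketch of the leave-one-out 1-NN idea applied to a dist object; the helper name loo1nn_sketch is made up for illustration, and the sketch ignores the tie-handling described above, so loo1nn.cv should be preferred in practice.

#simplified sketch of leave-one-out 1-NN cross-validation (no tie handling)
loo1nn_sketch <- function(d, G) {
  D <- as.matrix(d)
  diag(D) <- Inf                 #exclude each series from its own neighborhood
  nn <- apply(D, 1, which.min)   #nearest neighbor of each series
  mean(G[nn] == G)               #proportion of correct 1-NN classifications
}

data(synthetic.tseries)
G <- rep(1:6, each = 3)
dACF <- diss(synthetic.tseries, "ACF")
loo1nn_sketch(dACF, G)           #compare with loo1nn.cv(dACF, G)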
The computed proportion.
Pablo Montero Manso, José Antonio Vilar.
cluster.evaluation, loo1nn, knn.cv
data(synthetic.tseries)

#create the ground truth cluster
G <- rep(1:6, each = 3)

#obtain candidate distance matrix (dist object)
dACF <- diss(synthetic.tseries, "ACF")

#calculate the cross-validation
loo1nn.cv(dACF, G)
Dataset formed by pairs of time series from different domains. Series were selected from the UCR Time Series Archive.
data(paired.tseries)
A mts
object with 36 series of length 1000.
Each pair of series in the dataset (Series 1 and 2, Series 3 and 4, etc.) comes from the same domain, so this pairing could constitute a possible ground truth solution.
abbreviate can be used on the colnames.
http://www.cs.ucr.edu/~eamonn/SIGKDD2004/All_datasets/
Keogh, E., Lonardi, S., & Ratanamahatana, C. A. (2004). Towards parameter-free data mining. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 206-215).
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
data(paired.tseries)

#Create the true solution, the pairs
true_cluster <- rep(1:18, each=2)

#test a dissimilarity metric and a clustering algorithm
intperdist <- diss( paired.tseries, "INT.PER") #create the distance matrix

#use hierarchical clustering and divide the tree into 18 clusters
intperclust <- cutree( hclust(intperdist), k=18 )

#use a cluster similarity index to rate the solution
cluster.evaluation( true_cluster, intperclust)

#### Another evaluation criterion used with this dataset consists in counting the
#### correct pairs formed during agglomerative hierarchical clustering (see references)
true_pairs <- -matrix(1:36, ncol=2, byrow=TRUE)
hcintper <- hclust(intperdist, "complete")

#count the correct pairs formed within the hierarchical clustering
sum( match(data.frame(t(true_pairs)), data.frame(t(hcintper$merge)), nomatch=0) > 0 ) / 18
Clustering algorithm based on p-values. Each group in the cluster solution is formed by series with associated p-values greater than a pre-specified level of significance.
pvalues.clust(pvalues, significance)
pvalues |
A dist object containing the p-values of the pairwise tests between series, such as those produced by diss.AR.MAH. |
significance |
The significance level. |
Each element (i, j) in pvalues corresponds to the p-value obtained from testing whether or not the i-th and j-th series come from the same generating model. The clustering algorithm only groups together those series whose associated p-values are greater than the pre-specified significance level. The algorithm was originally developed for use with the p-values obtained with diss.AR.MAH (see Maharaj, 2000), but it can be applied to any similar test.
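The grouping idea can be sketched as follows; this is only a rough, hypothetical illustration (a greedy assignment, not the exact algorithm of Maharaj (2000) implemented in pvalues.clust): place each series in the first existing group in which every member has a pairwise p-value above the significance level, otherwise open a new group.

#rough sketch of the grouping idea; pvalues.clust may give different groups
pclust_sketch <- function(pvalues, significance) {
  P <- as.matrix(pvalues)
  cl <- integer(nrow(P))
  for (i in seq_along(cl)) {
    assigned <- FALSE
    for (k in seq_len(max(cl))) {
      members <- which(cl == k)
      if (all(P[i, members] > significance)) {   #i is compatible with every member of group k
        cl[i] <- k
        assigned <- TRUE
        break
      }
    }
    if (!assigned) cl[i] <- max(cl) + 1          #start a new group
  }
  cl
}
#e.g. pclust_sketch(dd$p_value, 0.05) for a dist object of p-values dd$p_value,
#as in the example below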
An integer vector of length n, the number of observations, giving for each observation the number (id) of the cluster to which it belongs.
Pablo Montero Manso, José Antonio Vilar.
Maharaj E.A. (2000) Clusters of time series. J. Classification, 17(2), 297–314.
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
## Create three sample time series
x <- cumsum(rnorm(100))
y <- cumsum(rnorm(100))
z <- sin(seq(0, pi, length.out=100))

## Compute the p-values of the AR.MAH test and cluster at the 0.05 level
dd <- diss( rbind(x,y,z), "AR.MAH")
pvalues.clust( dd$p_value, 0.05 )
This dataset features three repetitions of several models of time series.
data(synthetic.tseries)
The dataset is a mts
object, formed by several repetitions of each of the following models.
M1 | AR |
M2 | Bilinear |
M3 | EXPAR |
M4 | SETAR |
M5 | NLAR |
M6 | STAR |
The exact model equations are given in the reference.
Three simulations of each model are included. This dataset can be used for comparing the performance of different dissimilarity measures between time series or clustering algorithms.
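For intuition about the model families listed above, the sketch below simulates one series from an AR(1) model and one from a SETAR-type model; the coefficients are arbitrary illustrations, not the ones used to generate the dataset (the exact equations are given in the reference).

#arbitrary illustrative coefficients -- NOT the dataset's generating equations
set.seed(123)
n <- 200
eps <- rnorm(n)

#AR(1)-type series
ar1 <- filter(eps, filter = 0.6, method = "recursive")

#SETAR-type series: the autoregressive coefficient switches with the sign of the
#previous value (threshold at 0)
setar <- numeric(n)
for (t in 2:n) {
  phi <- if (setar[t - 1] >= 0) 0.5 else -0.5
  setar[t] <- phi * setar[t - 1] + eps[t]
}

plot.ts(cbind(AR = as.numeric(ar1), SETAR = setar))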
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
data(synthetic.tseries)

#Create the true solution; for this dataset, there are three series of each model
true_cluster <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6)

#test a dissimilarity metric and a clustering algorithm
intperdist <- diss( synthetic.tseries, "INT.PER") #create the distance matrix

#use hierarchical clustering and divide the tree into 6 clusters
intperclust <- cutree( hclust(intperdist), 6 )

#use a cluster similarity index to rate the solution
cluster.evaluation( true_cluster, intperclust)

#test another dissimilarity metric and clustering algorithm
acfdist <- diss( synthetic.tseries, "ACF", p=0.05)
acfcluster <- pam( acfdist, 6 )$clustering #use pam clustering to form 6 clusters
cluster.evaluation( true_cluster, acfcluster)

#test another dissimilarity metric and clustering algorithm
chernoffdist <- diss( synthetic.tseries, "SPEC.LLR")
chernoffclust <- pam( chernoffdist, 6 )$clustering
cluster.evaluation( true_cluster, chernoffclust)
This package contains several measures of dissimilarity between time series, example time series datasets, specific clustering algorithms, and dimension reduction algorithms.
Dissimilarity functions begin with diss.*, and the wrapper function diss computes whole dissimilarity matrices. Cluster evaluation methods include cluster.evaluation and loo1nn.cv. A clustering algorithm based on pairwise p-values is implemented in pvalues.clust. The package is intended to be used along with existing clustering functions and packages, such as hclust or the cluster package.
Pablo Montero Manso, José Antonio Vilar.
Montero, P and Vilar, J.A. (2014) TSclust: An R Package for Time Series Clustering. Journal of Statistical Software, 62(1), 1-43. http://www.jstatsoft.org/v62/i01/.
#the available dissimilarities can be found in the diss help page (?diss)
#and their individual pages from there.

### The most common use case begins with a set of time series we want to cluster.
### This package includes several example datasets.
data(interest.rates)

###transformation of the interest rates
trans.inter.rates <- log(interest.rates[2:215,]) - log(interest.rates[1:214,])

##use the diss wrapper function to create the dist object,
#applying the ACF dissimilarity with geometric decay to each pair of time series
tsdist <- diss( t(trans.inter.rates), "ACF", p=0.05)
names(tsdist) <- colnames(interest.rates)

#perform hierarchical clustering on the dist object
hc <- hclust(tsdist)

#show the results
plot(hc)

#cluster using the p-values of the AR.MAH test
mahdist <- diss( t(trans.inter.rates), "AR.MAH", p=0.05)$p_value
pvalues.clust(mahdist, 0.05)