`remove_zscore_outlier()` detects and removes outlier values by converting an original time series into a Z-score time series using a moving window.

Usage

remove_zscore_outlier(
  vctr_time,
  vctr_target,
  vctr_time_prd_tail = NULL,
  wndw_size_z = 48 * 15,
  min_n_wndw_z = 5,
  thres_z = 5,
  n_calc_max = 10,
  modify_z = FALSE,
  vctr_time_zmod = NULL,
  wndw_size_conv = 48 * 15,
  inv_sigma_conv = 0.01,
  thres_ratio = 0.5,
  label_err = -9999
)

Arguments

vctr_time

A timestamp vector of class POSIXct or POSIXt.

vctr_target

A vector of a targeted time series to be checked. The length of the time series must be the same as that of `vctr_time`.

vctr_time_prd_tail

A timestamp vector of class POSIXct or POSIXt giving the end time of each sub-period. Do not include the final timestamp of the entire time series. For instance, to split the entire measurement period into three sub-periods, specify only the end timestamps of the first two sub-periods. Default is `NULL`.

wndw_size_z

A positive integer indicating the number of data points included in a moving window for the Z-score outlier removal. The default is 48 * 15, meaning that the window size is 15 days if the time interval of the input timestamp is 30 minutes.

min_n_wndw_z

A positive integer indicating the minimum number of data points required to calculate statistics within a moving window for the Z-score outlier removal. If a window contains fewer data points than this threshold, the statistics are not calculated for that window. Default is 5.

thres_z

A positive threshold value for the Z-score time series to define outliers. Default is 5.0. The data points with Z-scores (absolute values) above the threshold are considered outliers and removed.

n_calc_max

A positive integer indicating the maximum number of Z-score outlier detection iterations. Default is 10.

modify_z

A logical value. If `TRUE`, the short-term signal attenuation correction is applied to the Z-score time series; otherwise, the correction is not applied. Default is `FALSE`.

vctr_time_zmod

Only valid if `modify_z` is `TRUE`. A timestamp vector of class POSIXct or POSIXt, indicating the timings when the short-term signal attenuation correction is applied. Default is `NULL`.

wndw_size_conv

Only valid if `modify_z` is `TRUE`. A positive integer indicating the number of data points included in a moving window for the short-term signal attenuation detection. The default is 48 * 15, meaning that the window size is 15 days if the time interval of the input timestamp is 30 minutes.

inv_sigma_conv

Only valid if `modify_z` is `TRUE`. A positive value defining a Gaussian window width for the short-term signal attenuation detection. The width of the Gaussian window is inversely proportional to this parameter. Default is 0.01.

thres_ratio

Only valid if `modify_z` is `TRUE`. A positive threshold value of the ratio for determining whether the signal attenuation correction is applied to each detected attenuation period. The ratio represents the average of the standard deviation at the detected attenuation peak relative to that at the beginning and end of the attenuation period. If the ratio is below this threshold value, the correction is applied. Default is 0.5.

label_err

A numeric value representing a missing value in the input vector(s). Default is -9999.
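As an illustration of how these arguments might be prepared, the following minimal sketch builds a timestamp vector, a window size, and sub-period end times in base R. The 10-minute interval, the dates, and the variable names are hypothetical and chosen only for this example; whether a timestamp given in `vctr_time_prd_tail` must coincide exactly with an element of `vctr_time` is an assumption here.

```r
# Hypothetical 10-minute time series covering 30 days (illustrative values only)
interval_min <- 10
time <- seq(
  from = as.POSIXct("2020-01-01 00:00:00", tz = "UTC"),
  by = paste(interval_min, "min"),
  length.out = 30 * 24 * 60 / interval_min
)

# Number of data points in a 15-day window at this interval
# (the default 48 * 15 = 720 corresponds to 30-minute data)
pts_per_day <- 24 * 60 / interval_min
wndw_size <- pts_per_day * 15

# Three sub-periods: give only the end timestamps of the first two
prd_tail <- as.POSIXct(
  c("2020-01-10 23:50:00", "2020-01-20 23:50:00"),
  tz = "UTC"
)

# These objects could then be passed on, for example:
# remove_zscore_outlier(vctr_time = time, vctr_target = target,
#                       vctr_time_prd_tail = prd_tail, wndw_size_z = wndw_size)
```

The call is left commented out because `target` is not defined in this sketch.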

Value

A data frame with columns below:

* The first column, `time`, gives the same timestamp as `vctr_time`.

* The second column, `cleaned`, gives the cleaned time series after replacing the detected outliers with the value specified by `label_err`.

* The third column, `z_cleaned`, gives the Z-score of the input time series after removing the detected outliers.

* The fourth column, `avg_cleaned`, gives the moving window average of the input time series after removing the detected outliers.

* The fifth column, `sd_cleaned`, gives the moving window standard deviation of the input time series after removing the detected outliers.

* The sixth column, `flag_out`, gives a flag variable time series indicating the status of the cleaned time series (0: the input data point is not originally missing and not detected as an outlier; 1: the input data point is not originally missing but detected as an outlier; 2: the input data point is originally missing).
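The returned columns can be summarized with base R. The sketch below uses a small mock data frame that only mimics the documented `time`, `cleaned`, and `flag_out` columns; the values are made up and are not output of the function.

```r
# Mock result mimicking part of the documented column layout (made-up values)
label_err <- -9999
result <- data.frame(
  time = seq(as.POSIXct("2020-01-01 00:00:00", tz = "UTC"),
             by = "30 min", length.out = 6),
  cleaned = c(10.1, label_err, 10.3, label_err, 10.2, 10.4),
  flag_out = c(0, 1, 0, 2, 0, 0)
)

# Count data points per status (0: kept, 1: outlier, 2: originally missing)
n_flag <- table(factor(result$flag_out, levels = 0:2,
                       labels = c("kept", "outlier", "missing")))

# Replace the error label with NA before computing statistics
cleaned_na <- ifelse(result$cleaned == label_err, NA, result$cleaned)
mean_cleaned <- mean(cleaned_na, na.rm = TRUE)
```

Converting `label_err` to `NA` first keeps the placeholder value from contaminating means or plots.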

Details

The input time series is standardized using a moving window, and the data values are converted to Z-scores. In this step, the width of the moving window is set to 15 days by default, centered on the target time point, and standardization is performed individually for each time point in the time series. A threshold is set for the absolute value of the Z-score (default: 5, as specified by `thres_z`), and data points outside that range are removed as outliers. After the outliers have been removed, the Z-score is returned to the original value using the original mean and standard deviation time series, and standardization is performed again using a moving window to remove additional outliers. These procedures are repeated until either no more outliers are removed or the maximum number of iterations (default: 10, as specified by `n_calc_max`) is reached.
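The iteration described above can be sketched in a few lines of base R. This is a simplified illustration, not the package's implementation: the helper `zscore_clean_sketch()` and its argument names are hypothetical, missing values are represented by `NA` rather than `label_err`, and the original values are kept throughout instead of being reconstructed from Z-scores.

```r
# Minimal sketch of iterative moving-window Z-score outlier removal.
# Hypothetical helper; not the implementation used by remove_zscore_outlier().
zscore_clean_sketch <- function(x, wndw_size = 49, min_n = 5,
                                thres_z = 5, n_max = 10) {
  half <- wndw_size %/% 2
  n <- length(x)
  for (iter in seq_len(n_max)) {
    z <- rep(NA_real_, n)
    for (i in seq_len(n)) {
      # Window centered on point i, truncated at the series edges
      idx <- max(1, i - half):min(n, i + half)
      vals <- x[idx][!is.na(x[idx])]
      # Skip windows with too few valid points or zero spread
      if (length(vals) >= min_n) {
        s <- sd(vals)
        if (!is.na(s) && s > 0) z[i] <- (x[i] - mean(vals)) / s
      }
    }
    is_out <- !is.na(z) & abs(z) > thres_z
    if (!any(is_out)) break   # stop when no new outlier is found
    x[is_out] <- NA           # remove outliers, then re-standardize
  }
  x
}

# Smooth synthetic signal with one artificial spike
x <- sin(seq_len(200) / 10)
x[100] <- 50
cleaned <- zscore_clean_sketch(x)
```

Note that a single point in a window of N points can reach a Z-score of at most (N - 1) / sqrt(N) when the sample standard deviation is used, so a threshold of 5 can only be exceeded when the window holds at least 28 valid points. The sketch therefore uses a 49-point window.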

Users can define sub-periods across the entire time series using `vctr_time_prd_tail`, and the Z-score conversion is performed in each sub-period separately. This separated conversion is useful when the input time series suddenly changes its nature, such as after a sensor replacement.
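To illustrate why separate standardization helps, the sketch below assigns each data point to a sub-period and standardizes each sub-period on its own. It assumes that a data point at an end timestamp belongs to the sub-period it closes, and it uses whole-period statistics instead of a moving window for brevity; both are simplifications made for this example, and the data are made up.

```r
# Hypothetical series with a level shift, e.g. after a sensor replacement
time <- seq(as.POSIXct("2020-01-01 00:00:00", tz = "UTC"),
            by = "30 min", length.out = 8)
x <- c(1, 2, 3, 4, 101, 102, 103, 104)

# End timestamp of the first sub-period (the final timestamp is not given)
prd_tail <- as.POSIXct("2020-01-01 01:30:00", tz = "UTC")

# Sub-period index; left.open = TRUE puts a point at the end timestamp
# into the sub-period that it closes (an assumption of this sketch)
prd_id <- findInterval(as.numeric(time), as.numeric(prd_tail),
                       left.open = TRUE) + 1

# Standardize each sub-period separately
z <- ave(x, prd_id, FUN = function(v) (v - mean(v)) / sd(v))
```

Standardized separately, both sub-periods yield the same Z-scores despite the level shift, whereas a single standardization over the whole series would be dominated by the jump.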

In some cases, for sap flow measurements, the input dT (the temperature difference between sap flow probes) time series may yield a signal that is attenuated for only a short period, for example, when rainfall continues for days, causing the moving window mean (or standard deviation) to increase (or decrease). In such cases, standardization will cause the Z-score time series immediately before and after the rainfall to be unnaturally distorted, hindering the construction of the random forest model.

If `modify_z` is `TRUE`, after the outlier removal, this function modifies the Z-score time series for periods when the moving window average has an upward peak, and the moving window standard deviation has a downward peak simultaneously. The procedure is as follows:

* First, the average and standard deviation time series are interpolated if they contain missing values.

* Second, they are smoothed by convolution with a user-specified Gaussian window, defined by the parameters `wndw_size_conv` and `inv_sigma_conv`.

* Third, the first-order and second-order differences of both smoothed time series are calculated, which determine the upward peak positions of the average and the downward peak positions of the standard deviation.

* Fourth, possible signal attenuation periods are determined based on these peak positions. The start and end of the periods are defined by the timings when the first-order differenced standard deviation time series changes its sign before and after each peak.

* Fifth, the final attenuation periods are selected if the average of the ratio of the standard deviation at the detected attenuation peak to that at the beginning and end of the attenuation period is below the threshold value specified by `thres_ratio`. Optionally, users can specify the periods to be modified by `vctr_time_zmod`.

* Sixth, the average and standard deviation time series during the attenuation periods are deleted and linearly interpolated.

* Finally, the modified Z-score time series is calculated using these average and standard deviation time series.
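The smoothing, peak detection, ratio test, and interpolation steps can be sketched for the standard deviation series alone. Everything here is an assumption made for illustration: the synthetic series, the Gaussian weight formula `exp(-0.5 * (inv_sigma * k)^2)` (chosen only because it narrows as `inv_sigma` grows), the small window, and the use of first-order differences only. The function's actual kernel and its check against the moving window average may differ.

```r
# Synthetic moving-window SD series with a dip centered at index 61
n <- 121
sd_series <- 1 - 0.35 * (1 + cos(2 * pi * (seq_len(n) - 61) / 60))

# Gaussian window (assumed form: larger inv_sigma gives a narrower window)
wndw_size <- 21
inv_sigma <- 0.2
k <- seq_len(wndw_size) - (wndw_size + 1) / 2
w <- exp(-0.5 * (inv_sigma * k)^2)
sd_smooth <- as.numeric(stats::filter(sd_series, w / sum(w), sides = 2))

# Sign of the first-order difference (NA at the unsmoothed edges)
sgn <- sign(diff(sd_smooth))
lead <- sgn[-length(sgn)]
lag <- sgn[-1]

# Downward peak: sign turns from negative to positive
i_peak <- which(lead < 0 & lag > 0) + 1
# Any sign change marks a candidate start or end of the period
i_turn <- which(lead * lag < 0) + 1
i_start <- max(i_turn[i_turn < i_peak])
i_end <- min(i_turn[i_turn > i_peak])

# Average ratio of the SD at the peak to that at the start and end
ratio <- mean(c(sd_smooth[i_peak] / sd_smooth[i_start],
                sd_smooth[i_peak] / sd_smooth[i_end]))
apply_correction <- ratio < 0.5   # 0.5 mirrors the default thres_ratio

# Delete the attenuation period and interpolate linearly across it
sd_fixed <- sd_smooth
if (apply_correction) {
  sd_fixed[(i_start + 1):(i_end - 1)] <- NA
  sd_fixed <- approx(seq_along(sd_fixed), sd_fixed,
                     xout = seq_along(sd_fixed))$y
}
```

In this synthetic case the dip is bounded by two local maxima, so the interpolated standard deviation bridges the attenuation period at roughly the level seen before and after it.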

Author

Yoshiaki Hata

Examples

## Load data
data(dt_noisy)
time <- dt_noisy$time[12097:14400]
target <- dt_noisy$dt[12097:14400]

## Remove outliers
result <- remove_zscore_outlier(vctr_time = time, vctr_target = target)
#> Z-score outlier detection started
#> --- 8 Z-score outliers were detected
#> --- 1 Z-score outlier was detected
#> --- No Z-score outlier was detected
#> Z-score outlier detection finished