Fill missing values with a random forest model

`fill_gaps()` replaces all missing values in a target time series with values estimated by a random forest model whose hyperparameters are calibrated using a grid search approach and out-of-bag evaluation.

Usage

fill_gaps(
  df,
  colname_label,
  vctr_colname_feature = NULL,
  vctr_min_nodesize = c(5),
  vctr_m_try = NULL,
  vctr_subsample_gf = c(1),
  frac_train = 0.75,
  n_tree = 500,
  ran_seed = 12345,
  label_err = -9999
)

Arguments

df: A data frame containing the label (response variable) and feature (explanatory variable) time series used as model input. Columns may contain missing values.
colname_label: A character string giving the name of the column that holds the label time series.
vctr_colname_feature: A character vector giving the names of the feature time series columns used to construct the random forest model. If `NULL` (default), all columns of the input data frame except the label column specified by `colname_label` are used as feature columns.
vctr_min_nodesize: A vector of positive integers giving candidate values for the minimal node size hyperparameter of the random forest model (the minimum number of data points in each leaf node). Default is `c(5)`.
vctr_m_try: A vector of positive integers giving candidate values for the hyperparameter that sets the number of features considered when splitting each node. If `NULL` (default), the integers from two to the total number of feature variables are tested.
vctr_subsample_gf: A vector of numerical values between 0 and 1 giving candidate values for the hyperparameter that sets the fraction of training data points sampled when constructing the random forest. Default is `c(1)`.
frac_train: A numerical value between 0 and 1, defining the fraction of data points to be categorized as training data for the random forest model construction. The other data points are classified as test data. Default is 0.75.
n_tree: An integer representing the number of trees in the random forest. Default is 500.
ran_seed: An integer representing the random seed for the random forest model construction. Default is 12345.
label_err: A numeric value representing a missing value in the input vector(s). Default is -9999.
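
The three `vctr_*` arguments together define the grid of hyperparameter combinations that is searched. The sketch below is only an illustration in base R of what such a grid looks like for the default settings and four features; it is not the package's internal code, and the variable names `n_feature` and `grid` are hypothetical. It assumes that a `NULL` value of `vctr_m_try` expands to the integers from 2 to the number of features, as described above.

```r
# Illustrative only: build the candidate grid implied by the default arguments.
vctr_colname_feature <- c("sw_in", "vpd", "swc", "ta")
n_feature <- length(vctr_colname_feature)  # hypothetical helper variable

vctr_min_nodesize <- c(5)    # default
vctr_subsample_gf <- c(1)    # default
vctr_m_try <- 2:n_feature    # assumed expansion of the NULL default

# One row per hyperparameter combination to be evaluated
grid <- expand.grid(
  m_try = vctr_m_try,
  min_nodesize = vctr_min_nodesize,
  subsample = vctr_subsample_gf
)
print(grid)
```

With the defaults, only `m_try` varies, so the number of models to fit equals the number of `m_try` candidates. Adding candidates to the other two vectors multiplies the grid size accordingly.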

Value

A list with two elements. The first element, `mse`, is the mean squared error between predicted and original values in the test data set. The second element, `stats`, is a data frame with the following columns:

* The first column, `gapfilled`, gives the gap-filled time series, where missing values are replaced with the predicted values from the random forest model.

* The second column, `avg_predicted`, gives the ensemble mean time series calculated from estimated values at each time point for each tree in the constructed random forest.

* The third column, `sd_predicted`, gives the ensemble standard deviation time series calculated from estimated values at each time point for each tree in the constructed random forest.
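
How the three columns relate to each other can be sketched in base R. The per-tree predictions below are made-up numbers, the object `pred_by_tree` is hypothetical, and the use of the sample standard deviation (`sd()`) is an assumption; the package may compute these quantities differently.

```r
# Illustrative only: derive the three `stats` columns from per-tree predictions.
label_err <- -9999
label <- c(10.2, -9999, 11.0, -9999)   # observed series with two gaps

# Hypothetical predictions: rows are time points, columns are trees
pred_by_tree <- matrix(
  c(10.0, 10.4, 10.3,
    10.5, 10.7, 10.9,
    11.2, 10.8, 11.0,
     9.8, 10.0, 10.2),
  nrow = 4, byrow = TRUE
)

avg_predicted <- rowMeans(pred_by_tree)        # ensemble mean per time point
sd_predicted <- apply(pred_by_tree, 1, sd)     # ensemble spread per time point

# Keep observed values; use the ensemble mean only where the label is missing
gapfilled <- ifelse(label == label_err, avg_predicted, label)

stats <- data.frame(gapfilled, avg_predicted, sd_predicted)
print(stats)
```

In this sketch, observed values pass through unchanged in `gapfilled`, while `avg_predicted` and `sd_predicted` are available at every time point, including those that were not gaps.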

Details

A random forest model is constructed for the target time series and used to fill its missing values. The time series is assumed to be stationary, so a series with a trend should be detrended before it is passed to the function. Any column of the input data frame can be used as a feature. Hyperparameters are selected by grid search with out-of-bag evaluation, which is applied to a training dataset split off from the full input data. To reduce computational cost, the only hyperparameter varied in the grid search by default is the number of candidate features per split (`vctr_m_try`). The selected hyperparameters are then used to construct the final random forest model. The predicted time series is the mean of the individual tree outputs (500 trees by default) at each time point. Where the input target value is missing, the predicted value is used for imputation.
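
Because stationarity is assumed, a trended series needs to be detrended before gap-filling. The following is a minimal sketch of one possible approach, a linear detrend fitted on observed points only, using base R and synthetic data. The column names `time_idx` and `y` are hypothetical, and a linear trend is an assumption; other detrending methods may suit real data better.

```r
# Illustrative only: remove a linear trend while preserving the missing-value code.
label_err <- -9999
time_idx <- 1:10
noise <- c(0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0, -0.1)
y <- 0.5 * time_idx + noise        # synthetic series with a linear trend
y[c(4, 8)] <- label_err            # two gaps coded as -9999

df_trend <- data.frame(time_idx = time_idx, y = y)
is_gap <- df_trend$y == label_err

# Fit the trend on observed points only so the -9999 codes do not distort it
fit <- lm(y ~ time_idx, data = df_trend[!is_gap, ])
trend <- predict(fit, newdata = df_trend)

# Subtract the trend from observed values; leave gaps marked as missing
y_detrended <- ifelse(is_gap, label_err, df_trend$y - trend)
print(y_detrended)
```

The detrended column could then replace the original label column in `df`, and the fitted trend can be added back to the gap-filled output afterwards.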

Author

Yoshiaki Hata

Examples

## Load data
data(dt_noisy)
df_raw <- dt_noisy[c(13105:14112), ]

## Mark values of 9.5 or below as missing (-9999) to create data gaps
df_raw$dt <- ifelse(df_raw$dt > 9.5, df_raw$dt, -9999)

## Fill data gaps
result <-
  fill_gaps(df = df_raw, colname_label = "dt",
            vctr_colname_feature = c("sw_in", "vpd", "swc", "ta"))$stats
#> Random forest-based gap-filling started
#> --- Hyperparameter optimization using grid search started
#> --- MSE: Mean square error for out-of-bag data
#> --- Hyperparameter set: [m_try, min_nodesize, subsample]
#> --- MSE: 0.0676186047, Hyperparameter set: [2, 5, 1]
#> --- MSE: 0.06559005, Hyperparameter set: [3, 5, 1]
#> --- MSE: 0.0665401028, Hyperparameter set: [4, 5, 1]
#> --- Optimal hyperparameter set: [3, 5, 1]
#> --- Hyperparameter optimization using grid search finished
#> --- Random forest construction started
#> --- Random forest construction finished
#> Random forest-based gap-filling finished