Level-4.1: Gap-filling

Random forest

All fluxes were gap-filled using the class LongTermGapFillingRandomForestTS from diive.
This class uses a sliding three-year window to train annual random forest models, incorporating data from each target year and its two nearest neighbors.
For example: for gap-filling 2015, the model was trained on 2014, 2015 and 2016. For 2005 (the very first year for FC fluxes), the two closest years were used, i.e., the model was trained on 2005, 2006 and 2007. Likewise, for the very last year, the model was trained on data from the last year and the two preceding years.
Exception (2017, 2018, 2022): due to low flux availability for FN2O and FCH4 in 2017, 2018 and 2022, respective gap-filling models were built from all available, directly measured data (2012-2022) after quality checks and then used to gap-fill these three years. A completely new gap-filled flux version for the complete time range (2012-2022) was built. From that version, gap-filled data for the three years (2017, 2018, 2022) then replaced the data in the previously gap-filled version.
Exception (Feb/Mar 2012): The random forest model struggled to accurately predict N₂O and CH₄ fluxes following grassland restoration between 27 February 2012 8:00 and 16 March 2012 18:00. For this period, the flux values previously imputed by the random forest model were first removed, and the gaps were then filled with the classic MDS (marginal distribution sampling) method (Reichstein et al., 2005), using incoming shortwave radiation (SW_IN), soil temperature (TS), and soil water content (SWC) as predictor variables. This alternative gap-filling approach resulted in a more plausible representation of the flux dynamics.
Random forest settings used throughout: n_estimators: 500, random_state: 42, min_samples_split: 2, min_samples_leaf: 1. Default settings were used for all other parameters.
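The three-year training window described above can be sketched in a few lines. This is a minimal illustration of the year-selection logic only, not the code used inside diive; the function name training_years, its parameters, and the year range in the usage lines are assumptions made for this example.

```python
def training_years(target_year, first_year, last_year, window=3):
    """Return the years used to train the model for `target_year`.

    Hypothetical helper: the window is centered on the target year and
    shifted inward at the start and end of the record so that it always
    contains `window` years (assuming the record is long enough).
    """
    if not first_year <= target_year <= last_year:
        raise ValueError("target_year is outside the available record")
    # Never ask for more years than the record contains.
    window = min(window, last_year - first_year + 1)
    start = target_year - window // 2
    # Shift the window so it stays inside [first_year, last_year].
    start = max(first_year, min(start, last_year - window + 1))
    return list(range(start, start + window))


# Illustrative record from 2005 to 2022
print(training_years(2015, 2005, 2022))  # middle of the record
print(training_years(2005, 2005, 2022))  # very first year
print(training_years(2022, 2005, 2022))  # very last year
```

For a year in the middle of the record this returns the target year and its two neighbors; at either end of the record the window is shifted so that it still spans three years.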

Features (predictors)

We chose features based on their demonstrated predictive ability in previous studies. For temperature, for example, air temperature (TA) was selected for NEE predictions, while soil temperature (TS) was deemed more suitable for FCH4 and FN2O.
The variables used as features for training the random forest models are listed below.
In addition to these variables, timestamp information was included as features:

++ Added new columns with timestamp info: ['.YEAR', '.SEASON', '.MONTH', '.WEEK', '.DOY', '.HOUR', '.YEARMONTH', '.YEARDOY', '.YEARWEEK']
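As a rough illustration of what such timestamp columns contain, the sketch below derives several of them from a single timestamp using only the Python standard library. It is not diive's implementation: the function name, the use of ISO week numbers, and the integer encoding of the combined columns (.YEARMONTH, .YEARDOY) are assumptions, and .SEASON and .YEARWEEK are left out because their exact definition is not given here.

```python
from datetime import datetime


def timestamp_features(ts):
    """Derive simple calendar features from one timestamp (hypothetical helper)."""
    _, iso_week, _ = ts.isocalendar()  # ISO week number (assumed convention)
    doy = ts.timetuple().tm_yday       # day of year, 1-based
    return {
        ".YEAR": ts.year,
        ".MONTH": ts.month,
        ".WEEK": iso_week,
        ".DOY": doy,
        ".HOUR": ts.hour,
        # Combined columns encoded as integers (assumed encoding)
        ".YEARMONTH": ts.year * 100 + ts.month,
        ".YEARDOY": ts.year * 1000 + doy,
    }


features = timestamp_features(datetime(2012, 3, 16, 18, 0))
print(features)
```

Such columns give the model a way to pick up seasonal and diurnal patterns that are not fully explained by the measured drivers.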

Lagged variants of directly measured variables were included as features. The variants used were MEAN3H, step-lagged and TIMESINCE. For a description of the variants see Overview: Variants.
Feature importances were calculated as permutation importance, which assesses how much each feature contributes to a model’s performance. It works by randomly shuffling a feature’s values and observing the resulting drop in performance, which reveals how much the model relies on that feature. See scikit-learn’s documentation for the official description.
Feature reduction: Feature reduction was performed by comparing feature permutation importances to a random variable (.RANDOM), which consisted of random numbers (floats) between 0 and 1. Features with importance below .RANDOM’s were discarded. To ensure consistency across yearly models, a feature was only removed if it was deemed unimportant in all yearly models. A feature was retained if it was deemed important in at least one model.
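The selection rule described above can be sketched as follows. The function name, the data layout (one dictionary of permutation importances per yearly model), and the importance values are invented for illustration; how ties with .RANDOM are handled is also an assumption.

```python
def reduce_features(importances_by_year, random_name=".RANDOM"):
    """Split features into kept and dropped (hypothetical helper).

    A feature is kept if its importance reaches that of the random
    variable in at least one yearly model, and dropped only if it
    falls below the random variable in all yearly models.
    """
    all_features = set()
    kept = set()
    for importances in importances_by_year.values():
        threshold = importances[random_name]
        for name, value in importances.items():
            if name == random_name:
                continue
            all_features.add(name)
            # "Below .RANDOM" means unimportant; ties count as important here.
            if value >= threshold:
                kept.add(name)
    return sorted(kept), sorted(all_features - kept)


# Invented importance values for two yearly models
importances_by_year = {
    2015: {"SW_IN": 0.40, "TA": 0.10, ".HOUR": 0.001, ".RANDOM": 0.002},
    2016: {"SW_IN": 0.35, "TA": 0.001, ".HOUR": 0.001, ".RANDOM": 0.003},
}
kept, dropped = reduce_features(importances_by_year)
print(kept)
print(dropped)
```

In this toy example TA is kept although it falls below .RANDOM in 2016, because it is important in 2015, while .HOUR is dropped because it is below .RANDOM in both years.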

Important

By default, the random forest regressor in scikit-learn calculates impurity-based feature importances (also known as Gini importance). However, impurity-based feature importances can be misleading for high-cardinality features (many unique values), as noted in the scikit-learn documentation. In the models, and also during feature reduction, we used permutation importance instead.

Important

The amount of applied fertilizer nitrogen (N) was not used as a predictive feature. This decision stemmed from the difficulty in translating a single application event into a format suitable for machine learning. The substantial range of decay function values reported in the literature hindered the development of a consistent representation. We had no detailed information about possible decay rates over the long measurement campaign at the study site. A potential solution for future analyses involves carrying the total N amount forward as a repeated value until the next application occurs.
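The carry-forward idea mentioned as a potential solution could look roughly like the sketch below. It was not part of the analysis described here; the function name, the record-index layout, the value of 0.0 before the first application, and the N amounts are all assumptions made for illustration.

```python
def carry_forward_n(applications, n_records):
    """Repeat the last applied N amount until the next application.

    `applications` maps a record index to the N amount applied at that
    record (hypothetical layout). Records before the first application
    get 0.0, which is an assumption of this sketch.
    """
    series = []
    current = 0.0
    for i in range(n_records):
        if i in applications:
            current = applications[i]  # new application replaces the old value
        series.append(current)
    return series


# Invented example: 50 units of N at record 2, 30 units at record 5
print(carry_forward_n({2: 50.0, 5: 30.0}, 7))
```

The resulting column has a value for every record and could be used like any other feature, without having to commit to a specific decay function.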

NEE, LE, H

Features used for gap-filling fluxes.

MGMT_VARS = [
    "TIMESINCE_MGMT_FERT_MIN_FOOTPRINT",
    "TIMESINCE_MGMT_FERT_ORG_FOOTPRINT",
    "TIMESINCE_MGMT_GRAZING_FOOTPRINT",
    "TIMESINCE_MGMT_MOWING_FOOTPRINT",
    "TIMESINCE_MGMT_SOILCULTIVATION_FOOTPRINT",
    "TIMESINCE_MGMT_SOWING_FOOTPRINT",
    "TIMESINCE_MGMT_PESTICIDE_HERBICIDE_FOOTPRINT",
]

FEATURES = ["SW_IN_T1_2_1", "TA_T1_2_1", "VPD_T1_2_1"] + MGMT_VARS
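The TIMESINCE_ management features in the list above express how long ago a management event last took place. A minimal sketch of such a counter is shown below, assuming it counts records since the most recent event; the function name, the 0/1 flag input, and the use of None before the first event are assumptions for illustration and may differ from how diive builds these variables.

```python
def timesince(event_flags):
    """Count records since the most recent event (hypothetical helper).

    `event_flags` is a sequence of 0/1 values, with 1 marking a record in
    which a management event took place. Records before the first event
    have no defined value and are returned as None (assumption).
    """
    counts = []
    since = None
    for flag in event_flags:
        if flag:
            since = 0       # event in this record: reset the counter
        elif since is not None:
            since += 1      # one more record since the last event
        counts.append(since)
    return counts


# Invented flag series with events at records 1 and 4
print(timesince([0, 1, 0, 0, 1, 0]))
```

A counter of this kind lets the model relate flux responses to the time elapsed since, for example, the last mowing or fertilization.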

FN2O, FCH4