Level-4.1: Gap-filling#
Description#
General approach#
Gap-filling for all fluxes was performed using the random forest (RF) class LongTermGapFillingRandomForestTS implemented in diive, which uses a sliding three-year window to train one model per year. The class is based on the random forest implementation in the scikit-learn (v1.15) Python library. Model features were chosen based on their demonstrated predictive ability in previous studies. Generally, predictor variables included meteorological data, management information, timestamp information, and lagged variants. Feature reduction used permutation importance and a comparison against a random variable (random numbers between 0 and 1). Generally used random forest hyperparameters: n_estimators: 500, random_state: 42 (fixed for reproducibility), min_samples_split: 2, min_samples_leaf: 1; all other hyperparameters were left at their default values.
Exceptions#
Exception (2012): The random forest model struggled to accurately predict FN2O and FCH4 following grassland restoration between 27 February 2012 08:00 and 16 March 2012 18:00. Consequently, these flux values were gap-filled using the classic MDS method (Reichstein et al., 2005) with incoming shortwave radiation (SWIN), soil temperature (TS), and soil water content (SWC) as predictor variables, instead of relying on the random forest predictions. To do so, the flux values previously imputed by the random forest model for this period were removed first, and the MDS method was then applied. This alternative gap-filling approach resulted in a more plausible representation of the flux dynamics.
Exception (2017, 2018, 2022): Due to low flux availability for FN2O and FCH4 in 2017, 2018 and 2022, the respective gap-filling models were built from all available, directly measured data (2012-2022) after quality checks and then used to gap-fill these three years. For this, a completely new gap-filled flux version was built for the complete time range (2012-2022). From that version, the gap-filled data for the three years (2017, 2018, 2022) then replaced the data in the previously gap-filled version.
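The replacement step described for 2017, 2018 and 2022 can be sketched as follows. This is a minimal illustration, not the code used for the dataset: the function name `replace_years`, the dictionary-based time series and all flux values are hypothetical.

```python
from datetime import datetime


def replace_years(primary, alternative, years):
    """Return a copy of `primary` in which all records from `years`
    are taken from `alternative` (same timestamps assumed in both)."""
    return {
        ts: (alternative[ts] if ts.year in years else value)
        for ts, value in primary.items()
    }


# Made-up example values, for illustration only
sliding_window_version = {
    datetime(2016, 6, 1, 12, 0): 0.10,
    datetime(2017, 6, 1, 12, 0): 0.20,
    datetime(2022, 6, 1, 12, 0): 0.30,
}
all_data_version = {
    datetime(2016, 6, 1, 12, 0): 0.11,
    datetime(2017, 6, 1, 12, 0): 0.22,
    datetime(2022, 6, 1, 12, 0): 0.33,
}

final_version = replace_years(
    sliding_window_version, all_data_version, years={2017, 2018, 2022}
)
```

Records outside the three listed years keep their values from the sliding-window version.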
Model features#
NEE, LE and H: Model features for random forest gap-filling of NEE, LE and H included shortwave incoming radiation (SWIN, at 2 m height), air temperature (TA, at 2 m height), vapor pressure deficit (VPD, calculated from relative humidity and TA at 2 m height) and management information. Each variable was included (1) as its measured (gap-filled) time series and (2) as a time series lagged by one record, pairing each data record with the value of the preceding record. Additionally, PREC was included as a TIMESINCE variable, counting the time (as number of records) since the last precipitation event. Management information for the two grassland parcels (PARCEL-A and PARCEL-B) was included in the form of TIMESINCE variables, which counted the number of records since the most recent occurrence of a specific management event, putting each record in temporal relation to past events on the respective parcel. These events included applications of mineral fertilizer, organic fertilizer, grazing, mowing, soil cultivation, sowing, and pesticide/herbicide applications. Management TIMESINCE variables were then combined with wind direction to create FOOTPRINT variables. Depending on the wind’s origin at the sensors, these variables linked each data record to information from either the PARCEL-A or the PARCEL-B variables, ensuring that the source parcel of the air was accounted for. Timestamp information was included as additional features: YEAR (integer, e.g., 2021), SEASON (integer, e.g., 2 for the summer months June, July and August), MONTH (integer, e.g., 7 for July), WEEK (integer, week of year, e.g., 52), DOY (integer, day of year), HOUR (integer between 0 and 23), YEARMONTH (string, year and month, e.g., 2021-07 for July 2021), YEARDOY (string, year and day of year, e.g., 2021-034), YEARWEEK (string, year and week of year, e.g., 2021-19 for week 19 in 2021).
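The one-record lag, the TIMESINCE counter and the FOOTPRINT selection can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the diive implementation: the function names are hypothetical, records before the first event are marked as None, and the wind sector assigned to PARCEL-A (0° to 180°) is an arbitrary placeholder.

```python
def lag_by_one(values):
    """Pair each record with the value of the preceding record."""
    return ([None] + list(values))[:len(values)]


def timesince(event_flags):
    """Count records since the most recent event.
    Assumption: None before the first event has occurred."""
    counts, current = [], None
    for flag in event_flags:
        if flag:
            current = 0
        elif current is not None:
            current += 1
        counts.append(current)
    return counts


def footprint(wind_dir, parcel_a, parcel_b, a_sector=(0.0, 180.0)):
    """Use the PARCEL-A value when the wind comes from `a_sector`
    (hypothetical sector limits), otherwise the PARCEL-B value."""
    lo, hi = a_sector
    return [a if lo <= wd < hi else b
            for wd, a, b in zip(wind_dir, parcel_a, parcel_b)]


# 1 marks a record with a mowing event on the respective parcel
mowing_a = timesince([0, 1, 0, 0, 0, 0])
mowing_b = timesince([1, 0, 0, 1, 0, 0])
wind_dir = [90, 100, 270, 280, 95, 300]  # degrees
mowing_footprint = footprint(wind_dir, mowing_a, mowing_b)
```

Each record of `mowing_footprint` carries the time since mowing of whichever parcel the wind came from at that record.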
FN2O, FCH4: Model features for gap-filling FN2O and FCH4 included soil temperature (TS), soil water content (SWC), precipitation (PREC) and management information. For TS, measurements at 4, 15, and 40 cm depth were used, together with SWC at 15 cm depth and PREC at 50 cm height. Each variable was included as (1) its measured (gap-filled) time series, (2) a 3-hour preceding mean, and (3) step-lagged variants representing 3, 6, 9, and 12-hour lagged means, as described by Feigenwinter et al. (2023). PREC as a TIMESINCE variable, management information and timestamp information were included in the same way as for NEE, LE and H.
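The 3-hour mean and its step-lagged variants can be sketched as follows. The sketch assumes 30-minute records, so that 3 hours correspond to 6 records and lags of 6, 12, 18 and 24 records correspond to 3, 6, 9 and 12 hours. It also assumes that the averaging window ends with the current record; the helper names are hypothetical and this is not the diive code.

```python
def rolling_mean(values, window=6):
    """Mean over the `window` most recent records, ending with the current one.
    Records without a complete window are set to None."""
    means = []
    for i in range(len(values)):
        if i + 1 < window:
            means.append(None)
        else:
            chunk = values[i + 1 - window:i + 1]
            means.append(sum(chunk) / window)
    return means


def lag(values, n_records):
    """Shift a series back in time by `n_records` records."""
    return ([None] * n_records + list(values))[:len(values)]


# Example: 30 half-hourly soil temperature records (made-up values)
ts = [float(i) for i in range(30)]

mean3h = rolling_mean(ts, window=6)
step_lagged = {f"MEAN3H-{n}": lag(mean3h, n) for n in (6, 12, 18, 24)}
```

With these assumptions, the record at index 29 is paired with the 3-hour means that ended 3, 6, 9 and 12 hours earlier.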
Feature importances were calculated as permutation importance, which assesses how much each feature contributes to a model’s performance. It works by randomly shuffling the values of one feature and observing the resulting drop in performance.
Feature reduction was performed by comparing feature permutation importances to a random variable, which consisted of random numbers (floats) between 0 and 1. Features with importance below the importance of the random variable were discarded. To ensure consistency across yearly models, a feature was only removed if it was deemed unimportant in all yearly models. A feature was retained if it was deemed important in at least one model.
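The rule for combining the yearly models can be sketched as follows: a feature is kept if its importance is not below that of the random variable in at least one yearly model. The function name, the dictionary layout and all importance values are made up for illustration.

```python
def reduce_features(importances_by_year, random_name=".RANDOM"):
    """Keep every feature that is at least as important as the random
    variable in at least one yearly model."""
    keep = set()
    for importances in importances_by_year.values():
        threshold = importances[random_name]
        keep |= {
            name for name, value in importances.items()
            if name != random_name and value >= threshold
        }
    return sorted(keep)


# Made-up permutation importances for two yearly models
importances_by_year = {
    2015: {"TS": 0.40, "SWC": 0.01, ".HOUR": 0.00, ".RANDOM": 0.02},
    2016: {"TS": 0.35, "SWC": 0.05, ".HOUR": 0.01, ".RANDOM": 0.03},
}

accepted = reduce_features(importances_by_year)
```

Here `SWC` is below the random variable in 2015 but above it in 2016, so it is retained; `.HOUR` is below the random variable in both years and is removed.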
Details#
Random forest#
All fluxes were gap-filled using the class LongTermGapFillingRandomForestTS from diive. This class uses a sliding three-year window to train annual random forest models, incorporating data from each target year and its two nearest neighbors.
For example: for gap-filling 2015, the model was trained on 2014, 2015 and 2016. For 2005 (the very first year for FC fluxes), the two closest years were used, i.e., the model was trained on 2005, 2006 and 2007. Likewise, for the very last year, the model was trained on data from the last year and the two preceding years.
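The selection of training years described above can be sketched as a small helper. This is an illustration of the windowing logic only, assuming a continuous range of years; the function name is hypothetical and the code is not taken from diive.

```python
def training_years(target_year, available_years):
    """Return the target year and its two nearest neighboring years.
    At the edges of the record the window is shifted inward."""
    years = sorted(available_years)
    if len(years) <= 3:
        return years
    i = years.index(target_year)
    # Center the window on the target year, but keep it inside the record
    start = min(max(i - 1, 0), len(years) - 3)
    return years[start:start + 3]


all_years = range(2005, 2023)
middle = training_years(2015, all_years)  # [2014, 2015, 2016]
first = training_years(2005, all_years)   # [2005, 2006, 2007]
last = training_years(2022, all_years)    # [2020, 2021, 2022]
```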
Exception (2017, 2018, 2022): Due to low flux availability for FN2O and FCH4 in 2017, 2018 and 2022, the respective gap-filling models were built from all available, directly measured data (2012-2022) after quality checks and then used to gap-fill these three years. For this, a completely new gap-filled flux version was built for the complete time range (2012-2022). From that version, the gap-filled data for the three years (2017, 2018, 2022) then replaced the data in the previously gap-filled version.
Exception (Feb/Mar 2012): The random forest model struggled to accurately predict N₂O and CH₄ fluxes following grassland restoration between 27 February 2012 08:00 and 16 March 2012 18:00. Consequently, these flux values were gap-filled using the classic MDS method (Reichstein et al., 2005) with incoming shortwave radiation (SW_IN), soil temperature (TS), and soil water content (SWC) as predictor variables, instead of relying on the random forest predictions. To do so, the flux values previously imputed by the random forest model for this period were removed first, and the MDS method was then applied. This alternative gap-filling approach resulted in a more plausible representation of the flux dynamics.
Generally used random forest settings: n_estimators: 500, random_state: 42, min_samples_split: 2, min_samples_leaf: 1. Default settings were used for all other parameters; see the scikit-learn documentation for the random forest regressor.
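Expressed in code, the settings listed above could look like the sketch below. The dictionary and the helper `build_regressor` are hypothetical names; `RandomForestRegressor` is the scikit-learn estimator, and how diive passes these settings on internally is not shown here.

```python
# Settings as listed above; all other parameters keep their defaults
RF_SETTINGS = {
    "n_estimators": 500,     # number of trees in the forest
    "random_state": 42,      # fixed seed for reproducibility
    "min_samples_split": 2,  # same as the scikit-learn default
    "min_samples_leaf": 1,   # same as the scikit-learn default
}


def build_regressor(settings=None):
    """Create a scikit-learn random forest regressor with the given settings."""
    # Imported here so that the settings can be inspected without scikit-learn
    from sklearn.ensemble import RandomForestRegressor
    return RandomForestRegressor(**(settings or RF_SETTINGS))
```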
Features (predictors)#
We chose features based on their demonstrated predictive ability in previous studies. For example, regarding temperature, air temperature TA was selected for NEE predictions, while soil temperature TS was deemed more suitable for FCH4 and FN2O.
Variables used as features in training the random forest models.
In addition to features listed below, timestamp info was included as additional features:
++ Added new columns with timestamp info: ['.YEAR', '.SEASON', '.MONTH', '.WEEK', '.DOY', '.HOUR', '.YEARMONTH', '.YEARDOY', '.YEARWEEK']
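The timestamp columns listed in this output can be derived from a record's timestamp roughly as sketched below. Several details are assumptions: only the summer code (2 for June to August) is documented here, so the other season codes are guesses; the week is taken as the ISO week number; and the zero-padding of YEARWEEK is inferred from the other string formats. The function name is hypothetical.

```python
from datetime import datetime

# Assumption: 1 = spring, 2 = summer, 3 = autumn, 4 = winter.
# Only "2 for summer (June, July, August)" is documented.
SEASON_OF_MONTH = {3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2,
                   9: 3, 10: 3, 11: 3, 12: 4, 1: 4, 2: 4}


def timestamp_features(ts):
    """Derive timestamp features for one record."""
    doy = ts.timetuple().tm_yday
    week = ts.isocalendar()[1]  # assumption: ISO week number
    return {
        ".YEAR": ts.year,
        ".SEASON": SEASON_OF_MONTH[ts.month],
        ".MONTH": ts.month,
        ".WEEK": week,
        ".DOY": doy,
        ".HOUR": ts.hour,
        ".YEARMONTH": f"{ts.year}-{ts.month:02d}",
        ".YEARDOY": f"{ts.year}-{doy:03d}",
        ".YEARWEEK": f"{ts.year}-{week:02d}",
    }


july = timestamp_features(datetime(2021, 7, 15, 9, 30))
february = timestamp_features(datetime(2021, 2, 3, 14, 0))
```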
Lagged variants of directly measured variables were included as features. The variants used were MEAN3H, step-lagged and TIMESINCE variants. For a description of the variants see Overview: Variants.
Feature importances were calculated as permutation importance: Permutation feature importance assesses feature contributions to a model’s performance. It works by randomly shuffling a feature’s values and observing the resulting performance drop. This reveals how much the model relies on that feature. See the official description in scikit-learn’s documentation.
Feature reduction: Feature reduction was performed by comparing feature permutation importances to a random variable (.RANDOM), which consisted of random numbers (floats) between 0 and 1. Features with importance below that of .RANDOM were discarded. To ensure consistency across yearly models, a feature was only removed if it was deemed unimportant in all yearly models. A feature was retained if it was deemed important in at least one model.
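The principle of permutation importance with a random reference variable can be illustrated with a small, dependency-free sketch. It uses a fixed toy model instead of a trained random forest and mean squared error as the score, and all names and data are hypothetical. In practice, scikit-learn provides this calculation as `sklearn.inspection.permutation_importance`.

```python
import random


def mse(model, X, y):
    """Mean squared error of `model` on rows X and targets y."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=42):
    """Mean increase in error after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy data: column 0 drives the target, column 1 is a random variable
data_rng = random.Random(0)
X = [[float(i), data_rng.random()] for i in range(20)]
y = [2.0 * row[0] for row in X]


def toy_model(row):
    """Stand-in for a fitted model; it only uses column 0."""
    return 2.0 * row[0]


importance_signal, importance_random = permutation_importance(toy_model, X, y)
```

Shuffling the informative column increases the error, while shuffling the random column leaves it unchanged, which is the contrast that the comparison against .RANDOM relies on.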
Important
By default, the random forest regressor in scikit-learn uses Gini importance to calculate feature importances. However, Gini (impurity-based) feature importances can be misleading for high-cardinality features (many unique values), as mentioned in the scikit-learn documentation. In the models (also during feature reduction) we used permutation importance instead of Gini importance.
Important
The amount of applied fertilizer nitrogen (N) was not used as a predictive feature. This decision stemmed from the difficulty of translating a single application event into a format suitable for machine learning. The substantial range of decay function values reported in the literature hindered the development of a consistent representation, and we had no detailed information about possible decay rates over the long measurement campaign at the study site. A potential solution for future analyses is to carry the total N amount forward as a repeated value until the next application occurs.
NEE, LE, H#
Features used for gap-filling fluxes.
MGMT_VARS = [
"TIMESINCE_MGMT_FERT_MIN_FOOTPRINT", "TIMESINCE_MGMT_FERT_ORG_FOOTPRINT", "TIMESINCE_MGMT_GRAZING_FOOTPRINT", "TIMESINCE_MGMT_MOWING_FOOTPRINT", "TIMESINCE_MGMT_SOILCULTIVATION_FOOTPRINT", "TIMESINCE_MGMT_SOWING_FOOTPRINT", "TIMESINCE_MGMT_PESTICIDE_HERBICIDE_FOOTPRINT"]
FEATURES = [
"SW_IN_T1_2_1", "TA_T1_2_1", "VPD_T1_2_1"] + MGMT_VARS
FN2O, FCH4#
Features used for gap-filling fluxes.
# Management variables (also includes time since PRECIP)
MGMT_VARS = [
"TIMESINCE_MGMT_FERT_MIN_FOOTPRINT", "TIMESINCE_MGMT_FERT_ORG_FOOTPRINT", "TIMESINCE_MGMT_GRAZING_FOOTPRINT", "TIMESINCE_MGMT_MOWING_FOOTPRINT",
"TIMESINCE_MGMT_SOILCULTIVATION_FOOTPRINT", "TIMESINCE_MGMT_SOWING_FOOTPRINT",
"TIMESINCE_MGMT_PESTICIDE_HERBICIDE_FOOTPRINT", "TIMESINCE_PREC_RAIN_TOT_GF1_0.5_1"
]
# Already lagged variants of SWC, TS and PRECIP
AGG_VARS = [
# SWC
"SWC_GF1_0.15_1_gfXG_MEAN3H", ".SWC_GF1_0.15_1_gfXG_MEAN3H-24", ".SWC_GF1_0.15_1_gfXG_MEAN3H-18", ".SWC_GF1_0.15_1_gfXG_MEAN3H-12", ".SWC_GF1_0.15_1_gfXG_MEAN3H-6",
# TS
"TS_GF1_0.04_1_gfXG_MEAN3H", ".TS_GF1_0.04_1_gfXG_MEAN3H-24", ".TS_GF1_0.04_1_gfXG_MEAN3H-18", ".TS_GF1_0.04_1_gfXG_MEAN3H-12", ".TS_GF1_0.04_1_gfXG_MEAN3H-6", "TS_GF1_0.15_1_gfXG_MEAN3H", ".TS_GF1_0.15_1_gfXG_MEAN3H-24", ".TS_GF1_0.15_1_gfXG_MEAN3H-18", ".TS_GF1_0.15_1_gfXG_MEAN3H-12", ".TS_GF1_0.15_1_gfXG_MEAN3H-6",
"TS_GF1_0.4_1_gfXG_MEAN3H", ".TS_GF1_0.4_1_gfXG_MEAN3H-24", ".TS_GF1_0.4_1_gfXG_MEAN3H-18", ".TS_GF1_0.4_1_gfXG_MEAN3H-12", ".TS_GF1_0.4_1_gfXG_MEAN3H-6",
# PRECIP
"PREC_RAIN_TOT_GF1_0.5_1_MEAN3H", ".PREC_RAIN_TOT_GF1_0.5_1_MEAN3H-24", ".PREC_RAIN_TOT_GF1_0.5_1_MEAN3H-18", ".PREC_RAIN_TOT_GF1_0.5_1_MEAN3H-12", ".PREC_RAIN_TOT_GF1_0.5_1_MEAN3H-6"
]
METEO_VARS = [
"TS_GF1_0.04_1_gfXG", "TS_GF1_0.15_1_gfXG", "TS_GF1_0.4_1_gfXG", "SWC_GF1_0.15_1_gfXG", "PREC_RAIN_TOT_GF1_0.5_1"
]
FEATURES = METEO_VARS + AGG_VARS + MGMT_VARS