Running the Virtual Ecosystem for your location#

This page guides you through setting up Virtual Ecosystem simulations for your location. The complete setup requires extensive data and effort, so this tutorial focuses on the general process of changing model settings and loading new data. For the full setup, consult the core settings and model-specific setup details. We strongly recommend running the example simulation before configuring a new site.

To run a Virtual Ecosystem simulation for your own location you will need to:

  1. Select the set of models you want to run

  2. Provide any location-specific constant values (as well as new values for any constants whose defaults you disagree with)

  3. Provide the core details of the experiment you want to run, e.g. grid size, grid resolution, simulation length, etc

  4. Provide NetCDF input data that matches the spatial (and in some cases temporal) dimensions of your experimental area

  5. Provide data on the plant functional types and animal functional groups included in the simulation (as CSV inputs)

Configuration system overview#

All the changes you need to make to set up the Virtual Ecosystem involve changing the configuration, so we start this tutorial with a brief overview of how the Virtual Ecosystem configuration system is used.

The configuration can be split over as many files as you wish (though we advise against structuring it as one massive, hard-to-read file or as hundreds of tiny files). All configuration files must be written in TOML. When the run starts, the configuration inputs are combined and the resulting combined model configuration is validated. By default, the combined configuration is written out to a single file to provide a permanent record of the model configuration. In all cases, your TOML configuration files only need to specify values that do not have a default value (typically file paths) or where you want to change a default.

An example of a toml configuration is shown below:

[core]
[core.grid]
cell_nx = 10
cell_ny = 10

Here, the first tag indicates the module in question (e.g. core), and subsequent tags indicate (potentially nested) module level configuration details (e.g. horizontal grid size cell_nx).

Note that configuration settings cannot be repeated between files, as there is no way to establish which of two values (of e.g. core.grid.cell_nx) the user intended to provide. When settings are repeated, the validation of the configuration will fail.

Validation occurs automatically when the simulation starts. If any issues are found then the simulation will terminate, with the details of the issues being written to the simulation log file. The validation checks for a much broader range of things than just repeated settings, including that configured input files actually exist and that numeric inputs are within a range of accepted values.

Selecting the models you want to run#

The Virtual Ecosystem allows you to choose which set of models you wish to run. Unless you are trying to run static model simulations (in which case consult the static mode guide), you will always want to run the primary set of Virtual Ecosystem models. In this case, the only choice you will have to make is which microclimate implementation you wish to use (i.e. abiotic_simple or abiotic). The choice of models to be configured is indicated by including the required model names as top level entries in the model configuration. Note that the model name is required, even if the configuration uses all of the default settings. For example, this configuration specifies that six models are to be used, all with their default settings:

[core]  # optional
[soil]
[litter]
[hydrology]
[plants]
[abiotic]
[animal]

The [core] element is optional as the Virtual Ecosystem core module is always required and the default core settings will be used if it is omitted. It can be useful to include it as a reminder that a particular configuration is intentionally using the default settings. Each module configuration section can of course be expanded to change defaults.

Note

The order in which models are run is not something that can be controlled by users (i.e. execution order is not controlled by where models are placed in the configuration). As some models require outputs of other models in order to run, there are hard constraints on the order they can be run in. The simulation automatically chooses a valid model execution order during the configuration process.

Changing model constants#

The majority of constants included in the Virtual Ecosystem are universal and so are not expected to vary from site to site. This means that you do not have to provide new values for them to set up a new site (though you are very welcome to change them if you disagree with our choices of values). However, some things that we include as “constants” are in fact site specific (e.g. the deposition rate of inorganic phosphorus), and you will have to change them for your site setup. To change the value of a constant you need to provide an updated value for it within a configuration file, under a [model_name.constants] tag. This looks like:

[soil.constants]
phosphorus_deposition_rate = 2.0e-05 # High rate for Amazon Rainforest

You only need to provide values for constants that you wish to change (i.e. the site specific ones, and any for which you disagree with our choice of default values). All constants that you don’t provide values for will just use the default value. Details of all Virtual Ecosystem constants and their default values can be found in the model specific setup details documentation.

Changing the core simulation setup#

Next, you need to provide the core settings for your simulation runs. There are a large number of configuration options that you will need to decide on. However, to keep this tutorial to a reasonable length we will focus on two of the most important: the spatial and temporal scales of the simulation.

The spatial scales of the simulation are controlled by the settings under [core.grid]. The Virtual Ecosystem expects coordinates in metres, so you should choose a projected coordinate system for your site of interest and define a set of grid cells to cover the area at a resolution appropriate for your data. You need to take real care when setting the spatial scale as the data you provide to the model has to be on the same scale, i.e. all input data must be for the same grid size, shape and extent.

Important

Do not use a geographic coordinate system - you cannot use degree coordinates with the Virtual Ecosystem.

The temporal settings are controlled by the settings under [core.timing]. The Virtual Ecosystem updates the simulation state at discrete intervals. You need to decide how long an interval to use and how many time steps to run. Again, you need to take real care when setting the temporal scale as any time varying input data (e.g. climate inputs) you provide to the model has to cover the time period that you want your simulation to run for.

These core simulation settings can be changed in the same way we previously changed constants, i.e.

[core.grid]
cell_area = 10000.0 # hectare grid cells (i.e. 10000.0 m^2)
cell_nx = 100 # 100 grid cells in x direction
cell_ny = 50 # 50 grid cells in y direction

[core.timing]
start_date = "2018-01-01" # Start date in YYYY-MM-DD format
run_length = "5 years" # Run for 5 years with default update time step (1 month)

Important

These details need to be consistent across all of the input data, so it may be useful to create a core site extents file that all your data preparation scripts can use to set these values.
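As a minimal sketch of such a shared site extents file, the Python module below holds the grid values in one place and derives cell coordinates from them. Several things here are assumptions rather than Virtual Ecosystem behaviour: the names `SITE` and `cell_coordinates` and the `x_origin` and `y_origin` arguments are hypothetical, the cells are assumed to be square, and coordinates are assumed to be spaced one cell side apart starting from the origin (as the x and y values in the example data appear to be). Check the core.grid documentation for the convention your version expects.

```python
import math

# Hypothetical shared site definition, mirroring the [core.grid] values above.
SITE = {"cell_area": 10000.0, "cell_nx": 100, "cell_ny": 50}


def cell_coordinates(
    cell_area: float,
    cell_nx: int,
    cell_ny: int,
    x_origin: float = 0.0,
    y_origin: float = 0.0,
) -> tuple[list[float], list[float]]:
    """Return x and y coordinates in metres for a grid of square cells.

    Assumes square cells, so the cell side length is sqrt(cell_area).
    """
    side = math.sqrt(cell_area)
    x = [x_origin + i * side for i in range(cell_nx)]
    y = [y_origin + j * side for j in range(cell_ny)]
    return x, y


x, y = cell_coordinates(**SITE)
print(len(x), len(y), x[:3], y[-1])
```

Data preparation scripts can then import `SITE` and `cell_coordinates` rather than each hard-coding the grid, which reduces the chance of input files disagreeing with the configuration.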

Providing the data required to run your simulations#

The final step to setting up your simulations is adding the required data. This is both the data that defines the initial state of your study site and time series data for the forcing variables (e.g. climate data). This is a pretty complex step, so before we get into the details, we should briefly mention how the Virtual Ecosystem stores data.

The majority of variables in the Virtual Ecosystem are stored in the data object (we will talk about the ones that aren’t later). Data is stored in the data object in a format similar to netCDF. For example, this is what the LMWC soil variable looks like in the example data:

ve_example_data["soil_cnp_pool_lmwc"]
<xarray.DataArray 'soil_cnp_pool_lmwc' (x: 9, y: 9, element: 3)> Size: 2kB
array([[[5.0000000e-03, 2.5000000e-04, 1.0000000e-05],
        [5.0000000e-03, 2.5000000e-04, 1.0000000e-05],
        [5.0000000e-03, 2.5000000e-04, 1.0000000e-05],
        [5.0000000e-03, 2.5000000e-04, 1.0000000e-05],
        [5.0000000e-03, 2.5000000e-04, 1.0000000e-05],
        [5.0000000e-03, 2.5000000e-04, 1.0000000e-05],
        [5.0000000e-03, 2.5000000e-04, 1.0000000e-05],
        [5.0000000e-03, 2.5000000e-04, 1.0000000e-05],
        [5.0000000e-03, 2.5000000e-04, 1.0000000e-05]],

       [[5.0000000e-03, 2.5000000e-04, 1.0000000e-05],
        [5.0781250e-03, 2.5390625e-04, 1.0156250e-05],
        [5.1562500e-03, 2.5781250e-04, 1.0312500e-05],
        [5.2343750e-03, 2.6171875e-04, 1.0468750e-05],
        [5.3125000e-03, 2.6562500e-04, 1.0625000e-05],
        [5.3906250e-03, 2.6953125e-04, 1.0781250e-05],
        [5.4687500e-03, 2.7343750e-04, 1.0937500e-05],
        [5.5468750e-03, 2.7734375e-04, 1.1093750e-05],
        [5.6250000e-03, 2.8125000e-04, 1.1250000e-05]],

...

       [[5.0000000e-03, 2.5000000e-04, 1.0000000e-05],
        [5.5468750e-03, 2.7734375e-04, 1.1093750e-05],
        [6.0937500e-03, 3.0468750e-04, 1.2187500e-05],
        [6.6406250e-03, 3.3203125e-04, 1.3281250e-05],
        [7.1875000e-03, 3.5937500e-04, 1.4375000e-05],
        [7.7343750e-03, 3.8671875e-04, 1.5468750e-05],
        [8.2812500e-03, 4.1406250e-04, 1.6562500e-05],
        [8.8281250e-03, 4.4140625e-04, 1.7656250e-05],
        [9.3750000e-03, 4.6875000e-04, 1.8750000e-05]],

       [[5.0000000e-03, 2.5000000e-04, 1.0000000e-05],
        [5.6250000e-03, 2.8125000e-04, 1.1250000e-05],
        [6.2500000e-03, 3.1250000e-04, 1.2500000e-05],
        [6.8750000e-03, 3.4375000e-04, 1.3750000e-05],
        [7.5000000e-03, 3.7500000e-04, 1.5000000e-05],
        [8.1250000e-03, 4.0625000e-04, 1.6250000e-05],
        [8.7500000e-03, 4.3750000e-04, 1.7500000e-05],
        [9.3750000e-03, 4.6875000e-04, 1.8750000e-05],
        [1.0000000e-02, 5.0000000e-04, 2.0000000e-05]]])
Coordinates:
  * x        (x) int64 72B 0 90 180 270 360 450 540 630 720
  * y        (y) int64 72B 0 90 180 270 360 450 540 630 720
  * element  (element) <U1 12B 'C' 'N' 'P'
Attributes:
    units:        kg m^-3
    description:  Carbon, nitrogen and phosphorus content of the low molecula...

Because the formats are so similar, input data must be provided as netCDF files, which are then added to the data object as part of the configuration process.

Input data dimensions#

The netCDF files that you provide will be arrays of data, e.g. initial values for soil nitrogen concentrations or above canopy air temperatures over time. Many variables will be arrays over multiple different dimensions (e.g. space and time). The array data that you provide must use dimensions that the Virtual Ecosystem recognises. We provide a detailed description of these critical dimensions (or core axes) elsewhere, but in short the possible dimensions are:

  • spatial: This is actually a kind of aggregate dimension, because spatial data can use cell_id or x and y coordinates - these two things map onto each other (see the core.grid configuration settings for details).

  • time: This dimension is used to index time steps along the configured time extent for the simulation. Some variables only need to set the initial conditions and do not need a time axis, but other forcing variables (like temperature and precipitation) need to supply a value for each cell at each time step.

  • pft: Some data requires values per plant functional type. An example is the initial number of propagules per PFT in grid cells.

  • layer: Some data varies vertically by canopy layer (e.g. temperature), and this dimension captures that variation. This dimension is primarily used for variables generated during the model run, so you are unlikely to need to use it for input data (unless you are running models in static mode).

Preparing your array data input files#

The first thing you need to do to prepare your files is to look at the required variables for each science model that you want to include in the simulation and make a list of those variables.

Details of the variables required to set up each model can be found in the data variables page. Note that you only have to provide and configure the input variables shown in that table in bold. The other setup variables for a model will have been calculated by the setup process of earlier models.

Warning

The axis field in that data is currently not to be trusted - we have not systematically reviewed that data and there isn’t any internal checking that the stated axes are what is on the data.

Then, for each variable, you will need to compile appropriate data (given the axes required) and save it as NetCDF files, providing labelled dimensions and coordinates to match the input data to the axes and coordinates of your model configuration. The process for compiling this data varies dramatically by model, and you should refer to the model specific setup documentation to understand how to compile data for the specific models you are interested in. You can also consult the example data page for examples of NetCDF input files.

Important

Input variables are usually clearly thematically linked to the scientific domain of a single model. However, in some cases models require less obvious data. For example:

  • The plants model requires shortwave downwelling radiation. Although this seems like an abiotic variable, it is required for modelling plant growth and the partitioning of radiation within the canopy is calculated by the plants model.

  • The animal model requires fungal fruiting body densities for consumption by fungivores. The soil and litter models update these values but the data is first required by the animal model.

Collecting data for a simulation is likely to involve a data science team with different domain knowledge for the different models. You may not want to break down data collection tasks strictly by model and instead identify variables that may require domain knowledge from elsewhere in the team.

Configuring array data inputs#

Once you have your input data files, you will then need to add the data to your model configuration. This is done using the core.data.variable configuration section: for each variable, you need to include a configuration section giving the variable name and then the data file in which the variable is found. Note that you can have multiple variables in a single NetCDF file.

As an example, the following TOML gives the configuration for loading two climatic data variables stored in the same file:

[[core.data.variable]]
file_path = "../data/example_climate_data.nc"
var_name = "air_temperature_ref"
[[core.data.variable]]
file_path = "../data/example_climate_data.nc"
var_name = "relative_humidity_ref"

All file paths that you provide must be valid paths to netCDF files. Configuration errors will also occur if any of the variable names (var_name) you provide are not found in the associated netCDF file. Finally, if the dimension lengths or any coordinates (such as x and y locations) of a variable are not compatible with the model configuration then a configuration error will occur.

Other data inputs#

Some initial model data does not use the main data loading system. This is typically where the data does not map neatly onto one of the core axes mentioned above. These data will have specific model configuration settings. For example:

  • The plants model requires a set of defined plant functional types (PFTs). This is a CSV file defining a set of required trait values for each PFT, and the path to this file is set in the plants model configuration options.

  • The plants model also requires a defined initial cohort structure, which sets the initial cohorts present in each cell. This again is defined as a CSV file with the path set in the plants model configuration options.

  • The animal model also requires a set of defined functional groups. These are defined in a CSV file with the path provided as part of the animal model configuration.

  • The soil model requires parameter estimates for the microbial functional groups and enzyme classes that it uses. However, they are added as part of the configuration (in a similar manner to model constants) rather than in a data file.

There is no generic system for reading in CSV data. Instead, a path to each file needs to be provided as part of the configuration of the relevant model, e.g.

[plants]
cohort_data_path = "../data/example_plant_cohorts.csv"
pft_definitions_path = "../data/plant_pfts.csv"

[animal]
functional_group_definitions_path = '../data/animal_functional_groups.csv'

List of required data files#

To close this tutorial, we will briefly recap the full set of files that you need to provide:

  • You must provide a folder of TOML files containing the configuration settings for Virtual Ecosystem. These files can be named whatever you like (though you should aim to give them easy to understand names). You can split the configuration settings over as many or as few files as you like (though you want to ensure that the purpose of each individual file is obvious). A path to this folder has to be provided when using ve_run.

  • You must provide array data for every variable that is required to set up or update the models you wish to run, except the ones that are populated by one of the other models. To figure out what these variables are you should consult the data variables page. These variables can be provided over as many or as few files as you wish, but again you need to make sure that the purpose of each file is clear. For the example data we chose to split by model, but if a different split makes more sense for your use case you should use that instead. Paths to each of these files then need to be provided as part of the configuration.

  • You also need to provide three CSV files: one defining the plant functional types, one giving the location and density of the plant cohorts, and one defining the animal functional groups. You can name these files whatever you like (again, choose sensible names), but you need to provide paths to each of them as part of the configuration.