Configuration

KCS uses a configuration file that sets most of the default settings. The configuration file is in TOML format, as an attempt to find a middle ground between more complicated (e.g. YAML) and very simple (INI) formats.

Note that this configuration provides the default settings; these would usually be constants in a Python script. Often, instead of providing a configuration file, it is clearer (and easier) to supply extra or changed options on the command line (as in the usage guide), or as arguments to a function call (as in the example Python script).

The default configuration file is, in fact, a long Python string, and can be found in kcs.config.default as CONFIG_TOML_STRING. You should, however, rarely need to access it directly. Instead, use the kcs.config.default_config object, which is a Python (nested) dictionary that maps directly to this TOML string.
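Since default_config is a plain nested dictionary, default values can be inspected directly. A minimal sketch (the key path is taken from the [data.cmip] section in the listing at the end of this section):

from kcs.config import default_config

# The nested dictionary mirrors the TOML sections; for example, the
# CMIP control period from the [data.cmip] section:
print(default_config['data']['cmip']['control_period'])  # [1990, 2019]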

If you only use the command line runnable modules, you can obtain the TOML configuration file by executing

python -m kcs.config > my-config.toml

This dumps the default configuration into the my-config.toml file, which can be edited to your liking and then used with other commands via the -C or --config option, e.g.

python -m kcs.tas_change @cmip-tas-global-averaged.list \
    --outfile=tas_change_cmip.csv --config my-config.toml

The default configuration file is commented in various places, and is also listed at the end of this section. (The comments are the reason the default configuration is in the form of a string: keeping it as a Python object would lose the comments once transformed into a TOML string or file.)

Note that in TOML, some rules apply to values: strings should always be quoted (with double-quote characters); multiline strings are possible with triple (double) quotes, similar to Python. Floating-point values should always contain a fractional part or an exponent (or both); otherwise, the value is interpreted as an integer, or is invalid. This should be clear from the default configuration file, where, for example, the percentiles are all floating-point values, even though the default percentiles happen to be whole numbers. For more details, see the TOML specification.
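The following lines, taken from the default configuration below, illustrate these rules:

name = "G"    # strings are always (double-)quoted
epoch = 2050  # no fractional part or exponent: an integer
W = 90.0      # the trailing ".0" makes this a floating-point value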

The configuration file is made up of several sections; these correspond one-to-one to the top-level keys in the default_config object. The sections can appear in any order, and are:

  • scenario

Definitions belonging to the requested scenarios, such as the names, the epochs, the percentiles of annual tas change to match, and the percentage of precipitation change.

  • areas

Definitions of the possible areas of interest. For rectangles, use w, e, s and n keys, though it is also possible to use lat and lon with 2-element arrays of values (see the sketch after this list). For single points, use just lat and lon with a single value. Note that all these values should be floating-point values. For a global area, there is the special string value "global".

  • variables

    Some definitions about variables. Currently, it only sets a list of short variable names for which relative changes should be calculated, such as pr.

  • statistics

Which statistics (mean, percentiles) to calculate. There are two subsections: one for the change in temperature (tas_change), which leads to the calculation of the steering table; and one for the statistics of the local and regional changes in e.g. precipitation and temperature. (Note: this is not yet fully implemented, as of end of March 2020.)

  • resampling

This sets the most important parameters for the three resampling steps. The conditions for resampling step 2 are given in a separate file, since they would make for a lengthy section by themselves. An example of this file can be found in the config/ directory. There is not yet a completely built-in configuration for this file; at the least, the default should be copied into the current working directory (where it will automatically be found and read).

  • data

This is a long section, with various definitions about the input data sets. It is split into a few subsections: one about cmip data, one about “extra” data (concerning the model of interest), and one for matching datasets (i.e., pairing historical and future experiments). The latter also has subsections under the cmip and extra subsections, which can override the base “data.matching” settings. There is also a “data.filenames” section, which defines the filename format for various datasets through regular expression patterns; these are needed when attributes have to be deduced from the filenames (e.g., when the attributes are not available directly through a data file). A short demonstration follows this list.

  • plotting

This section is defined, but its functionality is, as of end of March 2020, not yet implemented. Once implemented, it will allow one to control e.g. the figure size, colors, transparency (alpha) and a few other options.
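As noted under the areas section, a rectangle can also be given with 2-element lat and lon arrays. A minimal sketch, rewriting the default nlbox area; note that the element ordering ([south, north] and [west, east]) is an assumption here:

[areas]
# Assumed equivalent of nlbox = {w = 4.5, e = 8.0, s = 50.5, n = 53.0}
nlbox = {lat = [50.5, 53.0], lon = [4.5, 8.0]}

And as noted under the data section, attributes can be deduced from filenames through the patterns in data.filenames. A minimal sketch, applying the cmip5 pattern from the default configuration to a hypothetical filename:

import re
from kcs.config import default_config

pattern = default_config['data']['filenames']['cmip5']['pattern']
# Hypothetical CMIP5-style filename:
match = re.match(pattern, "tas_Amon_EC-EARTH_rcp85_r1i1p1_200601-210012.nc")
print(match.groupdict())
# {'var': 'tas', 'mip': 'Amon', 'model': 'EC-EARTH', 'experiment': 'rcp85',
#  'realization': '1', 'initialization': '1', 'physics': '1'}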

If you want to make changes, use the outputted configuration file, edit it, and then supply it on the command line, or read it explicitly with kcs.config.read_config("my-config.toml") (for use of the latter, see the Script example).

It is possible to output only a single section from the default configuration, edit it, and supply that as the configuration file. This shortens the local (user) configuration file considerably, and may make it clearer what has been changed. Supply the section name after the kcs.config module to output just that section, for example:

python -m kcs.config resample > resample-config.toml

Note that configuration files should always contain complete sections: there is no guarantee (at least, not yet) that missing key-value settings will be taken from the built-in configuration. While this can make it less clear which changes have been made in the user configuration file, it keeps the configuration for each section nicely together.
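For example, a user configuration that only overrides the resampling section could look as follows (a sketch: all values are copied from the default configuration, with only nsample changed as an illustration):

# resample-config.toml
[resampling]
nsections = 6
nstep1 = 1000
nstep3 = 8
nsample = 5_000   # changed from the default 10_000
nproc = 1
step2_conditions = "step2.toml"
penalties = {1 = 0.0, 2 = 0.0, 3 = 1.0, 4 = 5.0}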

Default user configuration

It is possible to have a default user configuration file in the current working directory that is automatically read by the scripts (runnable modules). That is, there is no need to supply the --config <my-config-file.toml> option.

This has an advantage and a disadvantage. The advantage is cleaner commands, not cluttered with options that can confuse a reader. The disadvantage is that if such a default user configuration exists in one place, but not in another (either another working directory, or on the machine of another user), the configuration, and thus the results, may be quite different. If you use this feature, always make sure this configuration is supplied when publishing the results (even if just emailing the results to a colleague).

This default user configuration file should be called kcs-config.toml and be located in the current working directory. For the (disadvantage) reason stated above, it is not a hidden file, nor is it located user-wide in, for example, the user’s home directory or an XDG configuration directory.

The kcs-config.toml file is not automatically read when using a Python script; it only works with the command-line scripts, as in the usage guide. When using a Python script, you can still read this file, either explicitly with kcs.config.read_config("kcs-config.toml"), or implicitly with kcs.config.read_config(). The latter will attempt to read the default user configuration file and, if that is not found, fall back to the built-in configuration.
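A minimal sketch of both variants:

import kcs.config

# Explicitly read a specific configuration file:
kcs.config.read_config("kcs-config.toml")

# Or try the default user configuration (./kcs-config.toml), falling
# back to the built-in configuration if that file does not exist:
kcs.config.read_config()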

Full annotated configuration

(Valid as of end of March 2020)

Below, the full configuration is listed, with comments annotating various parts. This is the same as what one would get from running python -m kcs.config, and, in fact, the same as kcs.config.default.CONFIG_TOML_STRING.

# Notes about values in the TOML config file:
# - strings should always be quoted. Multi-line strings are possible
#   using triple-quotes, but should not be necessary in this config file
# - floating point values require a decimal dot, or an exponent (or
#   both). Otherwise, they are interpreted as integers.
# More information about the TOML config format: https://github.com/toml-lang/toml



[scenario]

[scenario.defs]
# Note: the definitions for a name or precip here do not have to be commented
# out, even if the corresponding scenarios are commented out below

# percentile of CMIP tas for a given epoch that the 'W' and 'G' scenarios correspond to
W = 90.0
G = 10.0
# percent change in precipitation (times global tas increase) for 'L' and 'H' scenarios
L = 4.0
H = 8.0


# Scenarios to use.
# This defines only the names and epoch.
# Values (what G/W and L/H correspond to) are filled in below.
# Comment out scenarios of no interest.
[scenario.G_L2050]
name = "G"
epoch = 2050
precip = "L"
[scenario.W_L2050]
name = "W"
epoch = 2050
precip = "L"
[scenario.G_H2050]
name = "G"
epoch = 2050
precip = "H"
[scenario.W_H2050]
name = "W"
epoch = 2050
precip = "H"
[scenario.G_L2085]
name = "G"
epoch = 2085
precip = "L"
[scenario.W_L2085]
name = "W"
epoch = 2085
precip = "L"
[scenario.G_H2085]
name = "G"
epoch = 2085
precip = "H"
[scenario.W_H2085]
name = "W"
epoch = 2085
precip = "H"


[areas]
# Define areas of interest
# Use w, e, s, n identifiers in an inline-table/map/dict for a rectangular area
# Or lat, lon in an inline-table/map/dict for a single point
# Or the special value "global"
# Values for w/e/s/n/lat/lon should all be floating point.
# Shapefiles and masks are not yet supported.
global = "global"
nlpoint = {lat = 51.25, lon = 6.25}
nlbox = {w = 4.5, e = 8.0, s = 50.5, n = 53.0}
weurbox = {w = 4.0, e = 14.0, s = 47.0, n = 53.0}
rhinebasin = {w = 6.0, e = 9.0, n = 52.0, s = 47.0}

[variables]
# Some generic configuration regarding variables

# For which variables to calculate a relative change,
# instead of an absolute change (compare 'pr' versus 'tas')
# A list of short variable names.
relative = ["pr"]


[data]
# Some generic configuration for the input data
# This assumes NetCDF files with proper (CF-conventions) attributes

[data.attributes]
# Define the attribute names for meta information.
# Each definition should be a list: this makes it possible to handle different
# conventions (e.g., between CMIP5 and CMIP6) when one attribute is not available.
experiment = ["experiment_id"]
model = ["model_id", "source_id"]
realization = ["realization", "realization_index"]
initialization = ["initialization_method", "initialization_index"]
physics = ["physics_version", "physics_index"]
prip = ["parent_experiment_rip", "parent_variant_label"]
var = ["variable_id"]

# What is the attribute value that indicates historical experiments?
# Everything else is assumed to be a future experiment.
# This value is case-insensitive.
historical_experiment = "historical"

[data.extraction]
template = "data/{var}-{area}-averaged/{filename}.nc"


[data.filenames]
# Definitions of filename patterns, to obtain attribute information from.
# Several are given, for various conventions.
# All are tried, until a match is found.

# Regexes can be notably hard to read, especially in this case, since every
# backslash needs to be escaped, resulting in lots of double backslashes.

[data.filenames.esmvaltool]
pattern = """^CMIP\\d_\
(?P<model>[-A-Za-z0-9]+)_\
(?P<mip>[A-Za-z]+)_\
(?P<experiment>[A-Za-z0-9]+)_\
r(?P<realization>\\d+)\
i(?P<initialization>\\d+)\
p(?P<physics>\\d+)_\
(?P<var>[a-z]+)_\
.*\\.nc$\
"""

[data.filenames.cmip5]
pattern = """^\
(?P<var>[a-z]+)_\
(?P<mip>[A-Za-z]+)_\
(?P<model>[-A-Za-z0-9]+)_\
(?P<experiment>[A-Za-z0-9]+)_\
r(?P<realization>\\d+)\
i(?P<initialization>\\d+)\
p(?P<physics>\\d+)_\
.*\\.nc$\
"""

[data.filenames.cmip6]
pattern = """^\
(?P<var>[a-z]+)_\
(?P<mip>[A-Za-z]+)_\
(?P<model>[-A-Za-z0-9]+)_\
(?P<experiment>[A-Za-z0-9]+)_\
r(?P<realization>\\d+)\
i(?P<initialization>\\d+)\
p(?P<physics>\\d+)\
f\\d+_\
gn_\
.*\\.nc$\
"""

[data.filenames.ecearth]
pattern = """^\
(?P<var>[a-z]+)_\
(?P<mip>[A-Za-z]+)_\
(?P<model>[-A-Za-z0-9]+)_\
(?P<experiment>[A-Za-z0-9]+)_\
.*\\.nc$\
"""


[data.cmip]
# Configuration for everything that considers CMIP data


# Actual timespan a given scenario epoch corresponds to
periods = {2050 = [2036, 2065], 2085 = [2071, 2100]}

# Control period defines the reference period to which to compare (and
# possibly normalize) to.
# CMIP5 would be [1980, 2009], CMIP6 would be [1990, 2019]
control_period = [1990, 2019]

# Normalize the CMIP data to the control period.
# Choices are "model", "experiment" or "run". These options vary from
# the most to the least spread of normalized data around the control period.
# Leave blank to not normalize (usually a bad idea).
norm_by = "run"

# Calculate the tas change for a specific seasonal average, or a yearly average
# Choices are "year", "djf", "mam", "jja", "son".
season = "year"


[data.matching]
# Configuration for how to match and concatenate CMIP historical and future experiments

# Match future and historical runs by model (very generic) or ensemble (very specific).
by = "ensemble"

# Where to get the match info from: either the (NetCDF) 'attributes' or the 'filename' pattern.
# This should always be a list: later options in the list serve as a fallback
# in case earlier options don't succeed
info_from = ["attributes", "filename"]

# What to do when a future ensemble can't be matched:
# - "error": raise an error, and stop the program
# - "remove": remove (ignore) the future ensemble
# - "randomrun": pick a random historical run that matches all attributes, except the realization
# - "random": pick a random historical run from all ensembles of that model
on_fail = "randomrun"

[data.cmip.matching]
# Empty: all values are taken from [data.matching]


[data.extra]
# Configuration for additional data
# This is the data of interest, for which a steering table will be
# calculated, and whose runs will be resampled.

control_period = [1990, 2019]

[data.extra.matching]
# An empty string means the datasets are already concatenated: historical + future.
by = ""

# Any other keys not defined here (info_from, on_fail) are taken from [data.matching]



[statistics]
# Which statistics (mean and percentiles) to calculate.
# There are two subsets:
# 1. the statistics for the tas change (which leads to the steering table)
# 2. the statistics for the seasonal-regional change plots

[statistics.tas_change]
# A list of floating-point numbers (matching these numbers to find
# the right percentiles is done with a tolerance of 0.0001).
percentiles = [5.0, 10.0, 25.0, 50.0, 75.0, 90.0, 95.0]
# Calculate the mean as well
mean = true

[statistics.regional_changes]
percentiles = [5.0, 10.0, 25.0, 50.0, 75.0, 90.0, 95.0]
mean = true


[plotting]
figsize = [8.0, 8.0]

[plotting.tas_increase]
# Configuration for global temperature increase plot

# CMIP percentiles levels to plot
# Each level needs a name, to be re-used in the styles configuration below
levels = {extreme = [5.0, 95.0], middle = [10.0, 90.0], narrow = [25.0, 75.0]}

# Smooth the plot across time; value should be an integer. Leave blank for no smoothing.
rolling_window = 10

[plotting.tas_increase.extra_data]
# Overplot extra datasets?
overplot = true
# Plot averaged data, or individual runs
average_data = true
# Smooth with rolling window; same as for CMIP data
rolling_window = 10


[plotting.tas_increase.labels]
x = "Year"
y = "Increase [${}^{\\circ}$]"

[plotting.tas_increase.range]
# Use a 2-element list. Leave blank to let Matplotlib figure things out.
x = [1950, 2100]  # years, in integers
y = [-1.0, 6.0]   # always float

[plotting.tas_increase.styles]
# Re-use the level names above.
# Use Matplotlib color and alpha codes.
# Anything not given is 'black' (color) and 1.0 (alpha; opaque).
colors = {extreme = "#bbbbbb", middle = "#888888", narrow = "#555555", extra_data = "#669955"}
alpha = {extreme = 0.8, middle = 0.4, narrow = 0.2}



[resampling]

nsections = 6
nstep1 = 1000
nstep3 = 8
# Number of Monte Carlo samples
nsample = 10_000

# Number of simultaneous processes for the calculations of step 1.
# Note that for relatively few input runs (< 12), the overhead
# generally costs more than what multiprocessing gains.
nproc = 1

# TOML file that defines the percentiles ranges used in step 2
step2_conditions = "step2.toml"


# Penalties in step 3 for the number of (multiple) occurrences of a segment in the resamples.
# Counting starts from 1 occurrence, that is, no duplicates.
# Only give the numbers of occurrences that have a penalty less than
# infinity, including a 0.0 penalty (e.g., for a single occurrence, `1`).
# All penalties should be floating point numbers.
penalties = {1 = 0.0, 2 = 0.0, 3 = 1.0, 4 = 5.0}