Using configuration files

Configuration files are a Snakemake feature that separates the rules in a workflow from the configuration of that workflow. For example, suppose that we want to run the same sequence trimming workflow on many different samples. With the techniques we've seen so far, we would need to change the Snakefile each time; with config files, we can keep the Snakefile the same and just provide a different config file for each new sample. Config files can also be used to define parameters, or override default parameters, for specific programs run by the workflow.

A first example - running a rule with a single sample ID

Consider this Snakefile, snakefile.one_sample, which creates an output file based on a sample ID. Here the sample ID is taken from a config file and made available through the Python dictionary named config:

configfile: "config.one_sample.yml"

SAMPLE=config['sample']

rule all:
    output:
        expand("one_sample.{s}.out", s=SAMPLE)
    shell:
        "touch {output}"

The default configuration file is config.one_sample.yml, which sets config['sample'] to the value XYZ_123, so running the workflow creates one_sample.XYZ_123.out:

sample: XYZ_123

However, the configfile: directive in the Snakefile can be overridden on the command line by using --configfile; consider the file config.one_sample_b.yml:

sample: ABC_456

If we now run snakemake -s snakefile.one_sample --configfile config.one_sample_b.yml -j 1, the value of sample will be set to ABC_456, and the file one_sample.ABC_456.out will be created.
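One way to picture the override is as a dictionary update: values loaded from the file given with --configfile replace values with the same key loaded by the configfile: directive. The sketch below models this with plain Python dictionaries. It is a simplified illustration, not Snakemake's actual implementation; the function name resolve_config is hypothetical, and the two dictionaries stand in for the parsed contents of the two YAML files.

```python
def resolve_config(snakefile_config, cli_config=None):
    """Merge two config dictionaries; command-line values win.

    Hypothetical helper for illustration only (flat dictionaries,
    no nested merging).
    """
    merged = dict(snakefile_config)   # start from the configfile: directive
    merged.update(cli_config or {})   # --configfile values replace matching keys
    return merged


# Stand-ins for config.one_sample.yml and config.one_sample_b.yml.
default_config = {"sample": "XYZ_123"}
override_config = {"sample": "ABC_456"}

config = resolve_config(default_config, override_config)
output_file = "one_sample.{s}.out".format(s=config["sample"])
print(output_file)  # one_sample.ABC_456.out
```

Without a second dictionary, resolve_config(default_config) returns the default values unchanged, which corresponds to running the workflow without --configfile.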

(CTB: assert that the appropriate output files are created.)

Specifying multiple sample IDs in a config file

The previous example only handles one sample at a time, but there's no reason we couldn't provide multiple, using YAML lists. Consider this Snakefile, snakefile.multi_samples:

configfile: "config.multi_samples.yml"

SAMPLES=config['samples']

rule all:
    input:
        expand("one_sample.{s}.out", s=SAMPLES)

rule make_single_sample_wc:
    output:
        "one_sample.{s}.out"
    shell:
        "touch {output}"

and this config file, config.multi_samples.yml:

samples:
- DEF_789
- GHI_234
- JKL_567

Here, we're creating multiple output files, using a more complicated setup.

First, we use samples from the config file. The config['samples'] value is a Python list of strings rather than a single string as in the previous example; that's because config.multi_samples.yml specifies samples as a YAML list.

Second, we switched to using a wildcard rule in the Snakefile, because we want to run one rule on many files; a single rule definition now covers every sample in the list, however many there are.

Last but not least, we provide a default rule that uses the expand function with a single pattern and one list of values to construct the list of output files for the wildcard rule to make.
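To make the role of expand concrete, here is a minimal pure-Python sketch of what it does in this case: one pattern, one wildcard, one list of values. The function expand_simple is a hypothetical stand-in written for illustration; it is not Snakemake's expand and handles only a single wildcard.

```python
def expand_simple(pattern, **wildcards):
    """Fill a single wildcard in `pattern` with each supplied value.

    Simplified stand-in for expand(); supports exactly one wildcard.
    """
    if len(wildcards) != 1:
        raise ValueError("this sketch supports exactly one wildcard")
    name, values = next(iter(wildcards.items()))
    if isinstance(values, str):
        values = [values]  # treat a single string as a one-item list
    return [pattern.format(**{name: value}) for value in values]


# Stand-in for config['samples'] as loaded from config.multi_samples.yml.
SAMPLES = ["DEF_789", "GHI_234", "JKL_567"]

targets = expand_simple("one_sample.{s}.out", s=SAMPLES)
print(targets)
# ['one_sample.DEF_789.out', 'one_sample.GHI_234.out', 'one_sample.JKL_567.out']
```

Each name in the resulting list matches the wildcard rule's output pattern one_sample.{s}.out, which is how the default rule ends up requesting one run of the wildcard rule per sample.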

Now we can either edit the list of samples in the config file, or we can provide different config files with different lists of samples!

Specifying input spreadsheets via config file

Specifying command line parameters in a config file

note, might want to have some info on parameters in output files.

Providing config variables on the command line

Debugging config files

print, pprint keys defaults
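As a starting point for these notes, here is a minimal sketch of inspecting a config dictionary with standard Python tools. The dictionary below is a stand-in for the config dictionary available inside a Snakefile, and the threads key and its default value are hypothetical, chosen only to show a fallback.

```python
from pprint import pprint

# Stand-in for the config dictionary that is available inside a Snakefile.
config = {"samples": ["DEF_789", "GHI_234", "JKL_567"]}

# Print the whole configuration in a readable layout.
pprint(config)

# List the top-level keys that were actually loaded.
keys = sorted(config.keys())
print("config keys:", keys)

# Use a default when a key is missing ("threads" is a hypothetical key).
threads = config.get("threads", 1)
print("threads:", threads)

# Fail early with a clear message if a required key is absent.
if "samples" not in config:
    raise KeyError("config file must define 'samples'")
```

Using config.get with a default keeps the workflow running when an optional key is left out, while the explicit check produces a readable error for keys the workflow cannot do without.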

Recap

With config files, you can:

  • separate configuration from your workflow
  • provide multiple different config files for the same workflow
  • change the samples by editing a YAML file instead of a Snakefile
  • make it easy to validate your input configuration (DISCUSS)

Leftovers

  • Point to official snakemake docs
  • Guide to YAML and JSON syntax