Using configuration files
Configuration files are a snakemake feature that can be used to separate the rules in the workflow from the configuration of the workflow. For example, suppose that we want to run the same sequence trimming workflow on many different samples. With the techniques we've seen so far, you'd need to change the Snakefile each time; with config files, you can keep the Snakefile the same, and just provide a different config file for each new sample. Config files can also be used to define parameters, or override default parameters, for specific programs being run by your workflow.
A first example - running a rule with a single sample ID
Consider this Snakefile, which creates an output file based on a
sample ID. Here the sample ID is taken from a config file and provided
via the Python dictionary named config:
configfile: "config.one_sample.yml"

SAMPLE=config['sample']

rule all:
    output:
        expand("one_sample.{s}.out", s=SAMPLE)
    shell:
        "touch {output}"
The default configuration file is config.one_sample.yml, which
sets config['sample'] to the value XYZ_123, so that running the
workflow creates one_sample.XYZ_123.out:
sample: XYZ_123
However, the configfile: directive in the Snakefile can be overridden
on the command line by using --configfile; consider the file
config.one_sample_b.yml:
sample: ABC_456
If we now run snakemake -s snakefile.one_sample --configfile config.one_sample_b.yml -j 1,
the value of sample will be set to ABC_456, and the file
one_sample.ABC_456.out will be created.
(CTB: assert that the appropriate output files are created.)
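One way to picture the override is as a dictionary merge: values from the file named by configfile: act as defaults, and values from the file given with --configfile replace them key by key. The sketch below is a simplified model in plain Python, not Snakemake's own code; the helper name merge_config and the extra threads key are invented for illustration.

```python
import copy

def merge_config(defaults, overrides):
    """Return a new dict: defaults, with values from overrides taking precedence.

    Nested dictionaries are merged recursively; any other value is replaced.
    (Hypothetical helper; a simplified model of config overriding.)
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged

# A default config like config.one_sample.yml, plus an invented 'threads'
# key (not in the original file) to show that untouched keys survive.
default_config = {"sample": "XYZ_123", "threads": 1}

# The values from config.one_sample_b.yml, as passed with --configfile.
override_config = {"sample": "ABC_456"}

config = merge_config(default_config, override_config)
output_file = "one_sample.{}.out".format(config["sample"])
print(config)
print(output_file)
```

In this model the sample value changes while keys that the second file does not mention keep their defaults, and the default dictionary itself is left unmodified.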
Specifying multiple sample IDs in a config file
The previous example only handles one sample at a time, but there's
no reason we couldn't provide multiple, using YAML lists. Consider
this Snakefile, snakefile.multi_samples:
configfile: "config.multi_samples.yml"

SAMPLES=config['samples']

rule all:
    input:
        expand("one_sample.{s}.out", s=SAMPLES)

rule make_single_sample_wc:
    output:
        "one_sample.{s}.out"
    shell:
        "touch {output}"
and this config file, config.multi_samples.yml:
samples:
- DEF_789
- GHI_234
- JKL_567
Here, we're creating multiple output files, using a more complicated setup.
First, we use samples from the config file. The config['samples'] value
is a Python list of strings, instead of a single Python string as in the
previous example; that's because config.multi_samples.yml specifies
samples as a YAML list.
Second, we switched to using a wildcard rule in the Snakefile, because we want to run one rule on many files; the same rule now works for any number of samples without modification.
Last but not least, we provide a default rule that uses the expand
function with a single pattern and one list of values to construct
the list of output files for the wildcard rule to make.
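For a single wildcard, the expand call in the default rule behaves like a list comprehension that fills in the pattern once per value. The following plain-Python sketch mimics that for illustration only; expand_one is a hypothetical stand-in, not Snakemake's expand, and it handles just one wildcard named s.

```python
def expand_one(pattern, s):
    """Fill the {s} placeholder in pattern once for each value of s.

    A bare string is treated as a single value rather than being
    iterated character by character.
    """
    values = [s] if isinstance(s, str) else list(s)
    return [pattern.format(s=value) for value in values]

# config['samples'] as loaded from config.multi_samples.yml: a list of strings.
SAMPLES = ["DEF_789", "GHI_234", "JKL_567"]
targets = expand_one("one_sample.{s}.out", SAMPLES)
print(targets)
# ['one_sample.DEF_789.out', 'one_sample.GHI_234.out', 'one_sample.JKL_567.out']

# config['sample'] from config.one_sample.yml: a single string.
print(expand_one("one_sample.{s}.out", "XYZ_123"))
# ['one_sample.XYZ_123.out']
```

Each name in the resulting list matches the output pattern of the wildcard rule, which is how Snakemake knows to run that rule once per sample.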
Now we can either edit the list of samples in the config file, or we can provide different config files with different lists of samples!
Specifying input spreadsheets via config file
Specifying command line parameters in a config file
note, might want to have some info on parameters in output files.
Providing config variables on the command line
Debugging config files
print, pprint keys defaults
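Since config is an ordinary Python dictionary, standard Python tools can be used to inspect it. Here is a minimal sketch, using a stand-in dictionary in place of the one Snakemake builds; inside a Snakefile the same calls could be made on config directly. The threads key and its default value are invented for illustration.

```python
import pprint

# Stand-in for the dictionary Snakemake would build from config.multi_samples.yml.
config = {"samples": ["DEF_789", "GHI_234", "JKL_567"]}

# Show the whole configuration in a readable layout.
pprint.pprint(config)

# List the top-level keys that were actually provided.
provided_keys = sorted(config.keys())
print("keys:", provided_keys)

# Fall back to a default when a key is absent ('threads' is a made-up key).
threads = config.get("threads", 1)
print("threads:", threads)
```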
Recap
With config files, you can:
- separate configuration from your workflow
- provide multiple different config files for the same workflow
- change the samples by editing a YAML file instead of a Snakefile
- make it easy to validate your input configuration (DISCUSS)
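As one possible illustration of the last point, a workflow could check its configuration before any rules run. The sketch below uses plain Python and a hypothetical helper, check_samples; it is only one way to do this, and it assumes that a valid configuration has a non-empty list of non-empty strings under samples.

```python
def check_samples(config):
    """Return the sample list from config, or raise ValueError if it is unusable."""
    if "samples" not in config:
        raise ValueError("config is missing the 'samples' key")
    samples = config["samples"]
    if not isinstance(samples, list) or not samples:
        raise ValueError("'samples' must be a non-empty list")
    for sample in samples:
        if not isinstance(sample, str) or not sample:
            raise ValueError("invalid sample ID: {!r}".format(sample))
    return samples

good = {"samples": ["DEF_789", "GHI_234", "JKL_567"]}
print(check_samples(good))

# A config written for the single-sample Snakefile fails the check.
try:
    check_samples({"sample": "XYZ_123"})
except ValueError as err:
    message = str(err)
    print("rejected:", message)
```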
Leftovers
- Point to official snakemake docs
- Guide to YAML and JSON syntax