input:
and output:
blocks
@@ make a note somewhere that these are annotations, not directives,
and that's why we suggest using {output}
.
@@ make a note saying that if it wants one output, it will run the rule.
As we saw in Chapter 2, snakemake will automatically "chain" rules by connecting inputs to outputs. That is, snakemake will figure out what to run in order to produce the desired output, even if it takes many steps.
In Chapter 3, we also saw that snakemake will fill
in {input}
and {output}
in the shell command based on the contents
of the input:
and output:
blocks. This becomes even more useful
when using wildcards to generalize rules, as shown in
Chapter 6, where wildcard values are properly
substituted into the {input}
and {output}
values.
Input and output blocks are key components of snakemake workflows. In this chapter, we will discuss the use of input and output blocks a bit more comprehensively.
Providing inputs and outputs
As we saw previously, snakemake will happily take multiple input and output values via comma-separated lists and substitute them into strings in shell blocks.
rule example:
input:
"file1.txt",
"file2.txt",
output:
"output file1.txt",
"output file2.txt",
shell: """
echo {input:q}
echo {output:q}
touch {output:q}
"""
When these are substituted into shell commands with {input}
and
{output}
they will be turned into space-separated ordered lists:
e.g. the above shell command will print out first file1.txt file2.txt
and then output file1.txt output file2.txt
before using touch
to
create the empty output files.
In this example we are also asking snakemake to quote filenames for
the shell command using :q
- this means that if there are spaces,
characters like single or double quotation marks, or other characters
with special meaning they will be properly escaped using
Python's shlex.quote function.
For example, here both output files contain a space, and so touch {output}
would create three files -- output
, file1.txt
, and
file2.txt
-- rather than the correct two files, output file1.txt
and output file2.txt
.
Quoting filenames with {...:q}
should always be used for anything
executed in a shell block - it does no harm and it can prevent
serious bugs!
In the above code example, you will notice that "file2.txt"
and
"output file2.txt"
have commas after them:
rule example:
input:
"file1.txt",
"file2.txt",
output:
"output file1.txt",
"output file2.txt",
shell: """
echo {input:q}
echo {output:q}
touch {output:q}
"""
Are these required? No. The above code is equivalent to:
rule example:
input:
"file1.txt",
"file2.txt"
output:
"output file1.txt",
"output file2.txt"
shell: """
echo {input:q}
echo {output:q}
touch {output:q}
"""
where there are no commas after the last line in input and output.
The general rule is this: you need internal commas to separate items
in the list, because otherwise strings will be concatenated to each
other - i.e. "file1.txt" "file2.txt"
will become "file1.txtfile2.txt"
,
even if there's a newline between them! But a comma trailing after the
last filename is optional (and ignored).
Why!? These are Python tuples and you can add a trailing comma if
you like: a, b, c,
is equivalent to a, b, c
. You can read more
about that syntax here (CTB link to specific
section).
So why do we add a trailing comma?! I suggest using trailing commas because it makes it easy to add a new input or output without forgetting to add a comma, and this is a mistake I make frequently! This is a (small and simple but still useful) example of defensive programming, where we can use optional syntax rules to head off common mistakes.
Inputs and outputs are ordered lists
We can also refer to individual input and output entries by using square brackets to index them as lists, starting with position 0:
rule example:
...
shell: """
echo first input is {input[0]:q}
echo second input is {input[1]:q}
echo first output is {output[0]:q}
echo second output is {output[1]:q}
touch {output}
"""
However, we don't recommend this because it's fragile. If you change the order of the inputs and outputs, or add new inputs, you have to go through and adjust the indices to match. Relying on the number and position of indices in a list is error prone and will make changing your Snakefile harder!
Using keywords for input and output files
You can also name specific inputs and outputs using the keyword
syntax, and then refer to those using input.
and output.
prefixes.
The following Snakefile rule does this:
rule example:
input:
a="file1.txt",
b="file2.txt",
output:
a="output file1.txt",
c="output file2.txt"
shell: """
echo first input is {input.a:q}
echo second input is {input.b:q}
echo first output is {output.a:q}
echo second output is {output.c:q}
touch {output:q}
"""
Here, a
and b
in the input block, and a
and c
in the output block,
are keyword names for the input and output files; in the shell command,
they can be referred to with {input.a}
, {input.b}
, {output.a}
, and
{output.c}
respectively. Any valid variable name can be used, and the
same name can be used in the input and output blocks without collision,
as with input.a
and output.a
, above, which are distinct values.
This is our recommended way of referring to specific input and output files. It is clearer to read, robust to rearrangements or additions, and (perhaps most importantly) can help guide the reader (including "future you") to the purpose of each input and output.
If you use the wrong keyword names in your shell code, you'll get an error message. For example, this code:
rule example:
input:
a="file1.txt",
output:
a="output file1.txt",
shell: """
echo first input is {input.z:q}
"""
gives this error message:
AttributeError: 'InputFiles' object has no attribute 'z', when formatting the following:
echo first input is {input.z:q}
Example: writing a flexible command line
One example where it's particularly useful to be able to refer to
specific inputs is when running programs on files where the input
filenames need to be specified as optional arguments. One such
program is the megahit
assembler when it runs on paired-end input
reads. Consider the following Snakefile:
rule all:
input:
"assembly_out"
rule assemble:
input:
R1="sample_R1.fastq.gz",
R2="sample_R2.fastq.gz",
output:
directory("assembly_out")
shell: """
megahit -1 {input.R1} -2 {input.R2} -o {output}
"""
In the shell command here, we need to supply the input reads as two
separate files, with -1
before one and -2
before the second. As a
bonus the resulting shell command is very readable!
Input functions and more advanced features
There are a number of more advanced uses of input and output that rely on Python programming - for example, one can define a Python function that is called to generate a value dynamically, as below -
def multiply_by_5(w):
return f"file{int(w.val) * 5}.txt"
rule make_file:
input:
# look for input file{val*5}.txt if asked to create output{val}.txt
filename=multiply_by_5,
output:
"output{val}.txt"
shell: """
cp {input} {output:q}
"""
When asked to create output5.txt
, this rule will look for
file25.txt
.
Since this functionality relies on knowledge of wildcards as well as some knowledge of Python, we will defer discussion of it until later!