Subsetting FASTQ files to a fixed number of records.
Level: intermediate
In the subsampling files recipe, we showed how to output a file with a specific number of lines in it based only on the output filename. What if we want to sample a specific number of records from a FASTQ file? To do this we need to transform the number of records in a wildcard into the number of lines.
Snakemake makes this possible by supporting functions in its params: blocks (ref CTB XXX params blocks). In the following recipe, we calculate the number of lines to sample based on the number of records specified in the num_records wildcard:
def calc_num_lines(wildcards):
    # convert wildcards.num_records to an integer:
    num_records = int(wildcards.num_records)

    # calculate number of lines (records * 4)
    num_lines = num_records * 4

    return num_lines

rule all:
    input:
        "big.subset25.fastq"

rule subset:
    input:
        "big.fastq"
    output:
        "big.subset{num_records}.fastq"
    params:
        num_lines = calc_num_lines
    shell: """
        head -{params.num_lines} {input} > {output}
    """
There are two special components here:
- the Python function calc_num_lines takes a wildcards object as a parameter, and calculates the number of lines to subset based on the value of wildcards.num_records;
- then, the params: block applies calc_num_lines to generate params.num_lines, which can then be used in the shell command.
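To see what the params function computes outside of Snakemake, here is a minimal sketch that calls calc_num_lines directly. It assumes a types.SimpleNamespace as a hypothetical stand-in for the wildcards object; when the rule actually runs, Snakemake supplies its own object, and the stand-in only mimics attribute access.

```python
from types import SimpleNamespace


def calc_num_lines(wildcards):
    # wildcard values arrive as strings, so convert before doing arithmetic
    num_records = int(wildcards.num_records)
    # the recipe counts 4 lines per FASTQ record
    return num_records * 4


# hypothetical stand-in for the wildcards Snakemake would extract
# from the output filename "big.subset25.fastq"
fake_wildcards = SimpleNamespace(num_records="25")

num_lines = calc_num_lines(fake_wildcards)
print(num_lines)  # 100
```

Requesting big.subset25.fastq therefore results in head taking the first 100 lines of big.fastq.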
References:
- CTB params
- CTB namespaces
- CTB python code
Using lambda
The recipe above is fairly long. You can write a much shorter (but also harder to understand!) Snakefile using anonymous "lambda" functions:
rule all:
    input:
        "big.subset25.fastq"

rule subset:
    input:
        "big.fastq"
    output:
        "big.subset{num_records}.fastq"
    params:
        num_lines = lambda wildcards: int(wildcards.num_records) * 4
    shell: """
        head -{params.num_lines} {input} > {output}
    """
Here, lambda creates an anonymous function that takes a single parameter, wildcards, and returns the value of wildcards.num_records converted to an integer and multiplied by 4.
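The int() call in the lambda matters because wildcard values are strings. The sketch below illustrates the difference in plain Python, again assuming a types.SimpleNamespace as a hypothetical stand-in for the wildcards object that Snakemake would pass in.

```python
from types import SimpleNamespace

# hypothetical stand-in for Snakemake's wildcards object
fake_wildcards = SimpleNamespace(num_records="25")

# the same lambda used in the params: block above
num_lines = lambda wildcards: int(wildcards.num_records) * 4
with_int = num_lines(fake_wildcards)

# without int(), Python repeats the string instead of multiplying
without_int = fake_wildcards.num_records * 4

print(with_int)     # 100
print(without_int)  # 25252525
```

Leaving out the conversion would pass 25252525 to head instead of 100, so the subset would be far larger than intended.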