Subsampling FASTQ files
Level: beginner+
In Using wildcards to generalize your rules, we introduced the use of wildcards to generate
rule all:
    input:
        "big.subset100.fastq"

rule subset:
    input:
        "big.fastq"
    output:
        "big.subset{num_lines}.fastq"
    shell: """
        head -n {wildcards.num_lines} {input} > {output}
    """
Ref:
- wildcards
Subsampling records rather than lines
One potential problem here is that the subset files are defined by the number of lines, not the number of records; in FASTQ files, a record typically spans four lines. Ideally, the filename of the subset FASTQ file produced by the recipe above would give the number of records rather than the number of lines. That, however, requires multiplying the number of records by 4 to get the number of lines to pass to head.
You can do this using params: functions, which let you introduce Python functions into your rules.
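As a minimal sketch of that idea, a params: function might look like the following. The names calc_num_lines and num_records are hypothetical, chosen here for illustration. Snakemake passes the rule's wildcards object as the first argument to a params: function, and wildcard values arrive as strings, so the value is converted to an integer before multiplying. The SimpleNamespace at the end is only a stand-in for the wildcards object so the function can be tried outside a workflow.

```python
from types import SimpleNamespace


def calc_num_lines(wildcards):
    """Hypothetical params: function: convert a record count to a line count."""
    # Wildcard values are strings, so convert before doing arithmetic.
    num_records = int(wildcards.num_records)
    # Assumes the typical FASTQ layout of four lines per record.
    return num_records * 4


# In the Snakefile, the subset rule could then refer to the function by name
# (shown as comments, since rule syntax is not plain Python):
#
# rule subset:
#     input:
#         "big.fastq"
#     output:
#         "big.subset{num_records}.fastq"
#     params:
#         num_lines = calc_num_lines
#     shell: """
#         head -n {params.num_lines} {input} > {output}
#     """

# Stand-alone check, using SimpleNamespace in place of Snakemake's wildcards.
print(calc_num_lines(SimpleNamespace(num_records="25")))  # prints 100
```

With a rule written this way, requesting big.subset25.fastq would produce a file containing the first 25 records (100 lines) of big.fastq, provided every record in the input really does span four lines.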