Subsetting FASTQ files to a fixed number of records.

Level: intermediate

In the subsampling files recipe, we showed how to output a file with a specific number of lines based only on the output filename. What if we want to sample a specific number of records from a FASTQ file instead? Each FASTQ record occupies four lines, so we need to convert the number of records given in a wildcard into a number of lines.

Snakemake makes this possible by supporting functions in its params: blocks (ref CTB XXX params blocks). In the following recipe, we calculate the number of lines to sample based on the number of records specified in the num_records wildcard:

def calc_num_lines(wildcards):
    # convert wildcards.num_records to an integer:
    num_records = int(wildcards.num_records)

    # calculate number of lines (records * 4)
    num_lines = num_records * 4

    return num_lines

rule all:
    input:
        "big.subset25.fastq"

rule subset:
    input:
        "big.fastq"
    output:
        "big.subset{num_records}.fastq"
    params:
        num_lines = calc_num_lines
    shell: """
        head -n {params.num_lines} {input} > {output}
    """

There are two special components here:

the Python function calc_num_lines takes a wildcards object as a parameter, and calculates the number of lines to subset based on the value of wildcards.num_records;
then, the params: block applies calc_num_lines to generate params.num_lines, which can then be used in the shell command.
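
To see what the function does outside of Snakemake, the following minimal sketch calls calc_num_lines directly. It assumes that a types.SimpleNamespace can stand in for the wildcards object (Snakemake's own wildcards class is different, but it likewise exposes wildcard values as string attributes); fake_wildcards is a hypothetical name used only for illustration.

```python
from types import SimpleNamespace

def calc_num_lines(wildcards):
    # wildcard values are strings, so convert before doing arithmetic
    num_records = int(wildcards.num_records)
    # the recipe uses 4 lines per FASTQ record
    return num_records * 4

# hypothetical stand-in for the wildcards matched from "big.subset25.fastq"
fake_wildcards = SimpleNamespace(num_records="25")

num_lines = calc_num_lines(fake_wildcards)
print(num_lines)

# without int(), "*" repeats the string instead of multiplying the number
repeated = fake_wildcards.num_records * 4
print(repeated)
```

The second print illustrates why the int() conversion matters: multiplying the raw wildcard string by 4 repeats the text rather than producing a line count.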

References:

CTB params
CTB namespaces
CTB python code

Using lambda

The recipe above is pretty long. You can make a much shorter (but also harder to understand!) Snakefile using anonymous "lambda" functions:

rule all:
    input:
        "big.subset25.fastq"

rule subset:
    input:
        "big.fastq"
    output:
        "big.subset{num_records}.fastq"
    params:
        num_lines = lambda wildcards: int(wildcards.num_records) * 4
    shell: """
        head -n {params.num_lines} {input} > {output}
    """

Here, lambda creates an anonymous function that takes a single parameter, wildcards, and returns the value of wildcards.num_records, converted to an integer and multiplied by 4.
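
As a quick check that the lambda and a named function are interchangeable, this sketch compares the two on a few record counts. It is an illustration under assumptions rather than part of the recipe: types.SimpleNamespace stands in for the wildcards object, and calc_num_lines_lambda is a hypothetical name given to the lambda so it can be called directly.

```python
from types import SimpleNamespace

# named-function form
def calc_num_lines(wildcards):
    return int(wildcards.num_records) * 4

# lambda form, bound to a hypothetical name so we can call it here
calc_num_lines_lambda = lambda wildcards: int(wildcards.num_records) * 4

results = []
for n in ["1", "25", "1000"]:
    # stand-in wildcards object; values are strings, as in a matched filename
    w = SimpleNamespace(num_records=n)
    results.append((calc_num_lines(w), calc_num_lines_lambda(w)))

# both forms should agree for every input
for named, anonymous in results:
    print(named, anonymous)
```

Both forms compute the same value; the choice between them is mostly about readability.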

An Introduction to Snakemake for Bioinformatics
