Techniques for debugging workflow execution (and fixing problems!)

Some initial words of wisdom

Debugging complex computer situations is an art -- or, at least, it is not easily systematized. There are guidelines and even rules to debugging, but no single procedure or approach that is guaranteed to work.

This chapter focuses on how to debug snakemake workflows. The odds are that you're reading this chapter because you are trying hard to get something to work. Heck, you're probably only reading this sentence because you're desperate.

Below are the most useful pieces of advice we can give you about debugging at this point in your snakemake journey.

First, simplify the workflow as much as possible so that it is fast to run. For example, reduce the number of samples to 1 or 2 (@@) and subsample input files so that they are small. This will make it faster to run and decrease the time between testing your results.

Second, focus on one rule at a time. Run each rule, one by one, until you find one that is not doing what you want it to do. Then focus on fixing that. This will provide you with an increasingly solid path through the snakemake rules.

Third, print out the commands being run (using -p) and examine the wildcards in the snakemake output carefully. Make sure both the commands and the wildcard values are what you expect. Find the first rule where they aren't and fix that rule. This will ensure that at each stage, your wildcards are ...

The three stages of snakemake debugging

There are three common stages of debugging you'll encounter when creating or modifying a snakemake workflow.

First, you'll have syntax errors caused by mismatched indentation and whitespace, as well as mismatched quotes. These errors will prevent snakemake from reading your Snakefile.

Second, you'll find problems connecting rules and filling in wildcards. This will prevent snakemake from executing any jobs.

And third, you'll have actual execution errors that make specific rules or jobs fail. These errors will prevent your workflow from finishing.

This chapter will cover the sources of the most common types of these errors, and will also provide tips and techniques for avoiding or fixing many of them.

  • intermediate targets
  • debug-dag
  • logs
  • print in Snakefile (use file=)
  • finding and reading error messages - silence, killed, etc.
  • running in single-CPU mode
  • whitespace
  • filling in wildcards
  • use --until to specify a rule to go to
  • focus on one wildcard at a time
  • thought: maybe do a thing where we really dig into a set of debugging?

@@ suggested procedure after syntax: first run with -j big and -k; then everything left will be blocking errors.

After the syntax errors: running your snakemake workflow

Here is a short list of tactics to use when trying to debug execution errors in your snakemake workflow -- that is, after you resolve any syntax errors preventing snakemake from reading the Snakefile.

  1. Run snakemake with -n/--dry-run, and inspect the output. This will tell you if the snakemake workflow will run the rules and produce the output you're actually interested in.
  2. Run snakemake with -j/--cores 1. This will run your jobs one after the other, in serial mode; this will make the output from snakemake jobs less confusing, because only one job will be running at a time.
  3. Run snakemake with -p/--printshellcmds. This will print out the actual shell commands that are being run.
  4. Run just the rules you're trying to debug by specifying either the rule name or a filename on the command line (see Running rules and choosing targets from the command line for more information).

Finding, fixing, and avoiding syntax errors.

Whitespace and indentation errors: finding, fixing, and avoiding them.

Use a good editor, e.g. vscode or some other text editor. Put it in snakemake mode or Python mode (spaces etc.)

Syntax errors, newlines, and quoting.

triple quotes vs single quotes

deleting lines.

Debugging Snakefile workflow declarations/specifications @@

MissingInputException

One of the most common errors to encounter when writing a new workflow is a MissingInputException. This is snakemake's way of saying three things: first, it has figured out that it needs a particular file; second, that file does not already exist; and third, it doesn't know how to make that file (i.e. there's no rule that produces that file).

For example, consider this very simple workflow file:

# expect_fail

rule example:
    input:
        "file-does-not-exist"

When we run it, we get:

MissingInputException in rule example in file /Users/t/dev/2023-snakemake-book-draft/code/examples/errors.simple-fail/snakefile.missing-input, line 1:
Missing input files for rule example:
    affected files:
        file-does-not-exist

This error comes up in two common situations: either there is an input file that you were supposed to provide the workflow but that is missing (e.g. a missing FASTQ file); or the rule that is supposed to produce this file (as an output) doesn't properly match.

MissingOutputException and increasing --latency-wait

Sometimes you will see an error message that mentions a MissingOutputException and suggests increasing the wait time with --latency-wait. This is most frequently a symptom of a rule that does not properly create an expected output file.

For example, consider:

# expect_fail

rule example:
    output:
       "file-does-not-exist"
    shell: """
       touch file-does-not-exist-typo
    """   

Here we have a simple rule whose output block specifies that it will create a file named file-does-not-exist, but (due to a typo in the shell command) creates the wrong file instead. If we run this, we will get the following message:

Waiting at most 5 seconds for missing files.
MissingOutputException in rule example in file /Users/t/dev/2023-snakemake-book-draft/code/examples/errors.simple-fail/snakefile.missing-output, line 3:
Job 0 completed successfully, but some output files are missing. Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
file-does-not-exist

First, let's remember that the output: block is simply an annotation, not a directive: it's telling snakemake what this rule is supposed to create, without actually creating it @@ (link to input-output here). The part of the rule that creates the file is typically the shell: block, and, here, we've made a mistake in the shell block, and are creating the wrong file.

There's no simple way for snakemake to know what files were actually created by a shell block, so snakemake doesn't try: it simple complains that we said running this rule would create a particular file, but it didn't create that file when we ran it. That's what MissingOutputException generally means.

To fix this, we need to look at the shell command and understand why it is not creating the desired file. That can get complicated, but one common fix is to avoid writing filenames redundantly and instead use {output} patterns in the shell block so that you don't accidentally use different names in the output: block and in the shell: block.

So then what is this message about waiting 5 seconds for missing files, and/or increasing --latency-wait? This refers to an advanced situation (discussed @@later) that can occur when we are writing to a shared network file system from jobs running on multiple machines. If you're running snakemake on a single machine, this should never be a problem! We'll defer discussion of this until later.

WorkflowError and wildcards

Another common error is a `WorfklowError: Target rules may not contain wildcards." This occurs when snakemake is asked to run a rule that contains wildcards.

Consider:

# expect_fail

rule example:
    input: "{name}.input"
    output: "{name}.output"
    shell: "cp {input} {output}"

which generates:

WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards at the command line, or have a rule without wildcards at the very top of your workflow (e.g. the typical "rule all" which just collects all results you want to generate in the end).

This error occurs in this case because there is only one rule in the snakemake workflow, and when werun snakemake it will default to running that rule as its target. However, that rule uses wildcards in its output block, and hence cannot be a target.

You can also encounter this error when you specify a rule name explicitly; if the rule you ask snakemake to run by name contains a wildcard in its output block, you can't run the rule directly - you have to give it a filename that snakemake can use to infer the wildcard.

In either case, the solution is to either ask snakemake to build a filename, or give snakemake a target that does not include wildcards. For example, if the file XYZ.input existed in the directory, here we could either specify XYZ.output on the command line, or we could write a new default rule that specified the name XYZ.output as a pseudo-target:

rule all:
    input:
        "XYZ.output"

Either solution has the effect of providing the rule example with a value to substitute for the wildcard name.

See Using wildcards to generalize your rules and Targets for more information.

Debugging running snakemake workflows

Run your rules once target at a time.

Run your rules one job at a time.

Finding and interpreting error messages

Display of error messages for failed commands

Running all the rules you can with -k/--keep-going

Snakemake has a slightly confusing presentation of error messages from shell commands: the messages appear above the notification that the rule failed

Consider the following Snakefile:

# expect_fail

rule hello_fail:
    shell: """
        ls file-does-not-exist
    """

When you run this in a directory that does not contain the file named file-does-not-exist, you will see the following output:

[Fri Apr 14 14:59:29 2023]                 
rule hello_fail:
    jobid: 0
    reason: Rules with neither input nor output files are always executed.
    resources: tmpdir=/var/folders/6s/_f373w1d6hdfjc2kjstq97s80000gp/T

ls: cannot access 'file-does-not-exist': No such file or directory
[Fri Apr 14 14:59:29 2023]
Error in rule hello_fail:
    jobid: 0
    shell:
        
        ls file-does-not-exist
    
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

There are three parts to this output:

  • the first part starts at rule hello_fail:, and declares that snakemake is going to run this rule, and gives the reason why.

  • the second part contains the error message from running that command - here, ls fails because the file in question does not exist, and so it prints out ls: cannot access 'file-does-not-exist': No such file or directory . This is the error output by the failed command.

  • the third part starts at "Error in rule hello_fail" and describes the rule that failed: its name hello_fail, its jobid, and the shell command that was run (ls file-does-not-exist), together with some information about how the failure was detected (a non-zero exit code @@) and how the shell command was run (in so-called "strict mode" @@).

The somewhat non-intuitive part here is that the error message that is specific to the failed rule - that the file in question did not exist - appears above the notification of failure.

There are some good reasons for this (@@ something to do about stdout capture) and various ways to change this behavior (@@ logging) but, by default, this is how snakemake reports errors in shell commands.

What this means in practice is that when you are debugging a failed shell command, the place to look for the snakemake error is above the notification of the failure!

@@ describe bash strict mode

@@ describe (briefly) logging

@@ when running with -j more than 1

Out of memory errors: "Killed".

CTB: is it lowercase or uppercase?

Sometimes you will see a "rule failed" @@ error from snakemake, and the only error message that you will be able to find is "killed". What is this?

This generally means that your shell command (or shell process) was terminated by an unavoidable signal from the operating system - and the most common such signal is an out-of-memory error.

When a process uses too much memory, the default behavior of the operating system is to immediately terminate it - there's not much else to be done. Unfortunately, the default error message explaining this is somewhat lacking.

There is no single way to fix this problem, unfortunately. A few general strategies include:

  • switching to a system with more memory, or (if you are using a queuing system like slurm) requesting more memory for your job.
  • if you are using a program that asks you to specify an amount of memory to use (e.g. some assemblers, or any java program), you can decrease the amount of memory you request on the command line.
  • you can also decrease the size of the dataset you are using, perhaps by subdividing it or sub-sampling @@.