Annotating a Metatranscriptome
Our assembly contains a set of contigs that represent transcripts, or fragments of transcripts. To get at the functional content of these transcripts, we must first find the most likely open reading frames (ORFs).
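dammit takes care of ORF finding for us (via TransDecoder, described below), but for intuition, here is roughly what that step looks like when run by hand; a minimal sketch assuming TransDecoder is installed and using the assembly file from later in this lesson:

# Extract candidate long ORFs from the assembly (dammit runs this step for you)
TransDecoder.LongOrfs -t tara135_SRF_megahit.fasta
# Score the candidates and predict the most likely coding regions
TransDecoder.Predict -t tara135_SRF_megahit.fasta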
Annotating transcriptomes and metatranscriptomes with dammit
dammit!
dammit is an annotation pipeline written by Camille Scott. dammit runs a relatively standard annotation protocol for transcriptomes: it begins by building gene models with TransDecoder, and then uses the following protein databases as evidence for annotation: Pfam-A, Rfam, OrthoDB, and uniref90 (uniref90 is optional, enabled with --full).
If a protein dataset is available, this can also be supplied to the dammit pipeline with --user-databases as optional evidence for annotation.
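For example, if you had a FASTA file of proteins from a closely related organism, you could pass it in as extra evidence; the protein file name here is hypothetical:

# Supply a custom protein FASTA (hypothetical file name) as extra annotation evidence
dammit annotate tara135_SRF_megahit.fasta --user-databases my_proteins.pep.fasta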
In addition, BUSCO v3 is run, which compares the gene content in your transcriptome with a lineage-specific dataset. The output is the proportion of your transcriptome that matches the dataset, which can be used as an estimate of the completeness of your transcriptome based on evolutionary expectation (Simão et al. 2015).
There are several lineage-specific datasets available from the authors of BUSCO. We will use the eukaryota dataset for this transcriptome.
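BUSCO reports its results as a compact summary line; a hypothetical example (numbers invented) for the eukaryota dataset might look like:

C:82.5%[S:78.1%,D:4.4%],F:7.0%,M:10.5%,n:303

Here C is the fraction of complete BUSCOs (S single-copy, D duplicated), F is fragmented, M is missing, and n is the number of genes in the lineage dataset.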
Installation
Annotation necessarily requires a lot of software! dammit attempts to simplify this and make it as reliable as possible, and conda makes it even easier. We've already installed dammit in the tara environment, but if you need to install it in the future, here's the command:
conda install dammit
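dammit is distributed through bioconda, so if conda can't find the package, you may need to enable the bioconda and conda-forge channels first (a standard one-time setup):

# One-time channel setup for bioconda packages
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge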
Database Preparation
dammit has two major subcommands: dammit databases and dammit annotate. databases checks that the databases are installed and prepared, and if run with the --install flag, will perform that installation and preparation. If you just run dammit databases on its own, you should get a notification that some database tasks are not up-to-date -- we need to install them!
We're going to run a quick version of the pipeline by adding the --quick parameter, which omits OrthoDB, Uniref90, Pfam, and Rfam. A full run takes longer to install and execute, but gives you access to the complete annotation pipeline.
export DAMMIT_DB_DIR=/LUSTRE/bioinformatica_data/bioinformatica2018/dammit_databases
dammit databases --install --busco-group eukaryota --quick
Note: the dammit databases can be quite large, so make sure you have a lot of space for them. Don't put them in your home directory on a cluster, for example!
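You can check how much space the databases occupy with a quick du (this assumes the DAMMIT_DB_DIR export above):

# Report the total size of the dammit database directory
du -sh $DAMMIT_DB_DIR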
We used the "eukaryota" BUSCO group. We can use any of the BUSCO databases, so long as we install them with the dammit databases subcommand. You can see the whole list by running dammit databases -h. You should try to match your species as closely as possible for the best results. If we want to install another, for example:
dammit databases --install --busco-group metazoa --quick
Note: By default, dammit installs databases in your home directory. However, when space is limited, as it is here, we can install the databases in another location (e.g. /LUSTRE/bioinformatica_data/bioinformatica2018/dammit_databases) by setting the DAMMIT_DB_DIR environment variable, as we did above.
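To make that location stick between sessions, you could append the export to your shell startup file (assuming bash; the path is this workshop's shared database directory):

# Persist the database location across logins (bash assumed)
echo 'export DAMMIT_DB_DIR=/LUSTRE/bioinformatica_data/bioinformatica2018/dammit_databases' >> ~/.bashrc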
Annotating your metatranscriptome
Keep things organized! Let's make a project directory:
Make sure you still have the PROJECT variable:
echo $PROJECT
If you don't see any output, set the PROJECT variable again:
export PROJECT=~/work
Now let's make a directory for annotation:
cd $PROJECT
mkdir -p annotation
cd annotation
We ran megahit earlier to generate an assembly. Let's link that assembly to this directory:
ln -s $PROJECT/assembly/tara135_SRF_megahit.fasta ./
Make sure you run ls and see the assembly file.
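A long listing will also confirm the symlink points where you expect:

# -l displays the symlink target alongside the file name
ls -l tara135_SRF_megahit.fasta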
Just annotate it, Dammit!
dammit annotate tara135_SRF_megahit.fasta --busco-group eukaryota --n_threads 6
While dammit runs, it will print out which tasks it's running to the terminal. dammit is written with a library called pydoit, a Python workflow library similar to GNU Make. This not only helps organize the underlying workflow, but also means that if we interrupt it, it will properly resume!
After a successful run, you'll have a new directory called tara135_SRF_megahit.fasta.dammit. If you look inside, you'll see a lot of files:
ls tara135_SRF_megahit.fasta.dammit/
The most important files for you are tara135_SRF_megahit.fasta.dammit.fasta, tara135_SRF_megahit.fasta.dammit.gff3, and tara135_SRF_megahit.fasta.dammit.stats.json.
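A couple of quick sanity checks on these outputs, using standard shell tools (nothing dammit-specific):

# Peek at the first few annotation records in the GFF3
head tara135_SRF_megahit.fasta.dammit/tara135_SRF_megahit.fasta.dammit.gff3
# Count the annotated transcripts in the output FASTA
grep -c '^>' tara135_SRF_megahit.fasta.dammit/tara135_SRF_megahit.fasta.dammit.fasta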
If the above dammit command is run again, there will be a message:
**Pipeline is already completed!**