Assembling a Metatranscriptome

Learning objectives:

What is metatranscriptome assembly?
How do assemblers work?
Checking the quality of assembly
Understanding metatranscriptome assembly

The basic idea with any transcriptome assembly is you feed in your reads and you get out a bunch of contigs that represent transcripts, or stretches of RNA present in the reads that don't have any long repeats or much significant polymorphism. You run a transcriptome assembly program using the trimmed reads as input and get out a pile of assembled RNA.

These contigs will represent transcripts that come from the eukaryotic organisms (poly-A mRNAseq reads) found in each environmental sample.

Install Megahit

We already installed megahit for you in setup, but here's the installation command for future reference.

conda install megahit

Link in the trimmed data

We will be using the same set of TARA oceans mRNAseq reads that we trimmed in the last lesson from Alberti et al., 2017.

Create a new folder assembly to work in

cd $PROJECT
mkdir -p assembly
cd assembly

Link the khmer-trimmed data we prepared earlier in the newly created folder:

ln -fs ${PROJECT}/khmer_trim/*.khmer.pe.fq.gz .
ls

Run the assembler

Let's run an assembly:

time megahit --12 TARA_135_SRF_5-20_rep1_1m.khmer.pe.fq.gz,TARA_135_SRF_5-20_rep2_1m.khmer.pe.fq.gz  --memory 8e9 --num-cpu-threads 2 --out-prefix TARA_135_SRF --out-dir ./TARA_135_SRF_khmer -f

This will take about 10 minutes; at the end you should see output like this:

--- [STAT] 11733 contigs, total 5202861 bp, min 200 bp, max 4235 bp, avg 443 bp, N50 465 bp
--- [Wed Nov  7 02:13:12 2018] ALL DONE. Time elapsed: 431.097547 seconds ---

The output assembly will be TARA_135_SRF_khmer/TARA_135_SRF.contigs.fa.

Looking at the assembly

First, let's copy the assembly into our current directory:

cp ./TARA_135_SRF_khmer/*contigs.fa tara135_SRF_megahit.fasta

Now, look at the beginning:

head tara135_SRF_megahit.fasta

These are the transcripts! Yay!

What can we do with an assembly?

Why would we do an assembly? What advantages does an assembly have over the reads? And what can we do with this assembly?

assembly squashes redundant reads, so the assembly should have approximately one sequence per transcript, as opposed to the reads, which many have many reads per transcript.
assembled contigs are longer than reads, so it is easier to do gene search on them. We'll cover this tomorrow!
assemblies also have fewer errors than the reads do, so sample comparisons and so on may be more accurate. However, assembly also may eliminate some of the data if it's really low coverage, and abundance information is lost as well.

Further Reference

There are other asssemblers you can try on your data, some of which are listed on our References page. One that we are excited about is PLASS, which does assembly at the protein level!