How well did our assembly work?
Set up the directory
First, make sure you have the PROJECT variable defined:
echo $PROJECT
If you don't see any output, make sure to re-define the $PROJECT variable before continuing.
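For example, if your project lives in ~/work (adjust the path if yours is elsewhere), you can set it like this:
export PROJECT=~/work
echo $PROJECT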
Now, let's create a folder evaluation to work in:
cd $PROJECT
mkdir -p evaluation
cd evaluation
Now, let's copy in the assembly we made in the previous lesson:
cp $PROJECT/assembly/tara135_SRF_megahit.fasta ./
Generate assembly statistics with Transrate
Transrate is a program that can be used for a couple of different types of assembly evaluation. The first and simplest is to calculate length-based metrics for the assembly, such as the total number of bases and the N50 of the contigs. Transrate can also be used to compare two assemblies, or to give you a score representing the proportion of input reads that provide positive support for the assembly. For a further explanation of the metrics and how to run reference-based transrate, see the documentation and the paper by Smith-Unna et al. 2016.
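Transrate computes all of these length-based metrics for us, but as a quick sanity check you could also count contigs and total bases yourself with a one-liner like this (a rough sketch using awk, run on the copy of the assembly in this directory):
# count contigs and sum up bases in the assembly FASTA
awk '/^>/ {n++} !/^>/ {bases += length($0)} END {print n" contigs, "bases" bases"}' tara135_SRF_megahit.fasta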
We have installed transrate for you. However, if you need to install it in the future, see installation instructions here.
See options for running transrate:
transrate -h
Let's use transrate to calculate some stats on our assembly contigs:
transrate --assembly tara135_SRF_megahit.fasta
You should see output that looks like this:
[ INFO] 2018-11-06 23:50:35 : Loading assembly: /LUSTRE/bioinformatica_data/bioinformatica2018/assembly/tara135_SRF_megahit.fasta
[ INFO] 2018-11-06 23:50:35 : Analysing assembly: /LUSTRE/bioinformatica_data/bioinformatica2018/assembly/tara135_SRF_megahit.fasta
[ INFO] 2018-11-06 23:50:35 : Results will be saved in /LUSTRE/bioinformatica_data/bioinformatica2018/assembly/transrate_results/tara135_SRF_megahit
[ INFO] 2018-11-06 23:50:35 : Calculating contig metrics...
[ INFO] 2018-11-06 23:50:35 : Contig metrics:
[ INFO] 2018-11-06 23:50:35 : -----------------------------------
[ INFO] 2018-11-06 23:50:35 : n seqs 1502
[ INFO] 2018-11-06 23:50:35 : smallest 200
[ INFO] 2018-11-06 23:50:35 : largest 4998
[ INFO] 2018-11-06 23:50:35 : n bases 638347
[ INFO] 2018-11-06 23:50:35 : mean len 425.0
[ INFO] 2018-11-06 23:50:35 : n under 200 0
[ INFO] 2018-11-06 23:50:35 : n over 1k 40
[ INFO] 2018-11-06 23:50:35 : n over 10k 0
[ INFO] 2018-11-06 23:50:35 : n with orf 331
[ INFO] 2018-11-06 23:50:35 : mean orf percent 83.54
[ INFO] 2018-11-06 23:50:35 : n90 232
[ INFO] 2018-11-06 23:50:35 : n70 360
[ INFO] 2018-11-06 23:50:35 : n50 453
[ INFO] 2018-11-06 23:50:35 : n30 599
[ INFO] 2018-11-06 23:50:35 : n10 935
[ INFO] 2018-11-06 23:50:35 : gc 0.51
[ INFO] 2018-11-06 23:50:35 : bases n 0
[ INFO] 2018-11-06 23:50:35 : proportion n 0.0
[ INFO] 2018-11-06 23:50:35 : Contig metrics done in 0 seconds
[ INFO] 2018-11-06 23:50:35 : No reads provided, skipping read diagnostics
[ INFO] 2018-11-06 23:50:35 : No reference provided, skipping comparative diagnostics
[ INFO] 2018-11-06 23:50:35 : Writing contig metrics for each contig to /LUSTRE/bioinformatica_data/bioinformatica2018/assembly/transrate_results/tara135_SRF_megahit/contigs.csv
[ INFO] 2018-11-06 23:50:35 : Writing analysis results to assemblies.csv
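Transrate also writes per-contig metrics to a contigs.csv file; the exact location is printed near the end of the log (a transrate_results/ directory created where you ran transrate). If you want to peek at it, something like this should work:
head transrate_results/tara135_SRF_megahit/contigs.csv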
Comparing Assemblies
We built a metatranscriptome with the full set of TARA_SRF reads. Copy this into your evaluation directory:
cd ${PROJECT}/evaluation
ln -s /LUSTRE/bioinformatica_data/bioinformatica2018/assembly/tara_f135_full_megahit.fasta ./
- How do the two transcriptomes compare with each other?
# full vs. subset
transrate --reference=tara_f135_full_megahit.fasta --assembly=tara135_SRF_megahit.fasta --output=full_v_subset
# subset vs. full
transrate --assembly=tara_f135_full_megahit.fasta --reference=tara135_SRF_megahit.fasta --output=subset_v_full
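Each run writes CSV files with the comparative metrics; the exact layout can differ between transrate versions, so an easy way to locate and skim them is:
ls full_v_subset/ subset_v_full/
find . -name '*.csv'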
How well does this assembly represent our sequenced reads?
It's useful to know how well the transcripts represent the sequenced reads. To do this, we'll need to link in the reads we used to generate this assembly:
ln -s ${PROJECT}/trim/TARA_135_SRF_5-20_*.qc.fq.gz ./
Quantifying reads with Salmon
We will use Salmon to quantify expression. Salmon is a new breed of software for quantifying RNAseq reads that is very fast and takes transcript length into consideration (Patro et al. 2017).
You can read more about salmon-like "pseudoalignment" here:
- Intro blog post: http://robpatro.com/blog/?p=248
- A 2016 blog post evaluating and comparing methods here
- Salmon github repo here
- 2018 paper: A direct comparison of genome alignment and transcriptome pseudoalignment
Quantification with Salmon
Check that salmon is available and see run options:
salmon -h
Now let's check that we still have the trimmed data we created on day 1:
set -u
printf "\nMy trimmed data is in $PROJECT/trim/, and consists of $(ls -1 ${PROJECT}/trim/*.qc.fq.gz | wc -l) files\n\n"
set +u
where set -u should let you know if you have any unset variables, i.e. if the $PROJECT variable is not defined.
If you see -bash: PROJECT: unbound variable, then you need to set the $PROJECT variable:
export PROJECT=~/work
and then re-run the printf code block.
NOTE: if you do not have these files, please re-run the quality trimming steps here
Create a directory to work in
cd ${PROJECT}
mkdir -p quant
cd quant
Link an assembly
We will use a full assembly for mapping. This assembly was made with all of the TARA_135 SRF reads, rather than the subset we used in the assembly tutorial.
ln -s /LUSTRE/bioinformatica_data/bioinformatica2018/assembly/tara_f135_full_megahit.fasta ./
Run Salmon
Build an index for the metatranscriptome:
salmon index --index tara135 --transcripts tara_f135_full_megahit.fasta
Next, link in the QC reads (produced during quality trimming):
ln -s ../trim/*.qc.fq.gz ./
Then, run the salmon mapping:
for sample in *1.qc.fq.gz
do
base=$(basename $sample _1.qc.fq.gz)
echo $base
salmon quant -i tara135 -p 2 -l A -1 ${base}_1.qc.fq.gz -2 ${base}_2.qc.fq.gz -o ${base}_quant
done
This will create a bunch of directories named something like TARA_135_SRF_5-20_rep1_quant, each containing a bunch of files. Take a look at what files there are:
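For example, to list the contents of each quant directory (the exact names depend on your read file names):
ls *_quant/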
You should see:
aux_info
cmd_info.json
lib_format_counts.json
libParams
logs
quant.sf
The two most interesting files are quant.sf, which contains the counts, and salmon_quant.log (in the logs directory), which contains a log from the salmon run.
Looking at count data
The quant.sf files contain the relevant information about contig expression. Take a look:
head -20 TARA_135_SRF_5-20_rep1_1m_quant/quant.sf
You should see output that looks like this:
Name Length EffectiveLength TPM NumReads
k119_5 212 63.757 0.000000 0.000
k119_10 231 75.730 0.000000 0.000
k119_14 261 97.683 0.000000 0.000
k119_16 301 130.690 11.736728 1.000
k119_18 302 131.560 0.000000 0.000
k119_21 203 58.357 0.000000 0.000
k119_22 308 136.790 0.000000 0.000
The first column contains the transcript names, and the fifth column (NumReads) contains the "raw" counts, which is what many differential expression programs need.
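Since quant.sf is a tab-separated file, you can pull out just the transcript names and raw counts with cut, e.g.:
cut -f1,5 TARA_135_SRF_5-20_rep1_1m_quant/quant.sf | head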
Assess Mapping rate
Okay, we got some numbers for how much the transcripts were expressed, but how well did the reads map overall? Let's go look at one of the log files to see.
less TARA_135_SRF_5-20_rep1_1m_quant/logs/salmon_quant.log
How well did the reads map to the metatranscriptome?
Let's see how the rest of the files mapped. We can look at the mapping rate in each log file by executing the following:
grep Mapping *quant/logs/*
grep is a handy program that finds and prints lines in files that match a pattern. In this case, the pattern we're searching for is the word 'Mapping', and we're searching in any directory whose name ends in quant and has a logs directory with at least one file in it.
How do the mapping rates compare? What does this tell us about our metatranscriptome?
How well does quantification capture sample distances?
We made an MDS plot of our sourmash compare results yesterday. With our salmon quant.sf files, we can also make an MDS plot and see how different the two are. This will demonstrate how well mapping to our assembly captures the information in our reads.
We already ran the code to do this, but if you want to see what it looks like, you can find it here. It relies on the quant.sf files output by Salmon.
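If you'd like to assemble the input for such a plot yourself, here is a rough bash sketch that collects the TPM column (column 4 of quant.sf) from each sample into one table. It assumes every quant.sf lists the contigs in the same order, which holds when all samples were quantified against the same index:
# start the table with the contig names from one sample
cut -f1 TARA_135_SRF_5-20_rep1_1m_quant/quant.sf > tpm_matrix.tsv
# append the TPM column from every sample (the header row will just repeat 'TPM')
for d in *_quant
do
  paste tpm_matrix.tsv <(cut -f4 $d/quant.sf) > tmp && mv tmp tpm_matrix.tsv
done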
Look how different they are! To be fair, we only mapped 1 million reads back to the full transcriptome, but this is still a good test.
What might this say about our samples?
Another way to assess read mapping
Transrate actually has a read assessment mode that uses salmon to "align" reads to the transcriptome and generates some metrics on read mapping. We won't run this today, but you could run it via:
transrate --assembly=tara135_SRF_megahit.fasta --threads=2 --left=*_1.qc.fq.gz --right=*_2.qc.fq.gz --output=${PROJECT}/evaluation/tara135_SRF_transrate
Other useful tutorials and references
- https://github.com/ngs-docs/2015-nov-adv-rna/blob/master/salmon.rst
- http://angus.readthedocs.io/en/2016/rob_quant/tut.html
- https://2016-aug-nonmodel-rnaseq.readthedocs.io/en/latest/quantification.html