Comparing Metatranscriptome samples
Often times, the goal of a (meta)transcriptome sequencing project is to compare the differences in functional profiles between samples. One way to do this is with differential expression. To do differential expression, we de novo assemble a metatranscriptome, quantify the number of reads that "map" back to the metatranscriptome, and use those counts to find differences. However, assembly can be a long process, and often we want to get to know our data a bit before we launch into that process.
We can use k-mer profiles of the reads to compare samples using sourmash compare
.
First, let's make a directory that we will be working in:
mkdir -p ${PROJECT}/sourmash-compare
cd ${PROJECT}/sourmash-compare
Let's link in our trimmed reads.
ln -s $PROJECT/trim/*1.qc.fq.gz .
ln -s $PROJECT/trim/*2.qc.fq.gz .
Now we can calculate signatures for each of the files. This will take 5 or 10 minutes to run
for infile in *1.qc.fq.gz
do
j=$(basename ${infile} 1.qc.fq.gz)
sourmash compute -k 31 --scaled 10000 --track-abundance --merge ${j} -o ${j}.sig ${j}2.qc.fq.gz ${infile}
done
Using these signatures, we can compare our samples.
sourmash compare -k 31 -o tara-trimmed.comp *sig
Now let's plot! Sourmash has a built in plot utility that we can take advantage of. The output is a heatmap.
sourmash plot --labels tara-trimmed.comp
This produces a file, tara-trimmed.comp.matrix.png
, that contains a
similarity matrix.
You can see the heatmap here
We can also use the output of sourmash compare to calculate an MDS plot. Let's
rerun sourmash compare
, this time saving the output in csv format.
sourmash compare -k 31 --csv tara-trimmed.comp.csv *sig
We can use this output to make a Multidimensional Scaling plot. MDS plots are commonly used in transcriptome workflows to visualize distance between samples. Here the strength is we used all of our reads to calculate these distances.
To make an MDS plot, run:
ln -s /LUSTRE/bioinformatica_data/bioinformatica2018/scripts/mds_plot.R .
Rscript mds_plot.R tara-trimmed.comp.csv tara-trimmed-comp-mds.pdf
The script source is here if you are interested!
This outputs a file tara-trimmed-comp-mds.pdf
. You can see what
that visualization looks like
here.
We see that our samples cluster by site (TARA_135
vs TARA_136
) and
then by depth (SRF for surface vs DCM for deep cholorophyl maximum).
Throughout this lesson we have been working with raw reads.
Raw reads contain a lot of errors, and these errors are included in the
signatures. Next we will learn to k-mer trim our reads, and then re-run
compare
to see if makes a difference!