Count Transcriptomes

Counting the number of transcriptomes in a metatranscriptome

In metagenomics, we use binning to approximate the number of genomes (or species) in a sample. Although we can apply taxonomic classification methods like sourmash to metatranscriptomic samples, these methods only work reliably when sequences already exist for the organism of interest. This is still a relatively rare occurrence in environmental metatranscriptomics.

We will make use of other data in our samples to approximate the number of species present. Although our samples were poly-A selected, inevitably a some ribosomal RNA is always sequenced given its abundance in samples. We can estimate the number of species in our metatranscriptome by counting the unique number of ribosomal protein sequences we see. There are many ribosomal proteins one could use for this analysis, and we provide a table of these at the bottom of this lesson. We will use one sequence here, however if we were to use more we would average the number of unique sequences we detect between the total sequences to estimate the number of transcriptomes in our sample.

We will first search for our protein in the amino acid sequences derived from our de novo assembly of our metatranscriptome. We will use Pfam domains and a tool called HMMER to help us locate all of the matching sequences. Then, we will extract our matches and filter those that are highly similar.

Below is a picture representation of the HMM profile we will be working with.

rpsG seq logo

Because we are only searching in our assembly, this method only captures the number of transcriptomes that assembled.

First let's make a new directory and link in our annotation file.

mkdir -p ${PROJECT}/count-transcriptomes
cd ${PROJECT}/count-transcriptomes

Then let's link in the peptide sequences (ORFs) predicted from our full transcriptome assembly.

ln -s /LUSTRE/bioinformatica_data/bioinformatica2018/assembly/tara_f135_megahit_full.transdecoder.pep .

Next let's link in an alignment of a Pfam domain for RpsG. In the future, you can download this alignment here

ln -s /LUSTRE/bioinformatica_data/bioinformatica2018/assembly/PF00177-full.sto .

Next we'll build a HMM profile of the Pfam domain.

hmmbuild PF00177-full.hmm PF00177-full.sto
hmmpress PF00177-full.hmm

We then use the HMM profile to search the proteins from our assembly

hmmscan -T 100 --tblout PF00177-full-tbl.txt --domtblout PF00177-full-domtbl.txt PF00177-full.hmm tara_f135_megahit_full.transdecoder.pep

Let's take a look at one of the files output by this search

less -S PF00177-full-tbl.txt

The third column contains the names of our protein sequences that matched these domains. We can use those names to extract our matches from our assembly

# Grab the names
cat PF00177-full-tbl.txt | Rscript -e 'writeLines(noquote(read.table("stdin", stringsAsFactors = F)$V3))' > PF00177-names.txt

# extract the matches
ln -s /LUSTRE/bioinformatica_data/bioinformatica2018/extract-hmmscan-matches.py .
python extract-hmmscan-matches.py PF00177-names.txt tara_f135_megahit_full.transdecoder.pep > PF00177.faa

Let's count the number of sequences that matched

grep ">" PF00177.faa | wc -l

Some matches are quite similar to each other. Let's cluster our sequences at 97% similarity and see how this changes the number of unique proteins we detect.

cd-hit -i PF00177.faa -o PF00177-c97.faa -c .97

Let's see how much clustering at 97% reduced our estimate!

grep ">" PF00177-c97.faa | wc -l

From this, we estimate there are 82 transcriptomes assembled from our metatranscriptome.

Given that this protein is highly conserved, we could BLAST these sequences against NCBI nr/nt database to see if there are any matches. We could then compare it to our sourmash results and see how much overlap vs. how much we are missing from each!

This technique can also be used to find any Pfam domain of interest. For instance, if you are interested in photosynthesis, you could extract all photosynthetic proteins by searching with the Pfam domains.

Other proteins to use for quantifcation

We used three proteins to quantify the number of transcriptomes in our sample. There are more that can be used, and they are list in the table below. They are originally derived from Carradec et al. 2018.

name	COG	PFAM
RpsG	COG0049	PF00177
RpsB	COG0052	PF00318
RplK	COG0080	PF03946
RplA	COG0081	PF00687
RplC	COG0087	PF00297
RplD	COG0088	PF00573
RplV	COG0091	PF00237
RpsC	COG0092	PF00189
RplN	COG0093	PF01929
RplE	COG0094	PF00281
RpsH	COG0096	PF00410
RplF	COG0097	NA
RpsE	COG0098	PF03719, PF00333
RpsM	COG0099	PF00416
RpsK	COG0100	PF00411
RplM	COG0102	PF00572
RpsI	COG0103	PF00380
RpsO	COG0184	NA
RpsS	COG0185	PF00203
RpsQ	COG0186	PF00366
GyrB	COG0187	PF00204
GyrA	COG0188	PF00521
RimK	COG0189	PF08443
FolD	COG0190	PF00763, PF02882
Fba	COG0191	NA
MetK	COG0192	PF01941
Pth	COG0193	PF01195
Gmk	COG0194	PF00625
RplO	COG0200	PF00827
RplR	COG0256	NA
RpsD	COG0522	PF00163