Assembling with PLASS
The basic idea with any (meta)transcriptome assembly is you feed in your reads and you get out a bunch of contigs that represent transcripts, or stretches of RNA present in the reads. You run a transcriptome assembly program using the adapter & k-mer trimmed reads as input and get out a pile of assembled RNA. We used MEGAHIT earlier to build this sort of assembly.
MEGAHIT and many other assemblers work in nucleotide space. This means that they find direct overlaps between the As, Ts, Cs, and Gs in the reads and use information about those overlaps to make contigs. Although this is a powerful method, it often fails when:
- There is sequencing errors
- There are repetitive regions
- There is strain variation
When assembly fails, contigs either break or don't assemble at all.
Given that nucleotide sequences are far more variable than protein sequences (third base pair wobble in strain variation in particular), assembling in amino acid space can overcome a lot of the difficulties encountered when assembling in nucleotide space.
PLASS is a new assembler that assembles in amino acid space. Unlike MEGAHIT, it does not have built in error correction and so it is best to adapter and k-mer trim the reads before using it. It sometimes performs better than nucleotide assemblers and so is good to test on samples to see what type of improvement it can give.
The contigs output by PLASS will represent protein sequences that come from the eukaryotic organisms found in each environmental sample.
Install PLASS
We already installed plass for you, but here's the installation command for future reference.
conda install plass
Link in the trimmed data
We will be using the same set of TARA oceans mRNAseq reads that we trimmed in the last lesson from Alberti et al., 2017.
Create a new folder assembly
to work in
cd $PROJECT
mkdir -p assembly_plass
cd assembly_plass
Link the khmer-trimmed data we prepared earlier in the newly created folder:
ln -fs ${PROJECT}/trim/*.khmer.fq.gz .
ls
Plass needs separate files for _1
and _2
, so let's split the files
for filename in *.khmer.fq.gz
do
#Use the program basename to remove _1.qc.fq.gz to generate the base
base=$(basename $filename .khmer.fq.gz)
echo $base
#Run khmer trimming
split-paired-reads.py --gzip ${base}.khmer.fq.gz -1 ${base}_1.khmer.fq.gz -2 ${base}_2.khmer.fq.gz
done
Run the assembler
Let's run an assembly:
plass assemble TARA_135_SRF_5-20_rep1_1m_1.khmer.fq.gz TARA_135_SRF_5-20_rep2_1m_1.khmer.fq.gz TARA_135_SRF_5-20_rep1_1m_2.khmer.fq.gz TARA_135_SRF_5-20_rep2_1m_2.khmer.fq.gz tara135_srf_plass.fasta --threads 2
PLASS is not as fast as megahit, but not as slow as some other assemblers.
The output assembly will be tara135_srf_plass.fasta
.
Looking at the assembly
Let's look at the beginning
head tara135_srf_plass.fasta
How is this assembly different from the megahit
transcripts?
Mapping to the PLASS Assembly
Set up the workspace
We'll be using Paladin
to map back to the plass assembly.
Paladin
solves the problem of mapping nucleotide sequences
back to amino acid sequences.
Let's make a directory to work in
cd $PROJECT
mkdir -p paladin_mapping
cd paladin_mapping
Link in the qc
trimmed reads
ln -s ${PROJECT}/trim/*qc.fq.gz ./
Index the Assembly:
paladin index -r3 tara135_srf_plass.fasta
Run the mapping:
paladin align -f 125 -t 2 tara135_srf_plass.fasta TARA_135_SRF_5-20_rep1_1m_1.khmer.fq.gz
This will output a SAM file with the names of the reads that mapped, as well as the amino acid sequences that have been translated into amino acid space.