post-assembly-intro
Tutorial for common steps post denovo genome assembly
Github URL: https://github.com/chrishah/post-assembly-intro
Disclaimer To follow the demo and make the most of it, it helps if you have some basic skills with running software tools and manipulating files using the Unix shell command line. It assumes you have Singularity or Docker installed on your computer (tested with Docker version 18.09.7, build 2d0083d; on Ubuntu 18.04).
Introduction
Once you have assembled a genome you usually want to assess the quality based on certain metrics, such as contiguity, completeness, perhaps even evaluate if your assembly is contaminated with sequences from ‘non-target’ organisms.
In the following I’ll demonstrate a few common steps. The data required for these comes with the repository - it’s in the data/ directory. Ideally you would start by cloning/downloading this repository, e.g. using git, like so:
(user@host)-$ git clone https://github.com/chrishah/post-assembly-intro.git
Which will get you a directory that you can move into and follow along with the exercise within it.
(user@host)-$ cd post-assembly-intro
Checkout what we have in the data/ directory:
(user@host)-$ ls data
From the file names/extensions you might already have a good idea about what most of these files may be (that’s the intention). I’ll get to the details in little while.
Post Assembly
1.) Assess assembly contiguity
At this stage you have probably heard about the most common assembly stats for describing contiguity - N50, etc.
Quast is a very convenient tool for calculating these stats.
We have a toy genome assembly in your data/ directory - let’s run quast on it.
(user@host)-$ singularity exec docker://reslp/quast:5.0.2 \
quast.py data/genome_assembly.fasta
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in reslp/quast:5.0.2 \
quast.py data/genome_assembly.fasta
This produces a directory quast_results containing the stats. You can have a quick look at those in a simple text file.
(user@host)-$ cat quast_results/latest/report.txt
Or check out the html version of the report (quast_results/latest/report.html) which you can open in your web browser. If you are working on a remote server, you’ll have to download the html to your local computer first.
Quast gives a whole lot of other useful things but for these I refer you to the quast homepage.
2.) Assess assembly completeness
The idea here is that there exists a set of conserved genes that are expected to be present in the genomes of most organisms. Since they are conserved they should be reasonably easy to identify using some form of similarity search to reference sequences for these genes. The percentage of these conserved genes identified in the genome you have assembled is then used as a proxy for the overall completeness of the gene space in the genome.
There are currently two tools available:
The latter is unfortunately not maintained any more (some history), but it can still be used if you can get it installed on your system. I have made a docker container for it, since I am planning to keep using it - see here. Anyway, BUSCO is the ‘new kid’ (well, not so new any more) and works also very well.
2a.) BUSCO
Now, let’s run BUSCO (will take about 15 minutes):
(user@host)-$ singularity exec docker://ezlabgva/busco:v5.2.1_cv1 \
busco -i data/genome_assembly.fasta \
-o busco -m genome -l eukaryota \
-c 2 -f
using Docker (click text, if hidden)
(user@host)-$ docker run --rm -v $(pwd):/in -w /in ezlabgva/busco:v5.2.1_cv1 \
busco -i data/genome_assembly.fasta \
-o busco -m genome -l eukaryota \
-c 2 -f
While it is running, feel free to open a new terminal window and already start exploring the outputs that we have precomputed for you in data/outputs/busco/.
If you ran BUSCO as above it will have created one directory called busco (because you said -o busco above). Note that if you run it a second time, it will complain and
A few words on the parameters used:
-i- input fasta file-o- a string that will determine the name of the output directory BUSCO will write to-m- run mode ‘genome’; there also is ‘transcriptome’ and ‘protein’-l- BUSCO provides sets of reference genes at particular taxonomic levels - see here for a full list.-c- use this number of CPUs-f- force overwrite results from previous run - if you had used the same name before
A detailed explanation of the parameters can be found on the BUSCO webpage and we’re also using BUSCO as part of different training sessions (for example this or this).
Usually, the most relevant files are:
busco/run_eukaryota_odb10/short_summary.txtbusco/run_eukaryota_odb10/full_table.tsv
and fasta files in the directory:
busco/run_eukaryota_odb10/busco_sequences/
If you want to know which reference sets are available busco also has an option to list those (note how I use singularity to call it now):
(user@host)-$ singularity exec docker://ezlabgva/busco:v5.3.2_cv1 busco --list-datasets
2b.) compleasm
A very recent addition to these kind of evaluation tools is compleasm and it claims to be ‘a faster and more accurate reimplementation of BUSCO’. Let’s see. In this case the developers already provide us with a Docker image. As discussed (hopefully) this could be used via Docker or Singularity/Apptainer.
So, we can either run it via Singularity:
(user@host)-$ singularity exec docker://huangnengcsu/compleasm:v0.2.6 \
compleasm run \
-a data/genome_assembly.fasta -l eukaryota -o compleasm -t 2
.. or via Docker:
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in huangnengcsu/compleasm:v0.2.6 \
compleasm run \
-a /in/data/genome_assembly.fasta -l eukaryota -o compleasm -t 2
While this is running (you’d need to open a new terminal window to do this), or in case you don’t want to run at this stage, we can look at the outputs that compleasm produces. Example results ship with the repository:
(user@host)-$ ls data/outputs/compleasm/
2c.) CEGMA
CEGMA, once installed, or containerized, is simple to run (it has a lot of options that you can explore in your own time) - takes a while though.
Caution
The next step (cegma) will run for about an hour, so if you are in a rush, you can also skip this and look at the output that we have deposited with the repo (see below).
(user@host)-$ singularity exec docker://chrishah/cegma:2.5 \
cegma --threads 2 -g data/genome_assembly.fasta
using Docker (click text, if hidden)
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in chrishah/cegma:2.5 \
cegma --threads 2 -g data/genome_assembly.fasta
While it is running we can skip to the next part and talk about mapping reads to genomes.
Once it’s done the thing you want to be looking at is the CEGMA report in output.completeness_report. A gff file with the genes cegma has predicted can be found at output.cegma.gff.
Note that if for some reason you want to skip running CEGMA we have an example output deposited for you as part of this repo at:
data/outputs/cegma/output.completeness_reportdata/outputs/cegma/output.cegma.gff
Let’s have a look at the completeness report:
(user@host)-$ head -n 30 data/outputs/cegma/output.completeness_report #look at the precomputed results
(user@host)-$ head -n 30 output.completeness_report #if you ran yourself
3.) Mapping reads to genomes
Read mapping is covered by many online tutorials, so I’ll just show you how it could be done with a tool called bowtie-2.
First index your genome file.
(user@host)-$ singularity exec docker://reslp/bowtie2:2.3.5 \
bowtie2-build data/genome_assembly.fasta my_genome.index -q
using Docker (click text, if hidden)
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in reslp/bowtie2:2.3.5 \
bowtie2-build data/genome_assembly.fasta my_genome.index -q
Check out which new files have been created.
(user@host)-$ ls -hlrt
Then, map the reads to the indexed genome (this will take a minute or so with nothing happening apparently on your screen - be patient).
(user@host)-$ singularity exec docker://reslp/bowtie2:2.3.5 \
bowtie2 -1 data/reads.1.fastq.gz -2 data/reads.2.fastq.gz \
--threads 2 -q --phred33 --fr -x my_genome.index -S my_mapped_reads.sam
using Docker (click text, if hidden)
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in reslp/bowtie2:2.3.5 \
bowtie2 -1 data/reads.1.fastq.gz -2 data/reads.2.fastq.gz \
--threads 2 -q --phred33 --fr -x my_genome.index -S my_mapped_reads.sam
This has produced a file called my_mapped_reads.sam. This is simple text file formatted in sam format, that contains information on where in the genome certain reads mapped (if at all).
You can look into the file, if you dare.. - just the first 1000 lines.
(user@host)-$ head -n 1000 my_mapped_reads.sam
Since SAM is just a text file and for large amounts of data these files may get very big the developers have established a binary (non-human readable) version of SAM, which is called BAM.
The next step will be to convert the SAM to a BAM file.
(user@host)-$ singularity exec docker://reslp/samtools:1.9 \
samtools view -bS my_mapped_reads.sam -o my_mapped_reads.bam -@ 2
using Docker (click text, if hidden)
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in reslp/samtools:1.9 \
samtools view -bS my_mapped_reads.sam -o my_mapped_reads.bam -@ 2
Check out the size of the newest file my_mapped_reads.bam as compared to the original SAM file. Note that we have not lost any information - we have just compressed the data.
(user@host)-$ ls -hlrt
Two more steps that are usually being done are sorting and indexing the bam file - this is the convention, and what most downstream tools expect.
(user@host)-$ singularity exec docker://reslp/samtools:1.9 \
samtools sort -o my_mapped_reads.sorted.bam my_mapped_reads.bam -@ 2
(user@host)-$ singularity exec docker://reslp/samtools:1.9 \
samtools index my_mapped_reads.sorted.bam -@ 2
using Docker (click text, if hidden)
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in reslp/samtools:1.9 \
samtools sort -o my_mapped_reads.sorted.bam my_mapped_reads.bam -@ 2
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in reslp/samtools:1.9 \
samtools index my_mapped_reads.sorted.bam -@ 2
A common step that I want to at least mention is the removal of duplicates. Picard offers a good option there.
(user@host)-$ singularity exec docker://broadinstitute/picard:2.20.6 \
java -jar /usr/picard/picard.jar MarkDuplicates \
INPUT=my_mapped_reads.sorted.bam OUTPUT=my_mapped_reads.sorted.duprmvd.bam \
METRICS_FILE=my_mapped_reads.sorted.duprmvd.metrics REMOVE_DUPLICATES=true \
ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
using Docker (click text, if hidden)
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in broadinstitute/picard:2.20.6 \
java -jar /usr/picard/picard.jar MarkDuplicates \
INPUT=my_mapped_reads.sorted.bam OUTPUT=my_mapped_reads.sorted.duprmvd.bam \
METRICS_FILE=my_mapped_reads.sorted.duprmvd.metrics REMOVE_DUPLICATES=true \
ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
I encourage you to inspect the assembly and reads mapping to it visually. A possible tool is the Integrative Genomics Viewer igv. You can install it at some point, but for now we’re going to use through a webapp - go here. Another stand alone program you can explore later is Tablet.
First we need to index our reference genome. Index the genome for viewing.
(user@host)-$ singularity exec docker://reslp/samtools:1.9 \
samtools faidx data/genome_assembly.fasta
using Docker (click text, if hidden)
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in reslp/samtools:1.9 \
samtools faidx data/genome_assembly.fasta
If you followed until here you’re going to need the following files downloaded locally:
data/genome_assembly.fasta(genome fasta; came with the repo)data/genome_assembly.fasta.fai(index produced in the last step above)my_mapped_reads.sorted.bam(sorted bam file from above) - alternative:data/outputs/read_mapping/my_mapped_reads.sorted.bamif you haven’t run yourselfmy_mapped_reads.sorted.bam.bai(index of sorted bam file from above) - alternativedata/outputs/read_mapping/my_mapped_reads.sorted.bam.baiif you haven’t run yourself
In the web app, go Genome->Local File->, then make sure to select both files, so genome_assembly.fasta together with genome_assembly.fasta.fai.
Now, to get the reads visible, go to Tracks->Local File and select my_mapped_reads.sorted.bam and my_mapped_reads.sorted.bam.bai again at the same time before clicking Open.
ATTENTION
Note that most of the contigs will not have reads mapping to it, because this is a reduced set of reads. Please use the search function in IGV to find
NODE_19055_length_5183_cov_61.240306for exploration.
Optional
IGV allows to also load GFF files, containing information on genome annotation. If you want to try that you can download the file
data/outputs/cegma/output.cegma.gffand add it as another track in IGV. If you do you’ll see that CEGMA predicted a gene on the contigNODE_19055_length_5183_cov_61.240306.
Enjoy!
4.) Blobtools
A nice tool for assessing contamination in your genome assembly is blobtools. It summarizes aggregate properties of the assembly (GC content, coverage of each contig/scaffold) which sometimes reveals interesting patterns in assemblies.
The basic things you need are
an assembly - fasta file
information about coverage
Tip
The most important outputs generated by the steps below are also deposited with the repository as backup here: data/outputs/blobtools/ - in case you’ll want to start exploring outputs right away, rather than actually running the commands below.
Blobtools can extract coverage information from bam files (gladly we made one above). It also can parse the needed information directly from the fasta headers, if you have used certain assemblers, e.g. Platanus or SPAdes. Our assembly was done with SPADes, so we can try that.
Blobtools needs to be run in three steps - do consult the manual on the Blobtools webpage to get more info on what the individual steps are doing.
(user@host)-$ singularity exec docker://chrishah/blobtools:v1.1.1 \
blobtools create -i data/genome_assembly.fasta -y spades -o blobtools_spades
(user@host)-$ singularity exec docker://chrishah/blobtools:v1.1.1 \
blobtools view -i blobtools_spades.blobDB.json
(user@host)-$ singularity exec docker://chrishah/blobtools:v1.1.1 \
blobtools plot -i blobtools_spades.blobDB.json
using Docker (click text, if hidden)
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in chrishah/blobtools:v1.1.1 \
blobtools create -i data/genome_assembly.fasta -y spades -o blobtools_spades
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in chrishah/blobtools:v1.1.1 \
blobtools view -i blobtools_spades.blobDB.json
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in chrishah/blobtools:v1.1.1 \
blobtools plot -i blobtools_spades.blobDB.json
The file you want to look at first of all is: blobtools_spades.blobDB.json.bestsum.phylum.p8.span.100.blobplot.spades.png, but there is lots more to explore on your own.
Now, let’s assume you hadn’t used SPAdes as your assembler, you can still use blobtools, but in this case you need to give the coverage information in a different way, e.g. a bam file.
(user@host)-$ singularity exec docker://chrishah/blobtools:v1.1.1 \
blobtools create -i data/genome_assembly.fasta -b my_mapped_reads.sorted.bam -o blobtools_bam
(user@host)-$ singularity exec docker://chrishah/blobtools:v1.1.1 \
blobtools view -i blobtools_bam.blobDB.json
(user@host)-$ singularity exec docker://chrishah/blobtools:v1.1.1 \
blobtools plot -i blobtools_bam.blobDB.json
using Docker (click text, if hidden)
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in chrishah/blobtools:v1.1.1 \
blobtools create -i data/genome_assembly.fasta -b my_mapped_reads.sorted.bam -o blobtools_bam
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in chrishah/blobtools:v1.1.1 \
blobtools view -i blobtools_bam.blobDB.json
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in chrishah/blobtools:v1.1.1 \
blobtools plot -i blobtools_bam.blobDB.json
Checkout blobtools_bam.blobDB.json.bestsum.phylum.p8.span.100.blobplot.bam0.png - there shouldn’t be much difference to the previous result. The second result is derived based on actual mapping results, so actually mapped reads and since I have not given you the full readset there is some differences with respect to the y-axis (coverage). The main purpose of this exercise is to demonstrate how to use blobtools if the genome you are working with had not been made with one of the supported assemblers.
Finally, one of the nicest features of blobtools is that the visualizations can be taxonomically annotated - see here. What you’ll need is a so-called ‘hits files’. This is essentially a text files obtained via comparing the assembly against a reference database using blast or other tools - see here.
Caution
The drop down below demonstrates how a hits file could be obtained using blast. Don’t do this as part of a course unless you’re asked to (this will download about 250 GB worth of data)
Create a hitsfile using blast (click text to see)
The following details how you could download the entirety of NCBI’s nt (nucleotide) database. Then compare your assembly against it using BLAST and use the info you get to annotate you blobs.
Attention
Do not do this as part of the course (if you are in one right now), unless you are specifically asked. The next steps will involve downloading (currently) some 200-300 GB worth of data and subsequently a BLAST search that might take several days, if not parallelized in a smart way.
So, for completeness sake, you could download the entire nt database and decompress it.
(user@host)-$ mkdir db
(user@host)-$ cd db
(user@host)-$ wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.*.tar.gz"
(user@host)-$ for a in nt.*.tar.gz; do tar xzf $a; done
(user@host)-$ cd ..
Then, you would use BLAST to compare your genome against the database - the blobtools people suggest a certain way to use blast for this. You could change the params however you see fit of course, but the main thing is that you get the output in the right format.
(user@host)-$ ASSEMBLY=data/genome_assembly.fasta
(user@host)-$ DB=db/nt
(user@host)-$ blastn \
-query $ASSEMBLY \
-db $DB \
-outfmt "6 qseqid staxids bitscore std" \
-max_target_seqs 1 \
-max_hsps 1 \
-evalue 1e-25 \
-num_threads 10 \
-out blastn.fmt6.out.txt
continue here, as part of the course
An example hits file as obtained through the blast search above comes with the repository - check it out.
(user@host)-$ cat data/blastn.fmt6.out.txt
Now, try to give this additional info to blobtools. Last thing you need is a so called taxdump, this is a textfile that contains the taxonomic information for all entries in the Genbank databases. You can download it and unpack.
(user@host)-$ wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
(user@host)-$ tar xzfv taxdump.tar.gz
From this you get a file nodes.dmp and a names.dmp file - these you need in the next step.
(user@host)-$ singularity exec docker://chrishah/blobtools:v1.1.1 \
blobtools create -i data/genome_assembly.fasta -y spades \
--nodes nodes.dmp --names names.dmp --hitsfile data/blastn.fmt6.out.txt -o blobtools_tax
(user@host)-$ singularity exec docker://chrishah/blobtools:v1.1.1 \
blobtools view -i blobtools_tax.blobDB.json
(user@host)-$ singularity exec docker://chrishah/blobtools:v1.1.1 \
blobtools plot -i blobtools_tax.blobDB.json
using Docker (click text, if hidden)
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in chrishah/blobtools:v1.1.1 \
blobtools create -i data/genome_assembly.fasta -y spades \
--nodes nodes.dmp --names names.dmp --hitsfile data/blastn.fmt6.out.txt -o blobtools_tax
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in chrishah/blobtools:v1.1.1 \
blobtools view -i blobtools_tax.blobDB.json
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in chrishah/blobtools:v1.1.1 \
blobtools plot -i blobtools_tax.blobDB.json
What you want to look at initially is:
blobtools_tax.blobDB.json.bestsum.phylum.p8.span.100.blobplot.bam0.png
Nice, no?
Now what to do with this info?
You could for example take all reads that contributed to contigs that were classified as ‘Chordata’, and reassemble them, if that happens to be your target. Blobtools brings all the information it gathered throughout the process together in text file. Have a look.
(user@host)-$ less blobtools_tax.blobDB.table.txt
We could get all contig/scaffold ids that were classified as Chordata and write this list to a new file.
(user@host)-$ grep "Chordata" blobtools_tax.blobDB.table.txt | cut -f 1 > Chordata.list.txt
Then, use another tool from the blobtools suite (see here) to extract the relevant reads from the original bam file. Note that if you did not run all commands above you can still try the below by adjusting the paths to the input files to point to the precomputed results in data/outputs/read_mapping/my_mapped_reads.sorted.bam
(user@host)-$ singularity exec docker://chrishah/blobtools:v1.1.1 \
blobtools bamfilter -b my_mapped_reads.sorted.bam -i Chordata.list.txt \
--read_format fq --noninterleaved -o reads_Chordata
using Docker (click text, if hidden)
(user@host)-$ docker run --rm -u $(id -u):$(id -g) -v $(pwd):/in -w /in chrishah/blobtools:v1.1.1 \
blobtools bamfilter -b my_mapped_reads.sorted.bam -i Chordata.list.txt \
--read_format fq --noninterleaved -o reads_Chordata
Thanks for joining us today!
If you have any questions, comments, feedback (good OR bad), let me know!
What do you think?
Contact
Christoph Hahn - christoph.hahn@uni-graz.at