Functional annotation databases

This repository contains the materials for the “Databases in ecology and comparative genomics course”: day 3. It deals with functional annotation and relevant resources to attribute function and summarize the metabolism inferred from genomic and transcriptomic experiments

Github URL:https://github.com/claudioametrano/functional_annotation_db

Software required (on remote server)

ncbi datasets
Prokka
Eggnog

Software required (locally)

a fasta file reader (MEGA, Aliview, Jalview, Bioedit …)
Terminal (Linux and Mac) or Mobaxterm (Win) to ssh into the remote server and a client (e.g. Filezilla) for easy file transfer

BEFORE WE START

$ ssh username@l2.gsc1.uni-graz.at

$ srun --mem=4G --ntasks=4 --cpus-per-task=1 --time=10:00:00 --pty bash

Download this repository:

$ git clone https://github.com/claudioametrano/functional_annotation_db.git

Rename the folder containing the results, so you won’t overwrite it running the analyses of this tutorial, and create a new results folder

$ cd functional_annotation_db/
$ mv results results_backup  
$ mkdir results

Introduction

Genomic workflow: a recap

workflow from Jung et al. 2020modif.

In brief:

Sample -> DNA extraction
Library preparation
Sequencing/base-calling
Raw reads QC and filtering
De novo assembly (or mapping on reference)
Polishing and scaffolding
Structural annotation -> non-coding and coding region: gene prediction (ab initio + evidence-based)
functional annotation
Visualization and submission

What do we mean by genome annotation?

The term “genome annotation” is typically used to refer to two discrete bioinformatics tasks. The first of these is structural annotation, namely the prediction of the genomic intervals comprised of functional genome features: genes, transcripts associated with those genes, the exons comprising those transcripts, and for protein-coding transcripts their coding sequences (CDS) and untranslated regions (UTRs). The second is functional annotation, namely assigning gene symbols (names) and putative functions to the structurally identified gene models, in order to associate biological information to genes and corresponding proteins. We will focus here on functional annotation, displaying relevant database and software resources to assign amino-acid sequences to a putative function.

func. annot. from Del Angel et al., 2018

Some of the method are not that different from the one you applied in day 1 and 2 (e.g. BLAST), what makes the difference here is the completeness (in term of phylogenetic breadth) and accuracy of the annotations stored in databases, also in term of delimited orthogroups.

Database and tools for functional annotation

Many tool and database with annotated genes/genes families are available to perform functional annotation. Also many of these databases have their own integrated tool to perform annotation (e.g. EggNOG and EggNOG mapper) and web-servers for quick search, but:

They do not necessarily “speak the same language”: different gene identifiers, organization and hierarchy of the database and targeted feature (e.g. full coding sequence, domains)
They do not cover necessarily the same set of genes
Lack of annotation for a relevant fraction of genes outside core metabolism (in non-model organisms)

Despite the plethora of different accession number standards adopted by different databases, there is a good interchange among databases, which also report the accession and the classification of other relevant databases for the same protein.

Relevant resources

Database (tool)	Number entries	URL	Notes
UniProtKB	573 230 Swiss-Prot (reviewed); 252 188 522 TrEMBL proteins	uniprot	Sequence records split into manually-curated and automatically annotated sets; cross-links to nearly every other DB
InterPro (InterProScan)	≈ 48 k protein signatures	interpro	hmm profiles; GO terms present
Pfam	24 424 protein families	now hosted by Interpro	HMM families grouped into structural “clans”
Cluster Of Orthologs	5,050 COGs from 2,103 bacterial and 193 archaeal species	COG	mostly for Bacteria and Archea
Gene Ontology (e.g. AmiGO; BLAST2GO)	185 k controlled terms (BP/MF/CC)	GO	Standard vocabulary for functional enrichment; Structured ontology
KEGG (e.g. BlastKOALA)	27 663 K numbers; 580 pathway maps	kegg	Gold-standard metabolic & signaling network ;Genes grouped into KO orthologs; KO nodes populate manually drawn pathway graphs
BioCyc and MetaCyc	3 153 pathways; 19 020 reactions; 19 372 metabolites	metacyc	Metabolic pathways; metabolic modelling
eggNOG (eggNOG mapper)	17 032 907 orthologous groups across 12 535 genomes	eggnog	Hierarchical orthology clusters with functional tags
OrthoDB v12	162 M genes from 5 827 euks + 18 158 proks + 7 962 viruses	www.orthodb	Ortholog groups at multiple phylogenetic depths; Supplies lineage datasets for BUSCO
Protein Data Bank (PBD)	~240000 structures; ~1M computed structures	pdb	Since 1971 the reference for protein structure, and one of the oldest biological DB
AlphaFold (AlphaFold Server/API)	> 214 M predicted 3-D structures	alphafold	Structure-based; function clues for “hypothetical” proteins
Carbohydrate-Active enZYmes Database (CAZY)		Cazy	Specialized database for enzymes that catalyze reaction involving, oligo- and polysaccharides
Reactome	2 742 pathways (15 492 reactions)	reactome	Curated pathways; human

from here

1. Wandering through databases

Let’s try some of the annotation resources with this not yet annotated protein that could be one of the output of our structural annotation workflow

protein1 MKAISRVLIAMVAAIAALFTSTGTSHAGLDNELSLVDGQDRTLTVQQWDTFLNGVFPLDRNRLTREWFHSGRAKYIVAGPGADEFEGTLELGYQIGFPWSLGVGINFSYTTPNILIDDGDITAPPFGLNSVITPNLFPGVSISADLGNGPGIQEVATFSVDVSGAEGGVAVSNAHGTVTGAAGGVLLRPFARLIASTGDSVTTYGEPWNMN

Try to copy/paste the sequecne into:

UniProt -> Peptide search

BLAST for similar sequence within the DB
Functional profile by GO

Interpro

Protein family (Pfam), domains and their components( e.g. secondary structure beta-sandwich)

EggNOG -> Sequence search

Identify Orthologs in other taxa
deliver a taxonomic profile of taxa having that orthologous sequences
Functional profile that integrates GO, KEGG, and Domains (Pfam, SMART)

NOTE

WHAT IS A PROTEIN DOMAN: It is a distinct part of a protein’s overall chain that often carries out a specific function. Think of a domain as a modular “sub-unit”. It can fold > into a stable 3-D structure on its own. They’re frequently swapped or duplicated during evolution, giving rise to multi-domain proteins with new combinations of capabilities.

TASK 1

compare the 3-D structure showed in UniProt for the previously submitted protein, with the one generated by Alphafold server Questions

Do the reference structure (X-ray) match with the one you obtain running Alphafold?

How many monomers has the Uniprot 3D structure?

Can you try to re-run alphafold with a sequence that could approach the final quaternary structure of the oligomerized protein?

*** NOTE ***

X-ray protein crystallography: existed long before computational models were available, and it is still a fundamental aid for accurate protein structure. It is a technique by which it is possible to determine the three dimensional positions of each atom in a protein. Now over 100 years old, x-ray crystallography was first used to determine the three dimensional structures of inorganic materials, then small organic molecules, and finally macromolecules like DNA and proteins. full text here

2. How to summarize the overwhelming amount of information of functional annotation: pathway analysis and Gene Ontology (GO)

2.1 Pathway analysis

To understand how genes and molecules (gene products and metabolites) are networked to describe a known biological phenotype.

Reduces complexity of information (from thousands of genes to hundreds of pathways).
Pathway enrichment analysis, which generates a list of over-represented pathways in a gene list.

KEGG pathways: Kyoto Encycplopedia of Genes and Genomes

Collection of manually curated pathways maps, includes relation among: Genes, proteins (enzymes, …), Compounds (substrates), pathways (what pathways are connected to this particular gene?) Includes information about: Metabolism, Genetic information processing, Cellular processes (e.g. transport), Organismal systems (e.g. Immune system), Diseases, Drug development kegg

TCA cycle

MetaCyc and BioCyc

A collecion of computationally predicted (and curated) metabolic pathways and operons of a single organism (or multiple: MetaCyc)

Cellular overview summarizing the complete metabolic map
Metabolism comparison between organisms
Overlay with other omics data (e.g. transcriptomics, pathway enrichment)
Organisms classified according to the level of curation (3 tiers)
Integrated genome browser Cellular metabolism of S. cerevisiae

2.2 Gene Ontology

Genes can be analyzed at different levels:

Single gene (sequence, function)
Group of genes, possibly framed into pathways (KEGG, MetaCyc)
Hierarchically grouping genes using their features

What problems GO try to solve? Vast amount of biological data, different name/term for the same concepts (e.g. translation = protein synthesis) and difficult comparison among species -> need for standardized descriptive terms -> Gene Ontology.

GO: describes the functions of gene products, using a dictionary of allowed words, using standardized vocabulary terms (and codes) that are usable both by humans and computers. These terms are in relation each other in a directed acyclic graph and they are divided into three species-independent categories (ontologies):

Biological processes
Cellular components
Molecular function

GOgraph

3. Annotation: a case study using cyanobacteria

Why Cyanobacteria:

Compact genomes
Simpler gene structure
High metabolic diversity

Strain (NCBI RefSeq)	Lifestyle & trait highlights	Genome size
Synechocystis sp. PCC 6803* (GCF_000009725.1)	Facultative heterotroph, freshwater	3.6 Mb
Prochlorococcus marinus MED4 (GCF_000011465.1)	Ultra-oligotrophic, marine	1.66 Mb
Nostoc sp. PCC 7120 (GCF_000009705.1)	Filamentous; freshwater	7.2 Mb

3.1 Assembly download (Already available in `data/`)

$ singularity pull https://depot.galaxyproject.org/singularity/ncbi-datasets-cli:14.26.0
$ singularity exec ncbi-datasets-cli:14.26.0 datasets download genome accession GCF_000009725.1 GCF_000011465.1 GCF_000009705.1 --include genome

$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/725/GCF_000009725.1_ASM972v1/GCF_000009725.1_ASM972v1_genomic.fna.gz
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/465/GCF_000011465.1_ASM1146v1/GCF_000011465.1_ASM1146v1_genomic.fna.gz
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/705/GCF_000009705.1_ASM970v1/GCF_000009705.1_ASM970v1_genomic.fna.gz

3.2 Prokka: rapid prokaryotic genome annotation

Prokariotic genome assembly are the easiest, fastest to annotate, given their simpler gene structure and compact genomes. Prokka Github

The following image is rather big (~800 Mb; Prokka runs many software, such as Prodigal, therefore the image also contains many dependencies), if you have it already available elsewhere, copy it here to the main folder of the repository and skip the next command

$ singularity pull https://depot.galaxyproject.org/singularity/prokka:1.14.6--pl5321hdfd78af_5

$ singularity exec prokka\:1.14.6--pl5321hdfd78af_5 prokka --outdir results/Synechocystis_GCF_000009725.1 --prefix Synechocystis --cpu 1 ./data/GCF_000009725.1_ASM972v1_genomic.fna 

run the same for the other two samples (adjust names!) …or loop, as manual repeated steps are prone to mistakes! NO big deal with three samples though.

Runtime example: Synechocystis Prokka runtime: 12.32 minutes on 16 cores of Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz

3.2.1 Prokka output

To make sense of the results two things are always useful when you run complex pipelines

Check manual to see if output is described, spoiler: it is!
Follow the main steps (tools adopted) of the pipeline to understand what it does to produce such outputs, see the following task…

TASK2

Looking at Prokka stdout (printed to screen, or at one of the samples’s logfile) try to identify the what software the pipeline uses for different steps (structural annotation steps (CDS, tNRA, rRNA), functional annotation etc.). If there is some software you do not know, search for its manual or Github page to roughly understand its function within Prokka.

3.2.2. Functional annotation in Prokka

Open one of the report

$ less results/Synechocystis_GCF_000009725.1/Synechocystis.tsv 

Locus_tag: name of the feature in Prokka

ftype: type of feature

lenghth_bp

gene: gene name as convetional abbreviation. If multiple annotated they are followed by _1 _2 etc.

EC_number: (Enzyme Commission), classification according to Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB). here or here

For example: EC 1.1.1.1 → Alcohol dehydrogenase

Class 1 → Oxidoreductases (enzymes catalyzing oxidation-reduction reactions)
Subclass 1 → Acting on the CH-OH group of donors
Sub-subclass 1 → With NAD⁺ or NADP⁺ as acceptor
Serial number 1 → First enzyme in this category

COG: Cluster of Ortholog classification

product: description of the protein

Many sequences identified ad CDS are not annotated, but this is normal. Annotation can be improved if providing Prokka with more extensive references.

TASK 3

Not only proteins are annotated: in Prokka output identify annotated tRNA. How many in the 3 genome tested? What is their theoretical maxium number/How many did you expect?

3.3 Functional annotation with EggNOG mapper

Prokka provide some basic, mostly bacterial-oriented, functional annotation, let’s try to complete the annotation with the support of EggNOG mapper and its database. EggNOG-mapper is a tool for fast functional annotation of novel sequences, being based on orthology predictions it should permits a higher precision than traditional homology searches (i.e. BLAST searches), avoiding transferring annotations from paralogs.

How it works: input (genome, CDS or protein) -> (Optional) gene prediction (Prodigal) -> Search (diamond or mmseqs or hmmer) -> attribution to orthologous group -> Functional annotation (emapper) with various database (KEGG, GO, pfam) eggnog

Also this container is quite large, maybe use the copy in the shared folder (if any)

$ singularity pull https://depot.galaxyproject.org/singularity/eggnog-mapper:2.1.12--pyhdfd78af_0

Set up folder and input file (Prodigal/Prokka .faa fasta: ouptut of the structural annotation)

$ mkdir results/eggnog_annotation
$ cp ./results/Nostoc_GCF_000009705.1/Nostoc.faa ./results/eggnog_annotation/
$ cp ./results/Synechocystis_GCF_000009725.1/Synechocystis.faa ./results/eggnog_annotation/
$ cp ./results/Prochlorococcus_GCF_000011465.1/Prochlorococcus.faa ./results/eggnog_annotation/

…or use the convenient command find

find -type f -name *.faa -exec cp {} results/eggnog_annotation/ \;

List of file and folders you will need (Diamond): eggnog.taxa.db, eggnog.taxa.db.traverse.pkl, eggnog.db, eggnog_proteins.dmnd)

Download the appropriate database
Orthologs from specific lineages are available here , using taxid code (e.g. Cyanobacteria = 1117).

$ singularity exec --bind $(pwd):/usr/local/lib/python3.11/site-packages/data eggnog-mapper\:2.1.12--pyhdfd78af_0   download_eggnog_data.py -y

the folder /usr/local/… is linked to our working directory because the script need to put the database there (it will show up in our current directory). -y skips the interactive question answering yes to all, so we can run singularity exec effectively

Lauch the mapper

Caution

This command is memory intensive and can run for long

singularity exec –bind $(pwd):/usr/local/lib/python3.11/site-packages/data eggnog-mapper:2.1.12–pyhdfd78af_0 emapper.py –cpu 1 -i results/eggnog_annotation/Nostoc.faa –itype proteins –go_evidence all -o nostoc –output_dir results/eggnog_annotation/ –excel -d bact –tax_scope 1117 –override

eggnog.db is the previously downloaded database (in this case the Cyanobacteria orthologs)

EggNOG output:  emapper.annotations, emapper.hits, emapper.seed_orthologs
(optionally annotation also in .xlsx)
```bash
$ less -S results/eggnog_annotation/nostoc.emapper.annotations

3.4 Visualize and summarize results

Hierarchic output such as those from GO term functional annotation can be represented to perform visual comparisons, for example using Krona. Even if usually adopted for taxonomic data (also hierarchically organized), Krona plot can be adapted to show, for example GO frequencies

Krona plot visualization

Installed krona and its dependencies in a dedicated conda environment (It should already exist)

# create conda environment and install dependancies
$ conda create -n krona python=3.12.2
$ conda activate krona
$ conda install -c conda-forge python-wget=3.2
$ conda install -c bioconda krona=2.8.1 goatools=1.2.3
$ conda install -c conda-forge pandas=2.3.0 openpyxl=3.1.5

Check if the environment exists

$ conda info --envs

If “krona” conda environments is already there then…

$ conda activate krona

Get the GO term db

$ wget http://current.geneontology.org/ontology/go-basic.obo

Launch this script for each of the eggnog xlsx output files

$ python egg2krona_go.py ./results/eggnog_annotation/synechocystis.emapper.annotations.xlsx --obo go-basic.obo --out ./results/eggnog_annotation/krona_go_synecho.html

3.5 Use GO and pathways

TASK4

A naive example of GO term analysis

Parsing the annotation files:

A. Verify if some of these Cyanobacteria is capable Nitrogen fixation:

Search in QuickGO the GO term which identify “Nitrogen fixation”
What proteome, among the three Cyanobacteria has the highest frequency of genes that code for proteins involved in this Biological Process? Which of them is then a nitrogen fixer?
Observe how also the other two genomes has this GO term (with a much lower frequency)

B. Verify if all these Cyanobacteria have Phycobilisomes (protein complexes containing accessory pigments)

4 Enrichment analysis application: Transcriptomics

In addition to their descriptive use in comparative genomics (between different species/taxa), a common real-world use of these cellular function summaries is to identify differences between experimental conditions (for the same organism). The enrichment analysis of GO terms (or pathways) performed on differentially expressed genes (significantly Up or Down regulated) in transcriptomics studies, helps to summarize and reveal which parts of cellular metabolism responded to a specific condition (e.g. temperature changes, drug exposure, pollutant stress), in comparison to the control sample.

Example of enrichment analysis with g:Profiler, provide a list of differentially expressed genes, and obtain which GO and pathway are significantly enriched. gprofiler

Functional annotation databases

Software required (on remote server)

Software required (locally)

BEFORE WE START

Introduction

Genomic workflow: a recap

What do we mean by genome annotation?

Database and tools for functional annotation

Relevant resources

1. Wandering through databases

TASK 1

2. How to summarize the overwhelming amount of information of functional annotation: pathway analysis and Gene Ontology (GO)

2.1 Pathway analysis

** KEGG pathways: Kyoto Encycplopedia of Genes and Genomes**

MetaCyc and BioCyc

2.2 Gene Ontology

3. Annotation: a case study using cyanobacteria

3.1 Assembly download (Already available in data/)

3.2 Prokka: rapid prokaryotic genome annotation

3.2.1 Prokka output

TASK2

3.2.2. Functional annotation in Prokka

TASK 3

3.3 Functional annotation with EggNOG mapper

3.4 Visualize and summarize results

Krona plot visualization

3.5 Use GO and pathways

TASK4

4 Enrichment analysis application: Transcriptomics

KEGG pathways: Kyoto Encycplopedia of Genes and Genomes

3.1 Assembly download (Already available in `data/`)