The gene orthology and paralogy predictions are generated by a pipeline where maximum likelihood phylogenetic gene trees (generated by TreeBeST) play a central role. They aim to represent the evolutionary history of gene families, i.e. genes that diverged from a common ancestor. These gene trees reconciled with their species tree have their internal nodes annotated to distinguish duplication or speciation events.
This methodology was published in EnsemblCompara GeneTrees: Analysis of complete, duplication aware phylogenetic trees in vertebrates. Vilella AJ, Severin J, Ureta-Vidal A, Durbin R, Heng L, Birney E. Genome Research 2008 Nov 4. Please cite this publication if you use the homologies from Ensembl.
There is a clear concordance with reciprocal best approaches in the simple case of unique orthologous genes. However, the gene tree pipeline is able to find more complex one-to-many and many-to-many relations. This, for instance, significantly raises the number of Teleost (bony fish) to Mammal orthologs and has even more dramatic effects on Fly/Mammal or Worm/Mammal orthologous gene predictions. Using this approach, we are also able to "time" duplication events which produce paralogs by identifying the most recent common ancestor (=taxonomy level) for a given internal node of the tree.
The gene orthology and paralogy prediction pipeline has 6 basic steps:
* There is an extra optional step which calculates the dN/dS values for the orthology relationships of closely related pairs of species. We use codeml in the PAML4 package (model=0, NSsites=0) for this (ensembl-compara/scripts/homology/codeml.ctl.hash).
We only calculate these dN/dS values for high-coverage closely-related pairs of species. When the species evolutionary distance is too large, the saturation of the dS values biases the estimated dN/dS ratio. When the dS value of a given ortholog gene pair is too large, we mask out the dN/dS ratio, although the N, S, dN, dS and log-likelihood (lnL) values can still be obtained for all pairs.
** A description of the tree building method in TreeBeST: the CDS back-translated protein alignment (i.e., codon alignment) is used to build 5 different trees:
For (1) and (2), TreeBeST uses a modified version of phyml release 2.4.5 which takes an input species tree, and tries to build a gene tree that is consistent with the topology of the species tree. This "species-guided" phyml uses the original phyml tree-search code. However, the objective function maximised during the tree-search is multiplied by an extra likelihood factor not found in the original phyml. This extra likelihood factor reflects the number of duplications and losses inferred in a gene tree, given the topology of the species tree. The species-guided phyml allows the gene tree to have a topology that is inconsistent with the species tree if the alignment strongly supports this. The species tree is based on the NCBI taxonomy tree (subject to some modifications depending on new datasets).
The final tree is made by merging the five trees into one consensus tree using the "tree merging" algorithm. This allows TreeBeST to take advantage of the fact that DNA-based trees often are more accurate for closely related parts of trees and protein-based trees for distant relationships, and that a group of algorithms may outperform others under certain scenarios. The algorithm simultaneously merges the five input trees into a consensus tree. The consensus topology contains clades found in any of the input trees, where the clades chosen are those that minimize the number of duplications and losses inferred, and have the highest bootstrap support. Branch lengths are estimated for the final consensus tree based on the DNA alignment, using phyml with the HKY model.
A stable ID is assigned to each GeneTree. Please refer to the Family and GeneTree stable ID page for more info.
Using the Gene Trees, we can infer the following pairwise relationships.
any gene pairwise relation where the ancestor node is a speciation event. We predict several descriptions of orthologues:
any gene pairwise relation where the ancestor node is a duplication event. We predict several descriptions of paralogues:
Genes in different species and related by a speciation event are defined as orthologs. Depending on the number of genes found in each species, we differentiate among 1:1, 1:many and many:many relationships. Please, refer to the figure where there are examples of the three kinds.
A within_species_paralog corresponds to a relation between 2 genes of the same species where the ancestor node has been labelled as a duplication node. In the previous figure, Hsap2 and Hsap2', and Mmus2 and Mmus2' are two examples of within-species-paralogs. The duplication event relating the paralogues does not need to affect this species only. For example, Mmus2' and Mmus3' are also within_species_paralog but the duplication event has occurred in the common ancestor between species Hsap (human) and species Mmus (mouse). The taxonomy level "times" the duplication event to the ancestor of "Euarchontoglires".
When paralogous genes are too distant to be in the same gene tree, but can still be related as part of a broader "super-family", they are labelled as other_paralog relationships. In this case, the precise taxonomic level of the duplication event is left as undetermined.
A between_species_paralog corresponds to a relation between genes of different species where the ancestor node has been labelled as a duplication node e.g. Mmus1:Hsap2 or Mmus1:Hsap3. Currently, we do not annotate between_species_paralog, except in the following cases:
A paralog labelled as a gene_split is an artefactual type of paralogy. It is commonly realted to fragmented genome assemblies or a gene prediction that is poor in supporting evidences (cDNA, ESTs, proteins, etc.). A putative_gene_split is called when there is no (or little) overlap between the gene fragments in the same species and they lie in different sequence regions in the assembly (see GeneB1/GeneB2 in the image below). In contrast, a contiguous_gene_split is called when the two fragments lie close to each other (<1MB) in the same sequence region and the same strand (see GeneA1/GeneA2 below).