This document describes the schema of EnsEMBL Compara version 26.1. This is revision 0.3 of the document.
The first part of the schema is dedicated to DNA-DNA alignments. It includes alignments made with different alignment tools, as well as a list of syntenic regions.
The second part is dedicated to protein data and contains several types of data.
The first one is a list of protein orthologues. For now, only the longest peptide of each gene is considered in the homology search, and only pairwise comparisons are taken into account. In principle, only Best Reciprocal Hits (BRH) are considered. Nevertheless, extra orthologous relationships are found by taking into account the gene order conservation of neighbouring genes (RHS, Reciprocal Hit based on Synteny around BRH). Additionally, for Human vs. Chimp, relationships Derived from Whole Genome Alignment (DWGA) may be found.
Another part is dedicated to protein clusters (or families). In this case, Metazoan entries of the SwissProt/SPTrEMBL database are also taken into account in order to annotate the clusters.
A third part is planned for domain alignments but is not used yet.
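The Best Reciprocal Hit criterion mentioned above can be illustrated with a minimal sketch. The function name, the input layout (scores keyed by query/hit pairs) and the gene names are hypothetical and chosen for illustration only; this document does not describe how the production pipeline implements the search.

```python
def best_reciprocal_hits(scores_ab, scores_ba):
    """Return sorted (a, b) pairs where b is a's best hit and a is b's best hit.

    scores_ab maps (gene_a, gene_b) -> score for searches of species A against B.
    scores_ba maps (gene_b, gene_a) -> score for the reverse searches.
    """
    def best_hits(scores):
        # Keep, for every query, the hit with the highest score.
        best = {}
        for (query, hit), score in scores.items():
            if query not in best or score > best[query][1]:
                best[query] = (hit, score)
        return {query: hit for query, (hit, _) in best.items()}

    best_ab = best_hits(scores_ab)
    best_ba = best_hits(scores_ba)
    # A pair is reciprocal when each gene is the other's best hit.
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: a1 and b1 pick each other, a2 also prefers b1.
scores_ab = {("a1", "b1"): 300.0, ("a1", "b2"): 120.0, ("a2", "b1"): 250.0}
scores_ba = {("b1", "a1"): 310.0, ("b1", "a2"): 240.0, ("b2", "a1"): 100.0}
print(best_reciprocal_hits(scores_ab, scores_ba))
```

In this toy input only the pair a1/b1 is reciprocal. Relationships such as a2/b1 are the kind that a synteny-based criterion like RHS could examine further.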
The meta table contains configuration variables.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
meta_id | int unsigned | | PRI | NULL | | internal unique ID |
meta_key | varchar(40) | | MUL | | | |
meta_value | varchar(255) | | MUL | | | |
Current values are:
    mysql> SELECT * FROM meta;
    +---------+----------------------+------------+
    | meta_id | meta_key             | meta_value |
    +---------+----------------------+------------+
    |       1 | max_alignment_length | 108838     |
    +---------+----------------------+------------+
max_alignment_length is used by the Perl API module GenomicAlignAdaptor.pm to speed up data queries.
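One plausible way a maximum length can speed up a range query is sketched below. It is an assumption about the general technique, not a transcription of GenomicAlignAdaptor.pm. No alignment is longer than max_alignment_length, so an alignment overlapping a region cannot start more than that many positions before the region. The function name and the `%s` placeholder style are hypothetical.

```python
def overlapping_alignments_query(dnafrag_id, start, end, max_alignment_length):
    """Build a parameterised query for genomic_align rows overlapping [start, end].

    The last condition is the assumed optimisation: it gives dnafrag_start a
    lower bound, so an index on (dnafrag_id, dnafrag_start) can be scanned
    over a closed range instead of an open-ended one.
    """
    sql = (
        "SELECT genomic_align_id FROM genomic_align "
        "WHERE dnafrag_id = %s "
        "AND dnafrag_start <= %s "   # the alignment starts before the region ends
        "AND dnafrag_end >= %s "     # the alignment ends after the region starts
        "AND dnafrag_start >= %s"    # assumed extra bound from max_alignment_length
    )
    return sql, (dnafrag_id, end, start, start - max_alignment_length)


sql, params = overlapping_alignments_query(19, 25833000, 25834000, 108838)
print(sql)
print(params)
```

The extra bound does not change the result set, because any row excluded by it would have to be longer than max_alignment_length to reach the region.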
The taxon table contains all taxa used in this database.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
taxon_id | int(10) unsigned | | PRI | 0 | | unique ID |
genus | varchar(50) | YES | MUL | NULL | | e.g. Homo |
species | varchar(50) | YES | | NULL | | e.g. sapiens |
sub_species | varchar(50) | YES | | NULL | | |
common_name | varchar(100) | YES | MUL | NULL | | e.g. Human |
classification | mediumtext | YES | | NULL | | Full taxonomic classification |
E.g. the rows
    mysql> SELECT * FROM taxon WHERE taxon_id IN (9031, 9606);
    +----------+--------+---------+-------------+-------------+-···
    | taxon_id | genus  | species | sub_species | common_name |
    +----------+--------+---------+-------------+-------------+-···
    |     9031 | Gallus | gallus  | NULL        | Chicken     |
    |     9606 | Homo   | sapiens | NULL        | Human       |
    +----------+--------+---------+-------------+-------------+-···
    ···-+-------------------------------------------------------------------···
        | classification
    ···-+-------------------------------------------------------------------···
        | gallus Gallus Phasianinae Phasianidae Galliformes Neognathae Aves
        | sapiens Homo Hominidae Catarrhini Primates Eutheria Mammalia
    ···-+-------------------------------------------------------------------···
    ···-------------------------------------------------------------------------+
                                                                                |
    ···-------------------------------------------------------------------------+
        Archosauria Euteleostomi Vertebrata Craniata Chordata Metazoa Eukaryota |
        Euteleostomi Vertebrata Craniata Chordata Metazoa Eukaryota             |
    ···-------------------------------------------------------------------------+
correspond to the Chicken and the Human species.
The genome_db table contains information about the version of the genome assemblies used in this database.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
genome_db_id | int(10) unsigned | | PRI | NULL | auto_increment | internal unique ID |
taxon_id | int(10) unsigned | | | 0 | | external reference to taxon.taxon_id |
name | varchar(40) | | MUL | | | name of the species |
assembly | varchar(100) | | | | | assembly version of the genome |
assembly_default | tinyint(1) | YES | | 1 | | boolean value describing whether this assembly is the default one, so that more than one assembly version can be handled for a given species |
genebuild | varchar(100) | | | | | version of the genebuild |
locator | varchar(255) | YES | | NULL | | used for production purposes or for user configuration in in-house installations |
E.g. the rows
    mysql> SELECT * FROM genome_db WHERE genome_db_id IN (1, 11);
    +--------------+----------+-------------------------+------------+-···
    | genome_db_id | taxon_id | name                    | assembly   |
    +--------------+----------+-------------------------+------------+-···
    |            1 |     9606 | Homo sapiens            | NCBI34     |
    |           11 |     9031 | Gallus gallus           | WASHUC1    |
    +--------------+----------+-------------------------+------------+-···
    ···-+------------------+---------------+---------+
        | assembly_default | genebuild     | locator |
    ···-+------------------+---------------+---------+
        |                1 | 0405Ensembl   | NULL    |
        |                1 | 0406Ensembl   | NULL    |
    ···-+------------------+---------------+---------+
correspond to the Human and Chicken genomes.
The method_link table contains the list of alignment methods used to find links (homologies) between DNA sequences.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
method_link_id | int(10) unsigned | | PRI | NULL | auto_increment | internal unique ID |
type | varchar(50) | | MUL | | | the common name of the linking method between species |
Current values are:
    mysql> SELECT * FROM method_link;
    +----------------+----------------------+
    | method_link_id | type                 |
    +----------------+----------------------+
    |              1 | BLASTZ_NET           |
    |              2 | BLASTZ_NET_TIGHT     |
    |              3 | BLASTZ_RECIP_NET     |
    |              4 | PHUSION_BLASTN       |
    |              5 | PHUSION_BLASTN_TIGHT |
    |              6 | TRANSLATED_BLAT      |
    |              7 | BLASTZ_GROUP         |
    |              8 | BLASTZ_GROUP_TIGHT   |
    |            101 | SYNTENY              |
    |            201 | ENSEMBL_ORTHOLOGUES  |
    |            202 | ENSEMBL_PARALOGUES   |
    |            301 | FAMILY               |
    +----------------+----------------------+
The method_link_species_set table contains information about the comparisons stored in the database. A given method_link_species_set_id exists for each comparison made and relates a method_link.method_link_id to a set of genome_db.genome_db_id values.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
method_link_species_set_id | int(10) unsigned | | MUL | NULL | auto_increment | internal ID |
method_link_id | int(10) unsigned | YES | MUL | NULL | | external reference to method_link.method_link_id |
genome_db_id | int(10) unsigned | YES | | NULL | | external reference to genome_db.genome_db_id |
E.g. the rows
    mysql> SELECT * FROM method_link_species_set WHERE method_link_species_set_id = 71;
    +----------------------------+----------------+--------------+
    | method_link_species_set_id | method_link_id | genome_db_id |
    +----------------------------+----------------+--------------+
    |                         71 |              1 |            1 |
    |                         71 |              1 |           11 |
    +----------------------------+----------------+--------------+
mean that BLASTZ_NET (method_link_id = 1) has been used for linking all the species of this set: Human (genome_db_id = 1) and Chicken (genome_db_id = 11).
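The lookup described above can be expressed as a join across the three tables. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for the MySQL server, and it loads just the columns and sample rows quoted in this document.

```python
import sqlite3

# In-memory stand-in for the Compara database (assumption: SQLite instead of MySQL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE method_link (method_link_id INTEGER, type TEXT);
CREATE TABLE genome_db (genome_db_id INTEGER, name TEXT);
CREATE TABLE method_link_species_set (
    method_link_species_set_id INTEGER,
    method_link_id INTEGER,
    genome_db_id INTEGER
);
INSERT INTO method_link VALUES (1, 'BLASTZ_NET');
INSERT INTO genome_db VALUES (1, 'Homo sapiens');
INSERT INTO genome_db VALUES (11, 'Gallus gallus');
INSERT INTO method_link_species_set VALUES (71, 1, 1);
INSERT INTO method_link_species_set VALUES (71, 1, 11);
""")

# Resolve the method and the species names for one method_link_species_set_id.
rows = conn.execute(
    """
    SELECT ml.type, gdb.name
    FROM method_link_species_set mlss
    JOIN method_link ml ON ml.method_link_id = mlss.method_link_id
    JOIN genome_db gdb ON gdb.genome_db_id = mlss.genome_db_id
    WHERE mlss.method_link_species_set_id = ?
    ORDER BY gdb.genome_db_id
    """,
    (71,),
).fetchall()
print(rows)
```

The same SELECT statement should also be usable against MySQL with the appropriate placeholder syntax for the driver in use.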
The dnafrag table defines the genomic sequences used in the comparative genomics analysis. It is used by the genomic_align table to define aligned sequences. It is also used by the dnafrag_region table to define syntenic regions.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
dnafrag_id | int(10) unsigned | | PRI | NULL | auto_increment | internal unique ID |
length | int(11) | | | 0 | | |
name | varchar(40) | | MUL | | | name of the DNA sequence (e.g., the name of the chromosome) |
genome_db_id | int(10) unsigned | | | 0 | | external reference to genome_db.genome_db_id |
coord_system_name | varchar(40) | YES | | NULL | | refers to the coord system in which this dnafrag has been defined |
E.g. the rows
    mysql> SELECT * FROM dnafrag WHERE dnafrag_id IN (19, 23227);
    +------------+-----------+------+--------------+-------------------+
    | dnafrag_id | length    | name | genome_db_id | coord_system_name |
    +------------+-----------+------+--------------+-------------------+
    |         19 | 105311216 | 14   |            1 | chromosome        |
    |      23227 |  56310377 | 5    |           11 | chromosome        |
    +------------+-----------+------+--------------+-------------------+
refer to chromosome 14 of the Human genome (genome_db.genome_db_id = 1 refers to the Human genome in this example), which is 105311216 nucleotides long, and to chromosome 5 of the Chicken genome (genome_db.genome_db_id = 11 refers to the Chicken genome in this example), which is 56310377 nucleotides long.
The genomic_align_block table is the key table for the genomic alignments. The software used to align the genomic blocks is referenced through an external key to the method_link_species_set table, which in turn points to the method_link table. The actual aligned sequences, however, are defined in the genomic_align table.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
genomic_align_block_id | bigint(20) unsigned | | PRI | NULL | auto_increment | internal unique ID |
method_link_species_set_id | int(10) unsigned | | MUL | 0 | | external reference to method_link_species_set.method_link_species_set_id |
score | double | YES | | NULL | | score returned by the homology search program |
perc_id | tinyint(3) unsigned | YES | | NULL | | Used for pairwise comparison. Defines the percentage of identity between both sequences |
length | int(10) | YES | | NULL | | total length of the alignment |
E.g. the row
    mysql> SELECT * FROM genomic_align_block WHERE genomic_align_block_id = 5095912;
    +------------------------+----------------------------+-------+---------+--------+
    | genomic_align_block_id | method_link_species_set_id | score | perc_id | length |
    +------------------------+----------------------------+-------+---------+--------+
    |                5095912 |                         71 |  1102 |      60 |     53 |
    +------------------------+----------------------------+-------+---------+--------+
refers to a BLASTZ_NET alignment between the human and chicken genomes (method_link_species_set.method_link_species_set_id = 71) with a score of 1102, an identity of 60% and a length of 53 nucleotides. The actual sequences corresponding to this alignment are defined in the genomic_align table.
The genomic_align table contains the coordinates and all the information needed to rebuild genomic alignments. Every entry corresponds to one of the aligned sequences. It also contains an external key to the method_link_species_set table, which refers to the software and set of species used to obtain the corresponding alignment. The aligned sequence is defined by an external reference to the dnafrag table, the starting and ending positions within this dnafrag, the strand and a cigar_line.
The aligned sequence itself is not stored, but it can be rebuilt from the cigar_line field and the original sequence. The cigar line defines the sequence of matches/mismatches and deletions (or gaps). For example, the cigar line 2MD3M2D2M means that the alignment contains 2 matches/mismatches, 1 deletion (the number 1 is omitted in order to save some space), 3 matches/mismatches, 2 deletions and 2 matches/mismatches. If the original sequence is AACGCTT, the aligned sequence will be:
M | M | D | M | M | M | D | D | M | M |
---|---|---|---|---|---|---|---|---|---|
A | A | - | C | G | C | - | - | T | T |
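The expansion shown in the table above can be sketched in a few lines. The function name apply_cigar is hypothetical; the input AACGCTT is simply the aligned sequence above with its gaps removed. Only the M and D operations described in this section are handled.

```python
import re


def apply_cigar(sequence, cigar_line):
    """Rebuild a gapped sequence from an ungapped one and an M/D cigar line."""
    if not re.fullmatch(r"(\d*[MD])*", cigar_line):
        raise ValueError("unsupported cigar line: %r" % cigar_line)
    aligned = []
    pos = 0
    for count, op in re.findall(r"(\d*)([MD])", cigar_line):
        n = int(count) if count else 1  # an omitted count means 1
        if op == "M":
            # Match/mismatch: copy n residues from the original sequence.
            aligned.append(sequence[pos:pos + n])
            pos += n
        else:
            # Deletion: insert n gap characters.
            aligned.append("-" * n)
    if pos != len(sequence):
        raise ValueError("cigar line does not cover the whole sequence")
    return "".join(aligned)


print(apply_cigar("AACGCTT", "2MD3M2D2M"))
```

For the example above this prints AA-CGC--TT, which matches the second row of the table.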
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
genomic_align_id | bigint(20) unsigned | | PRI | NULL | auto_increment | unique internal ID |
genomic_align_block_id | bigint(10) unsigned | | MUL | 0 | | external reference to genomic_align_block.genomic_align_block_id |
method_link_species_set_id | int(10) unsigned | | | 0 | | external reference to method_link_species_set.method_link_species_set_id. This information is redundant because it also appears in the genomic_align_block table, but it is used to speed up the queries |
dnafrag_id | int(10) unsigned | | MUL | 0 | | external reference to dnafrag.dnafrag_id |
dnafrag_start | int(10) | | | 0 | | starting position within the dnafrag defined by dnafrag_id |
dnafrag_end | int(10) | | | 0 | | ending position within the dnafrag defined by dnafrag_id |
dnafrag_strand | tinyint(4) | | | 0 | | strand in the dnafrag defined by dnafrag_id |
cigar_line | mediumtext | YES | | NULL | | internal description of the aligned sequence |
level_id | tinyint(2) | | | 0 | | level of the orthologous layer. 1 corresponds to the first layer of orthologous sequences found; 2 and over are additional layers. Used for building the syntenies (based on level_id = 1 only) |
E.g. the rows
    mysql> SELECT * FROM genomic_align WHERE genomic_align_block_id = 5095912;
    +------------------+------------------------+----------------------------+------------+-···
    | genomic_align_id | genomic_align_block_id | method_link_species_set_id | dnafrag_id |
    +------------------+------------------------+----------------------------+------------+-···
    |         10191817 |                5095912 |                         71 |      23227 |
    |         10191830 |                5095912 |                         71 |         19 |
    +------------------+------------------------+----------------------------+------------+-···
    ···-+---------------+-------------+----------------+------------+----------+
        | dnafrag_start | dnafrag_end | dnafrag_strand | cigar_line | level_id |
    ···-+---------------+-------------+----------------+------------+----------+
        |      30021505 |    30021554 |              1 | 37M3D13M   |        2 |
        |      25833506 |    25833558 |             -1 | 53M        |        2 |
    ···-+---------------+-------------+----------------+------------+----------+
correspond to the two sequences included in the alignment described above (see the genomic_align_block table description). The first sequence includes the nucleotides from 30021505 to 30021554 on the forward strand of chromosome 5 of the Chicken genome (dnafrag.dnafrag_id = 23227). The second sequence includes the nucleotides from 25833506 to 25833558 on the reverse strand of chromosome 14 of the Human genome (dnafrag.dnafrag_id = 19).
The aligned sequences can be rebuilt using the original sequences fetched from the corresponding core databases and the cigar lines, as explained before.
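As a quick consistency check on the two example rows, the cigar lines can be compared with the stored coordinates. In the sketch below (the helper name cigar_lengths is hypothetical), the number of M positions is compared with dnafrag_end - dnafrag_start + 1, and the total number of positions is compared with the block length of 53 reported in genomic_align_block.

```python
import re


def cigar_lengths(cigar_line):
    """Return (alignment_length, sequence_length) for an M/D cigar line."""
    alignment_length = 0
    sequence_length = 0
    for count, op in re.findall(r"(\d*)([MD])", cigar_line):
        n = int(count) if count else 1  # an omitted count means 1
        alignment_length += n           # M and D both occupy alignment columns
        if op == "M":
            sequence_length += n        # only M consumes residues of the sequence
    return alignment_length, sequence_length


# (dnafrag_start, dnafrag_end, cigar_line) taken from the example rows.
example_rows = [
    (30021505, 30021554, "37M3D13M"),  # Chicken chromosome 5
    (25833506, 25833558, "53M"),       # Human chromosome 14
]
for dnafrag_start, dnafrag_end, cigar_line in example_rows:
    alignment_length, sequence_length = cigar_lengths(cigar_line)
    print(cigar_line, alignment_length, sequence_length,
          dnafrag_end - dnafrag_start + 1)
```

Both rows give an alignment length of 53, and the sequence lengths (50 and 53) agree with the coordinate ranges.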
The genomic_align_group table is used to group alignments.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
group_id | bigint(20) unsigned | | MUL | NULL | auto_increment | internal ID |
type | varchar(40) | | | | | This field allows us to group genomic_aligns in several ways (types) |
genomic_align_id | bigint(20) unsigned | | MUL | 0 | | external reference to genomic_align.genomic_align_id |
E.g. the rows
    mysql> SELECT * FROM genomic_align_group WHERE group_id = 64065;
    +----------+---------+------------------+
    | group_id | type    | genomic_align_id |
    +----------+---------+------------------+
    |    64065 | default |          1261387 |
    |    64065 | default |          1261382 |
    |    64065 | default |          1261429 |
    |    64065 | default |          1261422 |
    |    64065 | default |          8250806 |
    |    64065 | default |          8250793 |
    |    64065 | default |         10191710 |
    |    64065 | default |         10191697 |
    |    64065 | default |         10191750 |
    |    64065 | default |         10191737 |
    |    64065 | default |         10191789 |
    |    64065 | default |         10191778 |
    |    64065 | default |         10191830 |
    |    64065 | default |         10191817 |
    |    64065 | default |         10191868 |
    |    64065 | default |         10191856 |
    |    64065 | default |         10191909 |
    |    64065 | default |         10191896 |
    |    64065 | default |         11109668 |
    |    64065 | default |         11109655 |
    |    64065 | default |         14443871 |
    |    64065 | default |         14443860 |
    |    64065 | default |         19780097 |
    |    64065 | default |         19780088 |
    |    64065 | default |         22515872 |
    |    64065 | default |         22515866 |
    |    64065 | default |         23205949 |
    |    64065 | default |         23205943 |
    |    64065 | default |         31071852 |
    |    64065 | default |         31071843 |
    |    64065 | default |         54076401 |
    |    64065 | default |         54076402 |
    +----------+---------+------------------+
correspond to the group of genomic_align entries that includes the alignment described before.
The synteny_region table contains all the syntenic relationships found and the relative orientation of both syntenic regions.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
synteny_region_id | int(10) unsigned | | PRI | NULL | auto_increment | internal unique ID |
method_link_species_set_id | int(10) unsigned | | | 0 | | external reference to method_link_species_set.method_link_species_set_id |
rel_orientation | tinyint(1) | | | 1 | | 1 when both regions are in the same orientation, 0 otherwise |
E.g. the row
    mysql> SELECT * FROM synteny_region WHERE synteny_region_id = 1849;
    +-------------------+-----------------+
    | synteny_region_id | rel_orientation |
    +-------------------+-----------------+
    |              1849 |               1 |
    +-------------------+-----------------+
means that the syntenic region 1849 corresponds to a syntenic relationship where both genomic regions are in the same orientation. See dnafrag_region table for more details.
The dnafrag_region table contains the genomic regions corresponding to every syntenic relationship found. There are two genomic regions for every syntenic relationship.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
synteny_region_id | int(10) unsigned | | PRI | 0 | | external reference to synteny_region.synteny_region_id |
dnafrag_id | int(10) unsigned | | PRI | 0 | | external reference to dnafrag.dnafrag_id |
dnafrag_start | int(10) unsigned | | | 0 | | first nucleotide from this dnafrag which is in synteny |
dnafrag_end | int(10) unsigned | | | 0 | | last nucleotide from this dnafrag which is in synteny |
E.g. the rows
    mysql> SELECT * FROM dnafrag_region WHERE synteny_region_id = 1849;
    +-------------------+------------+---------------+--------------+
    | synteny_region_id | dnafrag_id | dnafrag_start | dnafrag_end  |
    +-------------------+------------+---------------+--------------+
    |              1849 |         19 |      23939018 |     37861860 |
    |              1849 |      23227 |      29431279 |     34526887 |
    +-------------------+------------+---------------+--------------+
correspond to both genomic regions of the syntenic region 1849. In this case, the first genomic region corresponds to the sequence from 23939018 to 37861860 of the chromosome 14 of the Human genome (dnafrag_id = 19 for this chromosome) and the second one corresponds to the sequence from 29431279 to 34526887 of the chromosome 5 of the Chicken genome (dnafrag_id = 23227 for this chromosome). Using the synteny_region table, we know that both syntenic regions are in the same orientation.
The member table links sequences to the EnsEMBL core DB or to external DBs.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
member_id | int(10) unsigned | | PRI | NULL | auto_increment | internal unique ID |
stable_id | varchar(40) | | | | | EnsEMBL stable ID or external ID (for Uniprot/SWISSPROT and Uniprot/SPTREMBL) |
version | int(10) | YES | | 0 | | version of the stable ID (see EnsEMBL core DB) |
source_name | varchar(40) | | | | | describes the source of the member (Uniprot/SWISSPROT, Uniprot/SPTREMBL, ENSEMBLGENE, ENSEMBLPEP) |
taxon_id | int(10) unsigned | | | 0 | | external reference to taxon.taxon_id |
genome_db_id | int(10) unsigned | YES | | NULL | | external reference to genome_db.genome_db_id |
sequence_id | int(10) unsigned | | MUL | NULL | | external reference to sequence.sequence_id. May be 0 when the sequence is not available in the sequence table, e.g. for a gene instance |
gene_member_id | int(10) unsigned | | MUL | NULL | | external reference to member.member_id to allow linkage from peptides to genes |
description | varchar(255) | YES | | NULL | | the description of the protein |
chr_name | varchar(40) | YES | | NULL | | chromosome where this sequence is located |
chr_start | int(10) | YES | | NULL | | first nucleotide of this chromosome which corresponds to this member |
chr_end | int(10) | YES | | NULL | | last nucleotide of this chromosome which corresponds to this member |
chr_strand | tinyint(1) | | | 0 | | strand of the chromosome in which the member is |
E.g. the row
    mysql> SELECT * FROM member WHERE member_id = 1945;
    +-----------+--------------------+---------+-------------+----------+--------------+-------------+-···
    | member_id | stable_id          | version | source_name | taxon_id | genome_db_id | sequence_id |
    +-----------+--------------------+---------+-------------+----------+--------------+-------------+-···
    |      1945 | ENSGALP00000012979 |       1 | ENSEMBLPEP  |     9031 |           11 |        1121 |
    +-----------+--------------------+---------+-------------+----------+--------------+-------------+-···
    ···-+----------------+-···
        | gene_member_id |
    ···-+----------------+-···
        |           1932 |
    ···-+----------------+-···
    ···-+---------------------------------------------------------------------------------------+-···
        | description                                                                           |
    ···-+---------------------------------------------------------------------------------------+-···
        | Transcript:ENSGALT00000012994 Gene:ENSGALG00000008004 Chr:1 Start:8903413 End:8906606 |
    ···-+---------------------------------------------------------------------------------------+-···
    ···-+----------+-----------+---------+------------+
        | chr_name | chr_start | chr_end | chr_strand |
    ···-+----------+-----------+---------+------------+
        | 1        |   8903413 | 8906606 |         -1 |
    ···-+----------+-----------+---------+------------+
refers to the chicken (taxon_id = 9031 or genome_db_id = 11) peptide ENSGALP00000012979 (source_name = ENSEMBLPEP), which is located on chromosome 1 (from 8903413 to 8906606, on the reverse strand). This member is described as "Transcript:ENSGALT00000012994 Gene:ENSGALG00000008004 Chr:1 Start:8903413 End:8906606" and its sequence can be found in the sequence table.
The sequence table contains the protein sequences of the entries in the member table that are used in the protein alignment part of the EnsEMBL Compara DB.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
sequence_id | int(10) unsigned | | PRI | NULL | auto_increment | internal unique ID |
sequence | longtext | YES | | NULL | | the sequence |
length | int(10) | YES | | NULL | | the length of the sequence |
E.g. the row
    mysql> select * from sequence where sequence_id = 1121;
    +-------------+-------------------------------------------+--------+
    | sequence_id | sequence                                  | length |
    +-------------+-------------------------------------------+--------+
    |        1121 | LPNTRGYTQVWECSLAVLIAMVCMTLVGWGLIWLFSVTASV |     41 |
    +-------------+-------------------------------------------+--------+
contains a sequence that is 41 amino acids long.
The analysis table is mainly used for production purposes.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
analysis_id | int(10) unsigned | | PRI | NULL | auto_increment | internal unique ID |
created | datetime | | | 0000-00-00 00:00:00 | | date to distinguish newer and older versions of the same analysis. Not well maintained so far. |
logic_name | varchar(40) | | UNI | | | string to identify the analysis. Used mainly inside the pipeline. |
db | varchar(120) | YES | | NULL | | db should be a database name, db_version the version of that db |
db_version | varchar(40) | YES | | NULL | | |
db_file | varchar(120) | YES | | NULL | | the file system location of that database, probably wiser to generate from just db and configurations |
program | varchar(80) | YES | | NULL | | The binary used to create a feature. Similar semantics to above |
program_version | varchar(40) | YES | | NULL | | |
program_file | varchar(80) | YES | | NULL | | |
parameters | varchar(255) | YES | | NULL | | a parameter string which is processed by the Perl module |
module | varchar(80) | YES | | NULL | | Perl module names (RunnableDBs usually) executing this analysis |
module_version | varchar(40) | YES | | NULL | | |
gff_source | varchar(40) | YES | | NULL | | how to make a gff dump from features with this analysis |
gff_feature | varchar(40) | YES | | NULL | | |
The analysis_description table has been added to comply with the requirements of the core Bio::EnsEMBL::DBSQL::AnalysisAdaptor.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
analysis_id | int(10) unsigned | | MUL | 0 | | external reference to analysis.analysis_id |
description | text | YES | | NULL | | |
display_label | varchar(255) | YES | | NULL | | |
The peptide_align_feature table stores the raw HSP local alignment results of peptide-to-peptide alignments returned by a BLAST run. Each row is translated from a FeaturePair object.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
peptide_align_feature_id | int(10) unsigned | | PRI | NULL | auto_increment | internal unique ID |
qmember_id | int(10) unsigned | | MUL | 0 | | external reference to member.member_id for the query peptide |
hmember_id | int(10) unsigned | | MUL | 0 | | external reference to member.member_id for the hit peptide |
qgenome_db_id | int(10) unsigned | | | 0 | | external reference to genome_db.genome_db_id for the query peptide (for query optimization) |
hgenome_db_id | int(10) unsigned | | | 0 | | external reference to genome_db.genome_db_id for the hit peptide (for query optimization) |
analysis_id | int(10) unsigned | | | 0 | | external reference to analysis.analysis_id |
qstart | int(10) | | | 0 | | starting position in the query peptide sequence |
qend | int(10) | | | 0 | | ending position in the query peptide sequence |
hstart | int(10) | | | 0 | | starting position in the hit peptide sequence |
hend | int(10) | | | 0 | | ending position in the hit peptide sequence |
score | double(16,4) | | | 0.0000 | | blast score for this HSP |
evalue | varchar(20) | YES | | NULL | | blast evalue for this HSP |
align_length | int(10) | YES | | NULL | | alignment length of HSP |
identical_matches | int(10) | YES | | NULL | | blast HSP match score |
perc_ident | int(10) | YES | | NULL | | percent identical matches in the HSP length |
positive_matches | int(10) | YES | | NULL | | blast HSP positive score |
perc_pos | int(10) | YES | | NULL | | percent positive matches in the HSP length |
hit_rank | int(10) | YES | | NULL | | rank in blast result |
cigar_line | mediumtext | YES | | NULL | | cigar string coding the actual alignment |
E.g. the rows:
    mysql> SELECT * FROM peptide_align_feature WHERE qmember_id = 442105;
    +--------------------------+------------+------------+---------------+---------------+-···
    | peptide_align_feature_id | qmember_id | hmember_id | qgenome_db_id | hgenome_db_id |
    +--------------------------+------------+------------+---------------+---------------+-···
    |                        1 |     442105 |     248885 |             3 |            12 |
    |                        2 |     442105 |      86297 |             3 |            12 |
    |                        3 |     442105 |     215369 |             3 |            12 |
    |                        4 |     442105 |      67917 |             3 |            12 |
    |                        5 |     442105 |     182642 |             3 |            12 |
    |                        6 |     442105 |     212528 |             3 |            12 |
    |                        7 |     442105 |     215342 |             3 |            12 |
    |                        8 |     442105 |     260556 |             3 |             5 |
    +--------------------------+------------+------------+---------------+---------------+-···
    ···-+-------------+--------+------+--------+------+----------+---------+--------------+-···
        | analysis_id | qstart | qend | hstart | hend | score    | evalue  | align_length |
    ···-+-------------+--------+------+--------+------+----------+---------+--------------+-···
        |          14 |      3 |  240 |     38 |  276 | 276.0000 | 1.1e-25 |          243 |
        |          14 |     32 |  244 |      8 |  220 | 224.0000 | 3.5e-20 |          219 |
        |          14 |     30 |  213 |      8 |  198 | 212.0000 | 6.6e-19 |          194 |
        |          14 |     32 |  244 |      2 |  219 | 176.0000 | 2.2e-14 |          224 |
        |          14 |     36 |  227 |      1 |  201 | 161.0000 | 5.2e-13 |          205 |
        |          14 |     30 |  219 |      2 |  208 | 158.0000 | 8.5e-12 |          214 |
        |          14 |     29 |  214 |      7 |  201 | 158.0000 | 9.8e-12 |          201 |
        |          15 |     32 |  242 |      8 |  224 | 249.0000 | 1.2e-22 |          220 |
    ···-+-------------+--------+------+--------+------+----------+---------+--------------+-···
    ···-+-------------------+------------+------------------+----------+----------+-···
        | identical_matches | perc_ident | positive_matches | perc_pos | hit_rank |
    ···-+-------------------+------------+------------------+----------+----------+-···
        |                80 |         32 |              124 |       51 |        1 |
        |                66 |         30 |              112 |       51 |        2 |
        |                63 |         32 |              104 |       53 |        3 |
        |                60 |         26 |              107 |       47 |        4 |
        |                61 |         29 |               99 |       48 |        5 |
        |                61 |         28 |               98 |       45 |        6 |
        |                57 |         28 |               97 |       48 |        7 |
        |                73 |         33 |              109 |       49 |        1 |
    ···-+-------------------+------------+------------------+----------+----------+-···
    ···-+---------------------------------------------+
        | cigar_line                                  |
    ···-+---------------------------------------------+
        | 23M2I34MD42MI13MI51MD43M3D28M               |
        | 31M2D9M3I27M2D15MI56MD39MI7MD20MI3M         |
        | 27M2D7MD23M2D20M2I16MI33M2D9MD8M2D38M       |
        | 40M3D8M3D3M2D24M2I8MI33MD21MD43MD16M3I11M   |
        | 33MD5M2D8M5D20MI24MI29M4D17MD42M2I10M       |
        | 25M3D17M2D21M6I57M13D18MD3M4DMI36MD5M       |
        | 30MD7M3D9MI31M2I5M2D17M3I22M3D13M4D29M2D17M |
        | 33M2D7M2D20M4D9MI16MI55MD65MI3M             |
    ···-+---------------------------------------------+
correspond to all the hits found for the rat peptide defined by member.member_id = 442105.
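The numeric columns of the example rows are related to each other, which a short check makes visible. The sketch below is an observation on the sample data, not a documented rule. In these rows, perc_ident and perc_pos equal the match counts divided by align_length, truncated to an integer, and the counts in each cigar line (M, I and D together) add up to align_length. The helper names are hypothetical.

```python
import re


def cigar_length(cigar_line):
    """Total number of alignment columns in an M/I/D cigar line."""
    return sum(int(count) if count else 1
               for count, _op in re.findall(r"(\d*)([MID])", cigar_line))


def percent(matches, align_length):
    """Integer percentage, truncated (as observed in the sample rows)."""
    return 100 * matches // align_length


# (identical_matches, positive_matches, align_length, perc_ident, perc_pos, cigar_line)
sample_rows = [
    (80, 124, 243, 32, 51, "23M2I34MD42MI13MI51MD43M3D28M"),
    (66, 112, 219, 30, 51, "31M2D9M3I27M2D15MI56MD39MI7MD20MI3M"),
    (73, 109, 220, 33, 49, "33M2D7M2D20M4D9MI16MI55MD65MI3M"),
]
for identical, positive, length, perc_ident, perc_pos, cigar in sample_rows:
    print(percent(identical, length) == perc_ident,
          percent(positive, length) == perc_pos,
          cigar_length(cigar) == length)
```

Every comparison holds for the three rows shown; whether the pipeline truncates or rounds in general is not stated in this document.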
The protein_tree_node table stores the phylogenetic structure of the gene trees.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
node_id | int(10) unsigned | NO | PRI | NULL | auto_increment | node unique ID |
parent_id | int(10) unsigned | NO | MUL | | | parent node ID |
root_id | int(10) unsigned | NO | MUL | | | cluster node ID (usually 1) where this gene tree is attached |
left_index | int(10) | NO | MUL | | | left index of the tree-traversal binary search |
right_index | int(10) | NO | MUL | | | right index of the tree-traversal binary search |
distance_to_parent | double | NO | | 1 | | branch length between this node and its parent node |
E.g. the row:
    mysql> select * from protein_tree_node where node_id=10;
    +---------+-----------+---------+------------+-------------+--------------------+
    | node_id | parent_id | root_id | left_index | right_index | distance_to_parent |
    +---------+-----------+---------+------------+-------------+--------------------+
    |      10 |    707637 |       1 |       7645 |        7646 |           0.001496 |
    +---------+-----------+---------+------------+-------------+--------------------+
corresponds to node_id = 10, with a branch length of 0.001496. The value right_index - left_index = 1 indicates that it is an external node (leaf).
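The leaf test follows from the way left and right indexes are usually assigned in a nested-set encoding. The sketch below assumes the standard scheme, a depth-first traversal with one shared counter; this document does not state how the production pipeline numbers the nodes. The tree and its node names are hypothetical.

```python
def assign_indexes(tree, root):
    """Number a tree depth-first; tree maps a node to its list of children."""
    indexes = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter                 # index on the way down
        for child in tree.get(node, []):
            visit(child)
        counter += 1
        indexes[node] = (left, counter)  # index on the way back up

    visit(root)
    return indexes


def is_leaf(left_index, right_index):
    """A node without children is entered and left on consecutive counts."""
    return right_index - left_index == 1


tree = {"root": ["a", "b"], "a": ["a1", "a2"]}
indexes = assign_indexes(tree, "root")
for node, (left, right) in sorted(indexes.items()):
    print(node, left, right, is_leaf(left, right))

# The example row above: left_index = 7645, right_index = 7646.
print(is_leaf(7645, 7646))
```

Under this numbering, an internal node always spans the indexes of its whole subtree, so its difference is larger than 1.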
The protein_tree_member table stores the relationship between the protein_tree_node and the member tables.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
node_id | int(10) unsigned | NO | PRI | | | external reference to protein_tree_node.node_id |
member_id | int(10) unsigned | NO | MUL | | | external reference to member.member_id |
method_link_species_set_id | int(10) unsigned | NO | | | | method_link_species_set_id |
cigar_line | mediumtext | YES | | NULL | | cigar string coding the alignment |
cigar_start | int(10) | YES | | NULL | | defines the first aligned amino acid |
cigar_end | int(10) | YES | | NULL | | defines the last aligned amino acid |
E.g. the rows:
    mysql> select * from protein_tree_member where node_id=10;
    +---------+-----------+----------------------------+------------------------------+-------------+-----------+
    | node_id | member_id | method_link_species_set_id | cigar_line                   | cigar_start | cigar_end |
    +---------+-----------+----------------------------+------------------------------+-------------+-----------+
    |      10 |    416753 |                      40046 | 106D53MD9M4D4MD42MD47MD47M8D |        NULL |      NULL |
    +---------+-----------+----------------------------+------------------------------+-------------+-----------+
corresponds to node_id = 10.
N.B. The cigar_start and cigar_end fields are NULL in multiple sequence alignments (they are only used in pairwise alignments).
The protein_tree_tag table stores various tags for nodes in protein_tree_node.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
node_id | int(10) unsigned | NO | MUL | | | node unique ID |
tag | varchar(50) | YES | MUL | NULL | | tag |
value | mediumtext | YES | | NULL | | value |
E.g. the rows:
    mysql> select * from protein_tree_tag where node_id=707631;
    +---------+----------------------------+------------------+
    | node_id | tag                        | value            |
    +---------+----------------------------+------------------+
    |  707631 | Bootstrap                  | 70               |
    |  707631 | Duplication                | 0                |
    |  707631 | Sitewise_dNdS_runtime_msec | 58504.8859863281 |
    |  707631 | Sitewise_dNdS_subroot_id   | 8                |
    |  707631 | taxon_alias                | Primates         |
    |  707631 | taxon_id                   | 9443             |
    |  707631 | taxon_name                 | Primates         |
    +---------+----------------------------+------------------+
corresponds to the node_id=707631 and shows all the tags associated with this internal node.
This tables stores the phylogenetic structure of the NCBI taxonomy DB.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
taxon_id | int(10) unsigned | NO | PRI | taxon ID | ||
parent_id | int(10) unsigned | NO | MUL | parent taxon id | ||
rank | char(32) | NO | MUL | rank in the tree of life (family, genus, species, ...) | ||
genbank_hidden_flag | tinyint(1) | NO | 0 | genbank flag | ||
left_index | int(10) | NO | MUL | left index of the tree-traversal binary search | ||
right_index | int(10) | NO | MUL | right index of the tree-traversal binary search | ||
root_id | int(10) | NO | 1 | cluster node id (usually 1) where this taxonomy node is attached |
E.g. the rows:
mysql> select * from ncbi_taxa_node limit 10;
+----------+-----------+--------------+---------------------+------------+-------------+---------+
| taxon_id | parent_id | rank         | genbank_hidden_flag | left_index | right_index | root_id |
+----------+-----------+--------------+---------------------+------------+-------------+---------+
|        1 |         0 | no rank      |                   0 |          1 |      891068 |       1 |
|        2 |    131567 | superkingdom |                   0 |       7665 |      234070 |       1 |
|        6 |    335928 | genus        |                   0 |      97634 |       97669 |       1 |
|        7 |         6 | species      |                   1 |      97635 |       97638 |       1 |
|        9 |     32199 | species      |                   1 |     123547 |      123678 |       1 |
|       10 |    135621 | genus        |                   0 |     129052 |      129111 |       1 |
|       11 |        10 | species      |                   1 |     129075 |      129076 |       1 |
|       13 |    203488 | genus        |                   0 |      50174 |       50197 |       1 |
|       14 |        13 | species      |                   1 |      50179 |       50182 |       1 |
|       16 |     32011 | genus        |                   0 |      70484 |       70531 |       1 |
+----------+-----------+--------------+---------------------+------------+-------------+---------+
correspond to some taxon_ids in the NCBI taxonomy DB
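In the sample rows, the left_index/right_index interval of each species lies inside the interval of its parent genus (e.g. taxon 7 inside taxon 6). This is consistent with a nested-set encoding of the tree. Assuming that interpretation holds for the whole table, all descendants of a taxon can be fetched with a single range comparison instead of walking parent_id links. The sketch below illustrates this in Python, with an in-memory SQLite table standing in for the MySQL database. The helper name descendants is hypothetical and only a subset of the columns and rows is loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE ncbi_taxa_node (
           taxon_id    INTEGER PRIMARY KEY,
           parent_id   INTEGER NOT NULL,
           left_index  INTEGER NOT NULL,
           right_index INTEGER NOT NULL
       )"""
)
# Subset of the example rows: (taxon_id, parent_id, left_index, right_index).
conn.executemany(
    "INSERT INTO ncbi_taxa_node VALUES (?, ?, ?, ?)",
    [
        (6, 335928, 97634, 97669),
        (7, 6, 97635, 97638),
        (10, 135621, 129052, 129111),
        (11, 10, 129075, 129076),
        (13, 203488, 50174, 50197),
        (14, 13, 50179, 50182),
    ],
)


def descendants(conn, taxon_id):
    """Return the taxon_ids nested strictly inside the given taxon.

    Assumption: left_index/right_index form a nested-set encoding, so a
    descendant's interval lies strictly inside its ancestor's interval.
    """
    query = """
        SELECT child.taxon_id
        FROM ncbi_taxa_node AS ancestor
        JOIN ncbi_taxa_node AS child
          ON child.left_index > ancestor.left_index
         AND child.right_index < ancestor.right_index
        WHERE ancestor.taxon_id = ?
        ORDER BY child.left_index
    """
    return [row[0] for row in conn.execute(query, (taxon_id,))]


print(descendants(conn, 6))
```

The same SELECT should work unchanged on the MySQL table, with the placeholder syntax adapted to the driver in use.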
This table stores the names of the NCBI taxonomy entries.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
taxon_id | int(10) unsigned | NO | PRI | external reference to ncbi_taxa_node.taxon_id | ||
name | varchar(255) | YES | MUL | NULL | Name of the taxon | |
name_class | varchar(50) | YES | MUL | NULL | class of name |
E.g. the rows:
mysql> select * from ncbi_taxa_name where taxon_id=9606;
+----------+--------------+---------------------+
| taxon_id | name         | name_class          |
+----------+--------------+---------------------+
|     9606 | Homo sapiens | scientific name     |
|     9606 | human        | genbank common name |
|     9606 | man          | common name         |
|     9606 | Human        | ensembl alias name  |
|     9606 | human        | ensembl common name |
+----------+--------------+---------------------+
correspond to the human taxon_id names in the NCBI taxonomy DB
Contains all the genomic homologies found. There are two homology_member entries for each homology entry for now, but both the schema and the API can handle more than just pairwise relations.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
homology_id | int(10) unsigned | PRI | NULL | auto_increment | internal unique ID | |
stable_id | varchar(40) | YES | NULL | stable ID of the pairwise homology relationship | ||
method_link_species_set_id | int(10) unsigned | MUL | external reference to method_link_species_set.method_link_species_set_id | |||
description | varchar(40) | YES | NULL | describes the type of homology found, e.g. ortholog_one2one | ||
subtype | varchar(40) | NO | The subtype defines the taxonomic level of the relationship found (taxon_name). | |||
dn | float(10,5) | YES | NULL | number of nonsynonymous substitutions per nonsynonymous site | ||
ds | float(10,5) | YES | NULL | number of synonymous substitutions per synonymous site | ||
n | float(10,1) | YES | NULL | number of nonsynonymous sites | ||
s | float(10,1) | YES | NULL | number of synonymous sites | ||
lnl | float(10,3) | YES | NULL | maximum likelihood test value | ||
threshold_on_ds | float(10,5) | YES | NULL | used by the EnsEMBL Web Browser to decide whether or not to display the dN/dS ratio | ||
ancestor_node_id | int(10) unsigned | NO | ancestor node_id of the relationship | |||
tree_node_id | int(10) unsigned | NO | MUL | root node_id for the genetree where this relationship lies |
dN, dS, N, S and lnL are statistical values given by the codeml program of the Phylogenetic Analysis by Maximum Likelihood (PAML) package.
E.g. the row
mysql> select * from homology where homology_id=79;
+-------------+-----------+----------------------------+------------------+-...
| homology_id | stable_id | method_link_species_set_id | description      |
+-------------+-----------+----------------------------+------------------+-...
|          79 | NULL      |                      22467 | ortholog_one2one |
+-------------+-----------+----------------------------+------------------+-...
...+------------+---------+---------+------+------+----------+-----------------+------------------+--------------+
   | subtype    | dn      | ds      | n    | s    | lnl      | threshold_on_ds | ancestor_node_id | tree_node_id |
...+------------+---------+---------+------+------+----------+-----------------+------------------+--------------+
   | Catarrhini | 0.14610 | 0.21260 | 82.1 | 37.9 | -222.752 |         0.17120 |              284 |          284 |
...+------------+---------+---------+------+------+----------+-----------------+------------------+--------------+
defines a one-to-one orthology at the taxonomic level Catarrhini (apes and Old World monkeys).
N.B. At the moment there are no stable IDs for homologies.
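The dn, ds and threshold_on_ds columns are typically combined into a single dN/dS ratio. The following Python sketch shows one way to do that for the example row. The function name dnds_ratio is hypothetical, and the display rule (hide the ratio when ds exceeds threshold_on_ds) is an assumption made for illustration; this document only states that the threshold is used to decide whether to display the ratio.

```python
def dnds_ratio(dn, ds, threshold_on_ds=None):
    """Return dN/dS, or None when the ratio should not be reported.

    ASSUMPTION: the ratio is suppressed when ds is larger than
    threshold_on_ds. The exact rule applied by the web browser is
    not given in this document.
    """
    if dn is None or ds is None or ds == 0:
        return None  # the ratio is undefined without both values
    if threshold_on_ds is not None and ds > threshold_on_ds:
        return None  # hidden under the assumed display rule
    return dn / ds


# Values taken from the homology_id=79 example row.
row = {"dn": 0.14610, "ds": 0.21260, "threshold_on_ds": 0.17120}

raw_ratio = dnds_ratio(row["dn"], row["ds"])
shown_ratio = dnds_ratio(row["dn"], row["ds"], row["threshold_on_ds"])
print(raw_ratio, shown_ratio)
```

For this row the raw ratio is about 0.69, but under the assumed rule it would not be displayed because ds (0.21260) is above the threshold (0.17120).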
Contains the sequences corresponding to every genomic homology relationship found. There are two homology_member entries for each pairwise homology entry. As written in the homology table section, both schema and API can deal with more than pairwise relationships.
The original alignment is not stored but it can be retrieved using the cigar_line field and the original sequences. The cigar line defines the sequence of matches or mismatches and deletions in the alignment.
First peptide sequence: SERCQVVVISIGPISVLSMILDFY
Second peptide sequence: SDRCQVLVISILSMIGLDFY
First corresponding cigar line: 20MD4M
Second corresponding cigar line: 11M5D9M
The alignment will be:
First peptide cigar line | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | D | M | M | M | M |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
First aligned peptide | S | E | R | C | Q | V | V | V | I | S | I | G | P | I | S | V | L | S | M | I | - | L | D | F | Y |
Second aligned peptide | S | D | R | C | Q | V | L | V | I | S | I | - | - | - | - | - | L | S | M | I | G | L | D | F | Y |
Second peptide cigar line | M | M | M | M | M | M | M | M | M | M | M | D | D | D | D | D | M | M | M | M | M | M | M | M | M |
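The reconstruction shown above can be sketched in a few lines of Python. The function name expand_cigar is hypothetical. It assumes that a cigar line contains only M and D operations, each optionally preceded by a count (a missing count means 1), as in the examples of this document.

```python
import re


def expand_cigar(sequence, cigar_line):
    """Rebuild an aligned (gapped) sequence from a sequence and its cigar line.

    M consumes residues from the sequence; D inserts gap characters.
    """
    aligned = []
    pos = 0
    for count, op in re.findall(r"(\d*)([MD])", cigar_line):
        n = int(count) if count else 1  # "D" alone means "1D"
        if op == "M":
            aligned.append(sequence[pos:pos + n])
            pos += n
        else:
            aligned.append("-" * n)
    if pos != len(sequence):
        raise ValueError("cigar line does not match the sequence length")
    return "".join(aligned)


first = expand_cigar("SERCQVVVISIGPISVLSMILDFY", "20MD4M")
second = expand_cigar("SDRCQVLVISILSMIGLDFY", "11M5D9M")
print(first)   # SERCQVVVISIGPISVLSMI-LDFY
print(second)  # SDRCQVLVISI-----LSMIGLDFY
```

Both expanded strings have the same length (25 columns), which is a quick sanity check that the two cigar lines describe the same alignment.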
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
homology_id | int(10) unsigned | PRI | 0 | external reference to homology.homology_id | ||
member_id | int(10) unsigned | PRI | 0 | external reference to member.member_id. It refers to the corresponding gene (ENSEMBL_GENE). | ||
peptide_member_id | int(10) unsigned | YES | NULL | external reference to member.member_id. It refers to the peptide/protein (ENSEMBL_PEP). | ||
peptide_align_feature_id | int(10) unsigned | YES | NULL | external reference to peptide_align_feature.peptide_align_feature_id | ||
cigar_line | mediumtext | YES | NULL | an internal description of the alignment. It contains matches/mismatches (M) and deletions (D) and refers to the corresponding peptide_member_id sequence. | ||
cigar_start | int(10) | YES | NULL | defines the first aligned amino acid | ||
cigar_end | int(10) | YES | NULL | defines the last aligned amino acid | ||
perc_cov | int(10) | YES | NULL | defines the percentage of the peptide which has been aligned | ||
perc_id | int(10) | YES | NULL | defines the percentage of identity between both homologues | ||
perc_pos | int(10) | YES | NULL | defines the percentage of positivity (similarity) between both homologues |
E.g. the rows
mysql> select * from homology_member where homology_id = 296648;
+-------------+-----------+-------------------+--------------------------+------------+-···
| homology_id | member_id | peptide_member_id | peptide_align_feature_id | cigar_line |
+-------------+-----------+-------------------+--------------------------+------------+-···
|      296648 |       685 |               722 |                     NULL | 77M        |
|      296648 |    406456 |            406463 |                     NULL | 77M        |
+-------------+-----------+-------------------+--------------------------+------------+-···
···-+-------------+-----------+----------+---------+----------+
    | cigar_start | cigar_end | perc_cov | perc_id | perc_pos |
···-+-------------+-----------+----------+---------+----------+
    |           1 |        77 |       30 |      79 |       87 |
    |          91 |       167 |       38 |      79 |       87 |
···-+-------------+-----------+----------+---------+----------+
refer to the two homologue sequences defined by the homology.homology_id 296648. The gene corresponding to the first sequence can be retrieved using the member.member_id 685 and the corresponding peptide using the member.member_id 722. The gene and peptide sequences of the second homologue can be retrieved in the same way.
N.B. At the moment the peptide_align_feature_ids are NULL: they are only used indirectly in the homology prediction, and the column is kept for compatibility with the old BRH system.
Contains the sitewise dN/dS values for the codon alignments.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
sitewise_id | int(10) unsigned | NO | PRI | NULL | auto_increment | internal unique ID |
aln_position | int(10) unsigned | NO | MUL | alignment position of the codon/amino acid | ||
node_id | int(10) unsigned | NO | MUL | root node_id for the subtree used in the analysis. External reference to protein_tree_node.node_id | ||
tree_node_id | int(10) unsigned | NO | root node_id for the tree. External reference to protein_tree_node.node_id | |||
omega | float(10,5) | YES | NULL | Omega (dn/ds) value | ||
omega_lower | float(10,5) | YES | NULL | Lower bound of support interval (confidence interval) for omega. | ||
omega_upper | float(10,5) | YES | NULL | Upper bound of support interval (confidence interval) for omega. | ||
threshold_on_branch_ds | float(10,5) | YES | NULL | Synonymous length used in the analysis at which a branch is considered to be subject to saturation of synonymous mutation. | ||
type | varchar(10) | NO | describes the type of site found, e.g. positive3 |
E.g. the row
mysql> select * from sitewise_aln where sitewise_id=1;
+-------------+--------------+---------+--------------+-...
| sitewise_id | aln_position | node_id | tree_node_id |
+-------------+--------------+---------+--------------+-...
|           1 |           31 |    1827 |         1827 |
+-------------+--------------+---------+--------------+-...
...+---------+-------------+-------------+------------------------+-----------+
   | omega   | omega_lower | omega_upper | threshold_on_branch_ds | type      |
...+---------+-------------+-------------+------------------------+-----------+
   | 9.23330 |     3.22020 |    23.21050 |                1.50000 | positive3 |
...+---------+-------------+-------------+------------------------+-----------+
defines a positively selected position with omega=9.23330 and a support interval of (3.2, 23.2), calculated using threshold_on_branch_ds=1.5 for the gene tree with root 1827 (in this case the full gene tree rather than a subalignment, since node_id=tree_node_id).
Contains all the group homologies found. There are several family_member entries for each family entry.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
family_id | int(10) unsigned | PRI | NULL | auto_increment | internal unique ID | |
stable_id | varchar(40) | UNI | stable family ID | |||
method_link_species_set_id | int(10) unsigned | 0 | external reference to method_link_species_set.method_link_species_set_id | |||
description | varchar(255) | YES | MUL | NULL | description of the family as found using the Longest Common String (LCS) of the descriptions of the member proteins. | |
description_score | double | YES | NULL | Scores the accuracy of the annotation (max. 100) |
E.g. the row
mysql> select * from family where family_id=4000;
+-----------+-------------------+----------------------------+-...
| family_id | stable_id         | method_link_species_set_id |
+-----------+-------------------+----------------------------+-...
|      4000 | fam51v00000004000 |                      30017 |
+-----------+-------------------+----------------------------+-...
...+--------------------------------------------------------------------+-------------------+
   | description                                                        | description_score |
...+--------------------------------------------------------------------+-------------------+
   | NUCLEAR PORE COMPLEX NUP205 NUCLEOPORIN NUP205.205 KDA NUCLEOPORIN |               100 |
...+--------------------------------------------------------------------+-------------------+
defines a family whose stable ID is fam51v00000004000 and whose description, "NUCLEAR PORE COMPLEX NUP205 NUCLEOPORIN NUP205.205 KDA NUCLEOPORIN", has a score of 100.
Contains the proteins corresponding to each protein family relationship found. There are several family_member entries for each family entry.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
family_id | int(10) unsigned | PRI | 0 | external reference to family.family_id | ||
member_id | int(10) unsigned | PRI | 0 | external reference to member.member_id | ||
cigar_line | mediumtext | YES | NULL | internal description of the multiple alignment (see homology_member table) |
E.g. the rows
mysql> SELECT * FROM family_member WHERE family_id = 13252;
+-----------+-----------+------------+
| family_id | member_id | cigar_line |
+-----------+-----------+------------+
|     13252 |     69013 | NULL       |
|     13252 |     69028 | 26D348M    |
|     13252 |    217503 | NULL       |
|     13252 |    217511 | 374M       |
|     13252 |    823691 | 26D348M    |
+-----------+-----------+------------+
refer to the five members of the protein family 13252. The proteins can be retrieved using the member_ids. The multiple alignment can be restored using the cigar_lines.
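Since every cigar_line of a family describes one row of the same multiple alignment, all non-NULL cigar lines should expand to the same number of columns. The Python sketch below checks this for the example rows. The function name cigar_columns is hypothetical, and the cigar syntax (only M and D, with an optional count) is assumed from the examples in this document.

```python
import re


def cigar_columns(cigar_line):
    """Return the number of alignment columns described by a cigar line."""
    return sum(
        int(count) if count else 1  # a bare "M" or "D" counts as one column
        for count, _op in re.findall(r"(\d*)([MD])", cigar_line)
    )


# member_id -> cigar_line, copied from the family_id=13252 example.
family_13252 = {
    69013: None,
    69028: "26D348M",
    217503: None,
    217511: "374M",
    823691: "26D348M",
}

# Members with a NULL cigar_line are skipped.
lengths = {
    member_id: cigar_columns(cigar)
    for member_id, cigar in family_13252.items()
    if cigar is not None
}
print(lengths)
```

Here all three aligned members give 374 columns (26 + 348 = 374), so the cigar lines are mutually consistent.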
Not currently used.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
domain_id | int(10) unsigned | PRI | NULL | auto_increment | internal unique ID | |
stable_id | varchar(40) | |||||
method_link_species_set | int(10) unsigned | 0 | external reference to method_link_species_set.method_link_species_set_id | |||
description | varchar(255) | YES | NULL |
Not currently used.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
domain_id | int(10) unsigned | MUL | 0 | |||
member_id | int(10) unsigned | MUL | 0 | |||
member_start | int(10) | YES | NULL | |||
member_end | int(10) | YES | NULL |