This document gives a high-level description of the tables that make up the EnsEMBL core schema. Tables are grouped into logical groups, and the purpose of each table is explained. It is intended to allow people to familiarise themselves with the schema when encountering it for the first time, or when they need to use some tables that they've not used before. Note that while some of the more important columns in some of the tables are discussed, this document makes no attempt to enumerate all of the names, types and contents of every single table. Some concepts which are referred to in the table descriptions are given at the end of this document; these are linked to from the table description where appropriate.

Different tables are populated throughout the gene build process:

Step  Process
0     Create empty schema, populate meta table
1     Load DNA - populates dna, clone, contig, chromosome, assembly tables
2     Analyze DNA (raw computes) - populates genomic feature/analysis tables
3     Build genes - populates exon, transcript and other gene-related tables
4a    Analyze genes - populates protein_feature, xref tables, interpro
4b    ID mapping

This document refers to version 63 of the EnsEMBL core schema.


Concepts

co-ordinates

There are several different co-ordinate systems used in the EnsEMBL database and API. For every co-ordinate system, the fundamental unit is one base. The co-ordinate systems differ in the position that a given base number refers to, that is, in the start position that numbering is relative to. CONTIG co-ordinates, also called 'raw contig' co-ordinates or 'clone fragments', are relative to the first base of the first contig of a clone. Note that the numbering is from 1, i.e. the very first base of the first contig of a clone is numbered 1, not 0. CHROMOSOMAL co-ordinates are relative to the first base of the chromosome. Again, numbering is from 1. The seq_region table can store sequence regions in any of the co-ordinate systems defined in the coord_system table.
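
The 1-based numbering matters when converting between co-ordinate systems, because an off-by-one error is easy to make. The following is a minimal sketch, not EnsEMBL API code: the function name and its parameters are hypothetical, and it assumes a contig that is placed whole on a chromosome at a known start position and orientation.

```python
def contig_to_chromosome(pos, contig_chr_start, contig_length, orientation=1):
    """Map a 1-based contig position to a 1-based chromosomal position.

    Hypothetical helper for illustration only. It assumes the whole contig
    occupies chromosomal bases contig_chr_start .. contig_chr_start +
    contig_length - 1, and that orientation is 1 (forward) or -1 (reverse).
    """
    if not 1 <= pos <= contig_length:
        raise ValueError("position lies outside the contig (numbering starts at 1)")
    if orientation == 1:
        # Contig base 1 lands exactly on contig_chr_start.
        return contig_chr_start + pos - 1
    if orientation == -1:
        # Contig base 1 lands on the last chromosomal base covered by the contig.
        return contig_chr_start + contig_length - pos
    raise ValueError("orientation must be 1 or -1")


# A 500-base contig whose first base sits at chromosomal position 1000.
first_base = contig_to_chromosome(1, 1000, 500)
last_base = contig_to_chromosome(500, 1000, 500)
```

Because both systems count from 1, the forward mapping subtracts 1 exactly once; with 0-based numbering that correction would disappear.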

supercontigs

A supercontig is made up of a group of adjacent or overlapping contigs.

sticky_rank

The sticky_rank differentiates between fragments of the same exon; i.e. for exons that span multiple contigs, all the fragments have the same ID but different sticky_rank values.
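
As an illustration of how fragments sharing an ID might be put back together, here is a minimal sketch. The tuple layout and the function name are hypothetical, and it assumes that ascending sticky_rank corresponds to the order of the fragments along the exon, which the description above does not state explicitly.

```python
from collections import defaultdict


def assemble_exons(fragments):
    """Join exon fragments into one sequence per exon ID.

    fragments is an iterable of (exon_id, sticky_rank, sequence) tuples;
    this layout is an assumption made for the example, not the schema's.
    """
    by_exon = defaultdict(list)
    for exon_id, sticky_rank, sequence in fragments:
        by_exon[exon_id].append((sticky_rank, sequence))
    # Order each exon's fragments by sticky_rank, then concatenate them.
    return {
        exon_id: "".join(seq for _, seq in sorted(parts, key=lambda p: p[0]))
        for exon_id, parts in by_exon.items()
    }


# Exon 7 spans two contigs, so it has two fragments; exon 9 has only one.
exons = assemble_exons([(7, 2, "GGT"), (7, 1, "ATG"), (9, 1, "CCC")])
```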

stable_id

Gene predictions have changed over the various releases of the EnsEMBL databases. To allow the user to track particular gene predictions over changing co-ordinates, each gene-related prediction is given a 'stable identifier'. If a prediction looks similar between two releases, we try to give it the same name, even though it may have changed position and/or had some sequence changes.

cigar_line

The cigar_line allows the compact storage of gapped alignments by storing the maximum extent of the matches together with a text string that encodes the placement of gaps inside the alignment. Colloquially inside EnsEMBL this is called a 'cigar line', and its adoption has shrunk the number of rows in the feature table around 4-fold.
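
To show how a single text string can stand in for several ungapped rows, here is a minimal parsing sketch. The string format is an assumption, since this document does not spell it out: each operation is taken to be an optional count followed by one of the letters M (match), I (insertion) or D (deletion), with a missing count meaning 1. The function names are hypothetical.

```python
import re

# Assumed format: optional count, then one operation letter.
_CIGAR_OP = re.compile(r"(\d*)([MID])")


def parse_cigar(cigar):
    """Split a cigar string into a list of (count, operation) pairs."""
    ops = []
    pos = 0
    for match in _CIGAR_OP.finditer(cigar):
        if match.start() != pos:
            # There is unparsed text between two operations.
            raise ValueError("unexpected characters at offset %d" % pos)
        count = int(match.group(1)) if match.group(1) else 1
        ops.append((count, match.group(2)))
        pos = match.end()
    if pos != len(cigar):
        raise ValueError("unexpected characters at offset %d" % pos)
    return ops


def cigar_totals(cigar):
    """Sum the counts for each operation letter that occurs in the string."""
    totals = {}
    for count, op in parse_cigar(cigar):
        totals[op] = totals.get(op, 0) + count
    return totals


ops = parse_cigar("3M2I5M")      # [(3, 'M'), (2, 'I'), (5, 'M')]
totals = cigar_totals("3M2I5M")  # {'M': 8, 'I': 2}
```

In this example one row holding "3M2I5M" describes what would otherwise need two separate ungapped match rows, which is the kind of saving the paragraph above refers to.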