genemodel_pipeline



mm9 numbers by annotation 
annotations 21224
annotation is all alternative 657
annotation is all constitutive 11090
merged refseq 105

mm9 numbers by span          
spans 435538
constitutive exons 148495
constitutive junctions 120407
alternative junctions 84604
alternative exon spans 82032


Scripts are packaged in onepipe_template.tgz.
The parameters below are specific to the mm9_f08 build.

Input data for mm9_f08 include UCSC knownGenes.
The scripts are also capable of handling refSeq, ESTs, and mRNAs.

1.) Load spans into database : Each exon and junction of each transcript is assigned to a span.
The combination of chrom, start, end, strand, and type is unique for each span.
An isoform list keeps track of the isoforms that contain the span.
Junctions less than 30 bases in length are not used; the surrounding exons are joined together.
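
A minimal sketch of step 1 in Python, not the packaged scripts themselves. The function names, the in-memory dict standing in for the database, and the half-open (start, end) coordinate convention are assumptions made for illustration; only the unique key, the isoform list, and the 30-base cutoff come from the description above.

```python
from collections import defaultdict

MIN_JUNCTION_LEN = 30  # step 1: junctions shorter than this are not used


def merge_short_junctions(exons, min_len=MIN_JUNCTION_LEN):
    """Join neighbouring exons whose separating junction is shorter than min_len.

    exons: list of (start, end) tuples sorted by start (half-open; an assumption).
    """
    if not exons:
        return []
    merged = [exons[0]]
    for start, end in exons[1:]:
        prev_start, prev_end = merged[-1]
        if start - prev_end < min_len:
            merged[-1] = (prev_start, end)  # absorb the short junction
        else:
            merged.append((start, end))
    return merged


def load_spans(transcripts):
    """Map (chrom, start, end, strand, type) -> set of isoform ids.

    transcripts: iterable of (isoform_id, chrom, strand, exons).
    """
    spans = defaultdict(set)
    for iso_id, chrom, strand, exons in transcripts:
        exons = merge_short_junctions(sorted(exons))
        for start, end in exons:
            spans[(chrom, start, end, strand, "exon")].add(iso_id)
        for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
            spans[(chrom, left_end, right_start, strand, "junction")].add(iso_id)
    return spans
```

Because the key is the full (chrom, start, end, strand, type) tuple, two isoforms that share an exon or junction end up in the same span's isoform list rather than creating a duplicate span.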

2.) Filtering : Remove if found in only one EST. Not used in mm9_f08.

3.) Filtering : Drop bad splice sites, unless constitutive.

Accepted splice sites
/GT/ /AG/
/AT/ /AC/
/AT/ /AG/
/GC/ /AG/
/GA/ /AG/
/GT/ /TG/
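
The splice-site filter of step 3 could look roughly like the sketch below. It assumes each pair in the list is read as the first two and last two bases of the intron on the transcript strand; that reading, and the function names, are assumptions rather than something stated in these notes.

```python
# Pairs copied from the "Accepted splice sites" list above.
ACCEPTED_SPLICE_SITES = {
    ("GT", "AG"),
    ("AT", "AC"),
    ("AT", "AG"),
    ("GC", "AG"),
    ("GA", "AG"),
    ("GT", "TG"),
}


def has_accepted_splice_sites(intron_seq):
    """True if the intron's first and last dinucleotides form an accepted pair."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        return False
    return (seq[:2], seq[-2:]) in ACCEPTED_SPLICE_SITES


def keep_junction(intron_seq, constitutive):
    """Constitutive junctions are kept regardless of their splice sites."""
    return constitutive or has_accepted_splice_sites(intron_seq)
```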

4.) Filtering : Remove exons less than 8 bases in length, unless constitutive.

5.) Stretching : If the end of an exon span that is terminal for an isoform falls within another exonic region,
that end is extended to the next splice site/exon boundary.
The script iterates until no further stretching is required.

6.) Trimming : If the end of an exon span that is terminal for an isoform falls within an intronic region,
that end is trimmed to the next splice site/exon boundary, but only if the boundary is less than 50 bp away.
The script iterates until no further trimming is required.
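
A hedged sketch of a single trimming pass for the right-hand end of a terminal exon. It assumes that "trimming" moves the end inward to the closest known boundary, that boundaries are available as a sorted list of positions, and that the check for "falls within an intronic region" has already been done by the caller. The function name and arguments are hypothetical; the real script also iterates, which is not shown here.

```python
import bisect

MAX_TRIM = 50  # step 6: only trim if the boundary is less than 50 bp away


def trim_right_end(start, end, boundaries, max_trim=MAX_TRIM):
    """Return the (possibly trimmed) right end of a terminal exon.

    boundaries: sorted list of known splice site / exon boundary positions.
    The end is moved to the closest boundary to its left, provided that
    boundary lies inside the exon and is less than max_trim bases away.
    """
    i = bisect.bisect_left(boundaries, end)  # boundaries[:i] are all < end
    if i == 0:
        return end  # no boundary to the left of this end
    candidate = boundaries[i - 1]
    if candidate > start and end - candidate < max_trim:
        return candidate
    return end
```

For example, with boundaries at 200 and 400, an exon (100, 230) would be trimmed to end at 200, while (100, 280) would be left alone because the boundary is 80 bases away.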

7.) Nonoverlap : Exon spans are broken up into exonic regions and exon-exon junction spans (ej).
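
One common way to break overlapping exons into non-overlapping exonic regions is to cut at every exon boundary and keep the pieces covered by at least one exon. The sketch below illustrates that idea only; it is not taken from the pipeline scripts, the function name is hypothetical, and half-open (start, end) coordinates on a single chrom and strand are assumed.

```python
def split_into_regions(exons):
    """Split overlapping exons into non-overlapping exonic regions.

    exons: list of (start, end) tuples on one chrom/strand.
    Returns a sorted list of (start, end) regions.
    """
    boundaries = sorted({pos for start, end in exons for pos in (start, end)})
    regions = []
    for left, right in zip(boundaries, boundaries[1:]):
        # Keep the piece only if some exon fully covers it.
        if any(start <= left and right <= end for start, end in exons):
            regions.append((left, right))
    return regions
```

Adjacent regions that come from the same original exon are the places where an exon-exon junction span (ej) would be recorded.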

8.) No nulls : An error check to make sure every span is assigned to an annotation. 

9.) No duplications : An error check to make sure no annotations are duplicated to other chroms or strands.

10.) Annotations : The splicing graph is walked to group spans into linked annotations. As spans are joined the annotation name is propagated to all linked spans.
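
The grouping in step 10 can be pictured as finding connected components. The sketch below assumes two spans are linked when they share at least one isoform, and it invents placeholder labels ("annot0", "annot1", ...); both the linking rule and the labels are assumptions, since the notes do not spell out how the splicing graph is stored.

```python
from collections import defaultdict, deque


def group_spans(span_isoforms):
    """Assign one annotation label to every set of linked spans.

    span_isoforms: dict mapping span id -> set of isoform ids.
    Returns a dict mapping span id -> annotation label.
    """
    iso_to_spans = defaultdict(set)
    for span, isoforms in span_isoforms.items():
        for iso in isoforms:
            iso_to_spans[iso].add(span)

    annotation = {}
    next_id = 0
    for seed in sorted(span_isoforms):
        if seed in annotation:
            continue
        label = f"annot{next_id}"
        next_id += 1
        annotation[seed] = label
        queue = deque([seed])
        while queue:  # breadth-first walk, propagating the label
            span = queue.popleft()
            for iso in span_isoforms[span]:
                for neighbor in iso_to_spans[iso]:
                    if neighbor not in annotation:
                        annotation[neighbor] = label
                        queue.append(neighbor)
    return annotation
```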

11.) hname : A human-friendly name is assigned to each annotation. It must be unique. Order of preference: refSeq, knownGenes, mRNAs.

12.) Walk and label exons : A splicing graph is created for each annotation. It is walked and exonic positions are labeled 5' to 3'.

13.) Walk and label junctions : Junctions for each annotation are labeled by starting position and ending position.

14.) Walk and label alt_regions : Alt_regions are labeled 5' to 3'.

15.) Bed files : Bed files of the annotations and spans are created for display in the browser.

16.) Hand editing : Some hand editing may be required to prevent merging of refSeqs. This is done by moving txStart or txEnd to separate the annotations.


Walk and Label Example

Span Table

The Span Table is created using non-overlapping regions. These regions are used as labels to describe the exon, intron, or junction. The last two numbers in the name correspond to the start and end positions. For example, the three isoforms below would result in the following spans.

name            constitutive?   alternative region   type of span
gene.0.1.1.ex   yes             0                    Exon
gene.1.2.2.ex   no              1                    Exon
gene.1.3.3.in   no              1                    Intron
gene.1.4.4.ex   no              1                    Exon
gene.0.5.5.ex   yes             0                    Exon
gene.2.6.6.ex   no              2                    Exon
gene.2.7.7.in   no              2                    Intron
gene.0.8.8.ex   yes             0                    Exon
Junctions
gene.1.1.2.ej   no              1                    Exon-Exon Junction
gene.1.1.5.sj   no              1                    Intron Splice Junction
gene.1.2.4.sj   no              1                    Intron Splice Junction
gene.1.4.5.ej   no              1                    Exon-Exon Junction
gene.2.5.6.ej   no              2                    Exon-Exon Junction
gene.2.5.8.sj   no              2                    Intron Splice Junction
gene.2.6.8.sj   no              2                    Intron Splice Junction
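
The span names in the table can be taken apart with a small parser. This sketch assumes the layout annotation.alt_region.start.end.type, where the alt_region field is inferred from the table (it matches the "alternative region" column) and is not stated explicitly in the text. The class and function names are hypothetical.

```python
from typing import NamedTuple

# Suffixes and their meanings as they appear in the Span Table.
SPAN_TYPES = {
    "ex": "Exon",
    "in": "Intron",
    "ej": "Exon-Exon Junction",
    "sj": "Intron Splice Junction",
}


class SpanName(NamedTuple):
    annotation: str
    alt_region: int  # 0 in every constitutive row of the table
    start: int
    end: int
    span_type: str


def parse_span_name(name):
    """Split a span name such as 'gene.1.2.4.sj' into its fields."""
    # rsplit keeps any dots inside the annotation name itself intact.
    annotation, alt_region, start, end, suffix = name.rsplit(".", 4)
    return SpanName(annotation, int(alt_region), int(start), int(end), SPAN_TYPES[suffix])
```

For example, parse_span_name("gene.1.2.4.sj") yields annotation "gene", alternative region 1, start 2, end 4, and type "Intron Splice Junction".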



classifying events