Correction:
Thesis Defense Announcement
To The George Mason University Community


Candidate: Philip Goetz
Program: Master of Science in Bioinformatics & Computational Biology

Date:   Thursday, December 1, 2011
Time:   1:00 p.m.
Place:  George Mason University, Prince William campus
	    Discovery Hall Room 224
 
Thesis Chair:  Dr. Iosif Vaisman

Title: "AUTOMATED CONSTRUCTION OF A RANKING SYSTEM FOR AUTOMATIC FUNCTIONAL
GENE ANNOTATION"

  
A copy of the thesis is on reserve in the Johnson Center Library, Fairfax campus.  The thesis will not be read at the meeting, but should be read in advance.
All members of the George Mason University community are invited to attend.

ABSTRACT:
One key method of automatic functional annotation of a gene is finding
BLAST hits to the gene in question that have functional annotations,
choosing the best single hit, and copying the annotation from that hit
if it is of sufficient quality as measured by a p-value or other
criterion.  In the JCVI prokaryote automatic functional annotation
system, the best hit is chosen by looking up categories in a
manually-constructed table stating how reliable the annotation is
depending on who made the annotation, what percent identity the BLAST
hit had, and what percentage of the query gene and the hit gene were
involved in the match.
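
As a rough illustration of the table lookup described above, the following
Python sketch scores each hit by annotation source, percent identity, and
match coverage, then keeps the top-scoring hit if it clears a threshold.
Everything specific here is an assumption made for illustration and is not
taken from the JCVI system: the bin edges, the source labels, the use of the
smaller of the two coverage values, the threshold, and all names.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    annotation: str
    source: str       # who made the annotation (hypothetical labels)
    identity: float   # percent identity of the BLAST hit
    query_cov: float  # percent of the query gene involved in the match
    hit_cov: float    # percent of the hit gene involved in the match


# Hypothetical bin edges; the real table's categories are not given here.
IDENTITY_EDGES = (40.0, 70.0, 90.0)
COVERAGE_EDGES = (50.0, 80.0)

Table = Dict[Tuple[str, int, int], float]


def bin_of(value: float, edges: Tuple[float, ...]) -> int:
    """Return how many edges the value meets or exceeds (its bin index)."""
    return sum(value >= edge for edge in edges)


def reliability(hit: Hit, table: Table) -> float:
    """Look up the reliability of a hit; unknown categories score 0.0."""
    coverage = min(hit.query_cov, hit.hit_cov)  # assumption: use the weaker side
    key = (hit.source,
           bin_of(hit.identity, IDENTITY_EDGES),
           bin_of(coverage, COVERAGE_EDGES))
    return table.get(key, 0.0)


def best_hit(hits: List[Hit], table: Table, min_score: float = 0.5) -> Optional[Hit]:
    """Return the most reliable hit, or None if none is reliable enough."""
    if not hits:
        return None
    top = max(hits, key=lambda h: reliability(h, table))
    return top if reliability(top, table) >= min_score else None
```

The table itself is a plain mapping from (source, identity bin, coverage bin)
to a score, which is the part the abstract says was filled in by hand.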

Constructing this table is labor-intensive, and humans are incapable
of processing enough data to construct it correctly.  I therefore
reduced the data requirements by breaking the table into orthogonal
components, and I developed an iterative method to minimize the
least-squares error of the table on a training set.  I also
constructed a validation set of 50,000 manually-annotated proteins
from JCVI data, and developed a protein name thesaurus and ontology to
make it possible to tell when two names meant the same thing, or when
one name was a more-specific refinement of another name.  Training on
9/10 of the validation set, and testing on the held-out 1/10, showed
an improvement in accuracy from 71.8% to 77.7%.
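
One way to picture the decomposition and iterative fit is an additive model:
each factor (for example source, identity bin, coverage bin) gets its own
small vector of values, and a table cell is approximated by the sum of one
value per factor.  The sketch below fits such a model by repeatedly setting
each factor's values to the mean residual of the training examples at that
level, which is the least-squares optimum for that factor when the others are
held fixed.  This is an assumed reading of "orthogonal components" and a
generic fitting technique (backfitting), not necessarily the method used in
the thesis; all names are hypothetical.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Tuple[int, ...], float]  # (level index per factor, target score)


def predict(components: List[List[float]], levels: Sequence[int]) -> float:
    """Approximate a table cell as the sum of one value per factor."""
    return sum(comp[level] for comp, level in zip(components, levels))


def squared_error(components: List[List[float]], samples: List[Sample]) -> float:
    """Total squared error of the additive model on the samples."""
    return sum((target - predict(components, levels)) ** 2
               for levels, target in samples)


def fit_additive(samples: List[Sample], n_levels: Sequence[int],
                 n_iter: int = 50) -> List[List[float]]:
    """Fit one value per factor level by cyclic least-squares updates."""
    components = [[0.0] * n for n in n_levels]
    for _ in range(n_iter):
        for f, n in enumerate(n_levels):
            sums = [0.0] * n
            counts = [0] * n
            for levels, target in samples:
                # Residual once every other factor's contribution is removed.
                others = predict(components, levels) - components[f][levels[f]]
                sums[levels[f]] += target - others
                counts[levels[f]] += 1
            for level in range(n):
                if counts[level]:  # leave levels with no data unchanged
                    components[f][level] = sums[level] / counts[level]
    return components
```

The appeal of this form is the one the abstract points to: a full table with
one entry per combination of categories needs data for every combination,
while the additive form needs only enough data to estimate each factor level.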
  
###