MS-CS-L Archives

September 2009


Jyh-Ming Lien <[log in to unmask]>
Fri, 18 Sep 2009 13:29:36 -0400
*GRAND Seminar*


From recognizing biological sequences, to identifying search keywords:
A feature generation framework

12:00 noon, Tuesday, September 22, 2009, ENGR 4201


Amarda Shehu


Rezarta Islamaj
Research Fellow
National Center for Biotechnology Information (NCBI)


The set of attributes or features selected to model an entity is very
important for correct classification. In this talk I will present an
integrated process, which I refer to as feature generation. This method
allows the user to construct informative features based on domain
knowledge, and to search a large space of potential features.

I applied this approach to the problem of splice-site prediction and
obtained new predictive models for these biological signals for two
different organisms. These models have achieved significant improvements
in accuracy over existing, state-of-the-art approaches. In each case,
the identified sets of features were used to discover biologically
interesting motifs. They are available to the public through an
easy-to-use website, SplicePort. SplicePort
can be used to predict new splice sites from user-input sequences, and
to browse the whole collection of features for biologically significant

I also applied this approach to the problem of keyword identification
for effective document retrieval. The automatic identification of
"clickable" words in the title and abstract of articles is of central
importance in improving the retrieval quality of the search engine. It
is also important to authors as it increases the chances that their
article will get better visibility. PubMed,
a free Web service provided by the
U.S. National Library of Medicine, provides daily access to over 19
million biomedical citations for millions of users. The current
retrieval algorithm in PubMed finds all the articles that match the
terms in the user query and presents them in reverse chronological
order. I studied PubMed log data for the clickthrough activities of
users after they have issued a query. Linking the query terms to the
clicked articles, I built a novel machine learning model that identifies
"keywords" that are preferred by users to access a particular article.

*Short Bio*

Dr. Rezarta Islamaj received her Ph.D. degree in Computer Science from
the University of Maryland at College Park in 2007. Her research focused
on applying machine learning and data mining approaches to computational
biology problems. Specifically, she worked on the construction, selection, and
discovery of appropriate motifs to model biological signals for accurate
classification and prediction.

Currently, she is a Research Fellow at the Computational Biology Branch,
National Center for Biotechnology Information (NCBI). NCBI is part of
the National Library of Medicine at NIH. Her current research focuses on
understanding user search behaviours when using NCBI databases,
specifically PubMed, in order to improve retrieval quality and

*Jyh-Ming Lien*
Assistant Professor, George Mason University