LISTSERV - BIOSCIENCES-L Archives

The speaker will be Dr. Elizabeth Hohman (NSWCDD)

TITLE: Statistical Methods in Text Analysis

ABSTRACT: This talk is structured as a mini-tutorial on text
analysis using the R programming language and environment. We use
PubMed to download an example corpus and perform the parsing,
classification, and clustering in R. Instead of using R text packages
such as tm, we represent the documents as a matrix and apply some
standard classification and clustering techniques. All code is
included in the slides and can be run on your own PubMed download. The
focus is on understanding the math behind the techniques, not on
efficiency. After understanding basics such as the TF-IDF (term
frequency-inverse document frequency) representation of a corpus, one
is better prepared to use the available text mining packages.
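
For readers who want a feel for the matrix representation mentioned
above before the talk, here is a minimal sketch in base R. It is not
taken from the speaker's slides: the three toy documents stand in for
a PubMed download, the variable names are illustrative, and the
weighting idf = log(N / df) is only one common TF-IDF variant, assumed
here for simplicity.

```r
# Toy corpus standing in for PubMed abstracts (made-up text, not real data).
docs <- c(
  d1 = "gene expression in tumor cells",
  d2 = "tumor growth and gene mutation",
  d3 = "clustering of text documents"
)

# Parse: lowercase each document, split on non-letters, drop empty tokens.
tokens <- lapply(docs, function(d) {
  w <- strsplit(tolower(d), "[^a-z]+")[[1]]
  w[nzchar(w)]
})

# Vocabulary: every distinct term in the corpus, sorted.
vocab <- sort(unique(unlist(tokens)))

# Term-frequency matrix: one row per document, one column per term.
tf <- t(vapply(
  tokens,
  function(w) as.numeric(table(factor(w, levels = vocab))),
  numeric(length(vocab))
))
colnames(tf) <- vocab

# Document frequency: number of documents containing each term.
df <- colSums(tf > 0)

# Inverse document frequency (assumed variant: log(N / df)).
idf <- log(nrow(tf) / df)

# TF-IDF: scale each term column by its IDF weight.
tfidf <- sweep(tf, 2, idf, `*`)

# Cosine similarity between documents, a typical input to clustering.
norms  <- sqrt(rowSums(tfidf^2))
unit   <- tfidf / norms          # divides row i by norms[i]
cosine <- unit %*% t(unit)

print(round(cosine, 3))
```

In this sketch d1 and d2 share the terms "gene" and "tumor" and so
have a positive cosine similarity, while d3 shares no terms with
either and has similarity zero. A matrix like cosine (or 1 - cosine as
a distance) is the kind of object one could pass to standard
clustering or classification routines.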
