The speaker will be Dr. Elizabeth Hohman (NSWCDD)

TITLE: Statistical Methods in Text Analysis

ABSTRACT: This talk is structured as a mini-tutorial on text
analysis using the R programming language and environment. We use
PubMed to download an example corpus and perform the parsing,
classification, and clustering in R. Instead of using R text packages
such as tm, we represent the documents as a matrix and apply some
standard classification and clustering techniques. All code is
included in the slides and can be run on your own PubMed download. The
focus is on understanding the math behind the techniques, not on
efficiency. After understanding basics such as the TF-IDF (term
frequency-inverse document frequency) representation of a corpus, one
can be better prepared to use the available text mining packages.
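
As a rough illustration of the matrix representation and the TF-IDF
weighting mentioned in the abstract, here is a minimal sketch. It is not
the speaker's code (that is written in R and included in the slides).
It is written in Python, the three toy documents are made up, the
function names tokenize and tfidf_matrix are hypothetical, and the
weighting tf * log(N / df) is only one common variant of TF-IDF, assumed
here for simplicity.

```python
import math
import re


def tokenize(text):
    """Lowercase a document and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_matrix(docs):
    """Return (vocab, matrix): one row per document, one column per term.

    Assumed weighting: raw term count * log(N / document frequency).
    Other variants (smoothed idf, normalized tf) are also in common use.
    """
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    column = {term: j for j, term in enumerate(vocab)}
    n_docs = len(docs)

    # counts[i][j] = number of times term j occurs in document i
    counts = [[0] * len(vocab) for _ in docs]
    for i, toks in enumerate(tokenized):
        for t in toks:
            counts[i][column[t]] += 1

    # df[j] = number of documents that contain term j at least once
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]

    # Terms that occur in every document get weight log(1) = 0
    matrix = [
        [row[j] * math.log(n_docs / df[j]) for j in range(len(vocab))]
        for row in counts
    ]
    return vocab, matrix


# Toy corpus (made up for illustration; not PubMed data)
docs = [
    "gene expression in tumor cells",
    "tumor growth and gene mutation",
    "clustering of text documents",
]
vocab, matrix = tfidf_matrix(docs)
```

Each row of matrix is a document vector, so standard classification and
clustering methods that operate on numeric matrices can be applied to it
directly.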