PhD in Information Technology
Doctoral Dissertation Defense Announcement

Candidate: Juan (Judy) Luo
Bachelor of Science, Wuhan University, 1997
Master of Science, University of Nebraska-Lincoln, 2003

REGRESSION LEARNING IN DECISION GUIDANCE SYSTEMS: MODELS, LANGUAGES AND ALGORITHMS

Monday, April 30, 2012
3:00 PM - 5:00 PM
Nguyen Engineering Building, Room 2901
All are invited to attend.

Committee
Alexander Brodsky, Chair
Larry Kerschberg
Carlotta Domeniconi
Ruixin Yang

Abstract

State-of-the-art research in decision guidance applications aims to build complex systems with predictive capability. This dissertation focuses on a framework, models, languages, and algorithms that integrate machine learning functionality (regression learning) into DGMS applications as a first-class citizen.

A framework called CoReJava (Constraint Optimization Regression in Java) is proposed and developed. It extends the Java programming language with regression learning, that is, the ability to estimate the parameters of a function. CoReJava is unique in that functional forms for regression analysis are expressed as first-class citizens, i.e., as Java programs in which some parameters are not given in advance but are learned from the learning data sets provided as input.

The if-then-else decision structures of the Java language are naturally adopted to represent piecewise functional forms of regression. As a result, minimizing the sum of squared errors becomes an optimization problem whose search space is exponential in the size of the learning set. A combinatorial restructuring algorithm is proposed that guarantees learning optimality and reduces the search space to be polynomial in the size of the learning set, though still exponential in the number of piecewise bounds.
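
As a rough illustration of why the search can be kept polynomial in the size of the learning set, the following minimal sketch fits a one-dimensional, two-piece linear function by trying every split between adjacent sorted data points and keeping the split with the smallest total sum of squared errors. It is not the dissertation's algorithm and is not written in CoReJava; the function names `fit_line` and `fit_two_pieces`, the `min_points` parameter, and the restriction to a single bound are all assumptions made for the example. With one bound there are at most n - 1 candidate splits; with k bounds, the same exhaustive idea would examine on the order of n^k combinations.

```python
from typing import List, Optional, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float, float]:
    """Ordinary least squares for y = a*x + b. Returns (a, b, sse)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # If all x values coincide, fall back to a horizontal line.
    a = sxy / sxx if sxx > 0 else 0.0
    b = mean_y - a * mean_x
    sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse


def fit_two_pieces(
    xs: List[float], ys: List[float], min_points: int = 2
) -> Optional[Tuple[float, float, Tuple[float, float], Tuple[float, float]]]:
    """Try every split of the sorted data; keep the one with least total SSE.

    Returns (total_sse, bound, (a_left, b_left), (a_right, b_right)),
    or None when there are too few points to form two pieces.
    """
    pairs = sorted(zip(xs, ys))
    sx = [p[0] for p in pairs]
    sy = [p[1] for p in pairs]
    best = None
    for i in range(min_points, len(pairs) - min_points + 1):
        if sx[i - 1] == sx[i]:
            continue  # no bound can separate two equal x values
        a_l, b_l, sse_l = fit_line(sx[:i], sy[:i])
        a_r, b_r, sse_r = fit_line(sx[i:], sy[i:])
        total = sse_l + sse_r
        if best is None or total < best[0]:
            bound = (sx[i - 1] + sx[i]) / 2  # midpoint between neighbors
            best = (total, bound, (a_l, b_l), (a_r, b_r))
    return best


if __name__ == "__main__":
    # Synthetic data: y = 2x for x <= 3, y = 20 - x for x >= 4.
    xs = [0, 1, 2, 3, 4, 5, 6, 7]
    ys = [0, 2, 4, 6, 16, 15, 14, 13]
    print(fit_two_pieces(xs, ys))
```

Each candidate split costs a pair of linear-time fits, so the sketch runs in quadratic time in the number of data points; the learned bound is only determined up to the gap between two neighboring points, which is why the midpoint is reported.
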
A Heaviside restructuring algorithm is also proposed. It expresses the piecewise linear regression function in a single unified functional form instead of multiple pieces, which further reduces the search complexity to be polynomial in both the size of the learning set and the number of piecewise bounds, at the cost of a learning outcome that only approximates the optimum.

A multi-step Expectation Maximization based (EM-based) algorithm, EMMPSR, is proposed to solve the piecewise surface regression problem. The steps are: local regression on each data point of the training set together with a small set of its closest neighbors; clustering in the feature vector space formed from the local regressions; regression learning for each individual surface; and classification to determine the boundaries of each individual surface. An EM-based iteration process is introduced in the regression learning phase to improve the learning outcome, in which the cluster identifier of every data point in the training set is reassigned according to the predictive performance of each submodel. Clustering validity indices are applied when the number of piecewise surfaces is not given in advance.

The Relational Database Management System (RDBMS) is also extended with the piecewise regression learning capability. The functional forms are represented as database tables, and the EMMPSR algorithm is implemented as stored procedures. A case study describes the decision optimization process based on the learning outcome of the EMMPSR algorithm. The research is evaluated through experiments and empirical analysis, with results compared against those of related regression learning packages.

A copy of this doctoral dissertation is on reserve at the Johnson Center Library.