This paper compares two predictive data-mining (PDM) techniques, one linear (Partial Least Squares, PLS) and one nonlinear (Nonlinear Partial Least Squares, NLPLS), on two distinct data sets: a collinear data set (called "the COL data" in this paper) and a simulated data set (called "the Simulated data" in this paper). Between them, these data sets exhibit a combination of the following characteristics: few predictor variables, many predictor variables, highly collinear variables, highly redundant variables, and the presence of outliers. The nature of each data set is explored and its distinguishing qualities are defined; this step is called data pre-processing and preparation. To a large extent, this pre-processing helps the miner/analyst choose which predictive technique to apply. The central problem is how to reduce the predictor variables to the smallest number that can still fully predict the response variable. PLS, a supervised technique, and NLPLS, which uses neural network functions to map nonlinearity into models, were applied to each data set. Each technique can be used in several different ways; these variants were first applied to each data set, and the best-performing variant of each technique was noted and used for the overall comparison with the other technique on the same data set. The purpose is to identify the technique that performs best for a given type of data set, so that it can be used directly instead of relying on the usual trial-and-error approach. Used effectively, this process will reduce the lead time in building models for prediction or forecasting in business planning. The work in this paper will also be helpful in identifying the most important predictive data-mining performance measures, or model evaluation criteria.
PLS, NLPLS, COL, PDM