Variable Relevancy Analysis / Factor Analysis for Predictive Modeling

One of the most important steps in predictive modeling is to identify the relevant independent variables, that is, those that are significantly correlated with the dependent (target) variable. Variables with weak correlation should be left out of predictive models, as they tend to increase overfitting and reduce accuracy. Only independent variables that correlate strongly with the dependent variable should be used. CMSR Data Miner provides a correlation analysis tool for this purpose. The following figure shows the tool:

[Figure: Factor Analysis]

The figure shows correlation with the dependent variable "RISKFLAG". The right-hand pane lists independent variables and category items in descending order of correlation strength. "r-value" is the correlation coefficient. "r-square" is the squared coefficient. The higher the absolute value, the more relevant the variable or item. The items and variables listed at the top are good candidates for inclusion as independent variables. The tool is very powerful and easy to use.
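
For readers who want to see the idea behind such a ranking outside the tool, the following is a minimal sketch in Python. It is not CMSR Data Miner's implementation. It assumes that "r-value" is the Pearson correlation coefficient and that each category item is represented as a 0/1 flag column. The column names and data are made up for illustration; only "RISKFLAG" comes from the figure.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship to measure
    return cov / math.sqrt(var_x * var_y)

def rank_by_correlation(columns, target):
    """Return (name, r, r_squared) tuples, strongest |r| first."""
    rows = []
    for name, values in columns.items():
        r = pearson_r(values, target)
        rows.append((name, r, r * r))
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)

# Hypothetical toy data: RISKFLAG target plus three candidate variables.
riskflag = [1, 1, 0, 0, 1, 0, 0, 1]
columns = {
    "late_payments": [5, 4, 1, 0, 6, 1, 2, 5],
    "age": [30, 45, 28, 52, 39, 41, 33, 47],
    "owns_home": [0, 0, 1, 1, 0, 1, 1, 0],  # category item as a 0/1 flag
}

ranking = rank_by_correlation(columns, riskflag)
for name, r, r2 in ranking:
    print(f"{name:15s} r-value={r:+.3f}  r-square={r2:.3f}")
```

Sorting by the absolute value of r matters here: "owns_home" is strongly negatively correlated with the target in this toy data, and it is just as relevant as a strongly positive variable.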

In addition to the correlation analysis tool, categorical bar charts and histograms can be used to identify modeling variables. This process should include data transformation as well. Categorical variables can be transformed into a smaller number of numerical flag variables. Numerical variables can also be transformed into numerical flag variables. These procedures are described fully in the paper "Modeling-Guide-for-Neural-Network.pdf". (Available to CMSR users.)
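
As a rough illustration of what a flag transformation can look like, here is a minimal Python sketch. It is not taken from the paper mentioned above, and the exact procedures described there may differ. The helper names, the variable names, the selected categories and the cutoff are all hypothetical.

```python
def category_flags(values, keep):
    """Build one 0/1 flag column per category listed in `keep`.

    Categories not listed in `keep` get no column of their own, which is
    how a many-valued variable is reduced to a few flags.
    """
    return {f"is_{cat}": [1 if v == cat else 0 for v in values] for cat in keep}

def below_flag(values, cutoff):
    """Build a 0/1 flag column marking values strictly below `cutoff`."""
    return [1 if v < cutoff else 0 for v in values]

# Hypothetical records: one categorical and one numerical variable.
occupation = ["clerk", "student", "clerk", "retired", "manager", "student"]
income = [42000, 8000, 39000, 21000, 88000, 6000]

# Keep flags only for the categories judged relevant (an assumed choice).
flags = category_flags(occupation, keep=["student", "retired"])
# Turn the numerical variable into a single flag (assumed cutoff).
flags["low_income"] = below_flag(income, cutoff=20000)

for name, column in flags.items():
    print(name, column)
```

Which categories to keep and where to place a cutoff would normally be decided from the correlation listing, the bar charts and the histograms, not fixed in advance as they are in this sketch.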