Rosella       Machine Intelligence & Data Mining

Correlation and Link Analysis

Link analysis takes many forms. The most common use of link analysis is the analysis of web-page hypertext links, as described in Web Navigation Pattern Analysis. Another common use is Healthcare Fraud Detection, where it analyzes associations between providers and patients potentially involved in healthcare insurance scams. Other uses of link analysis are described in the following sections.

Correlation analysis and Categorical data

The correlation coefficient is a numerical measure of the linear association between two numerical variables. Correlation analysis is very important in selecting variables for clustering/segmentation and predictive modeling. Correlation coefficients range between -1 and 1: if two variables are perfectly negatively correlated, the coefficient is -1; if they are perfectly positively correlated, it is 1. A value near 0 indicates little or no linear association. Simple coefficient computation reveals linear correlations such as those shown below:

(Linear correlation)
An example of positive correlation; an example of negative correlation
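The simple coefficient computation described above can be sketched in a few lines of Python. This is a minimal illustration of the standard Pearson formula, not CMSR's implementation; the function name `pearson` and the sample data are assumptions made for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Sum of cross-deviations divided by the product of the deviation norms.
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / norm

# Illustrative data only (not from the original page).
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # y = 2x, perfectly positive
falling = [10, 8, 6, 4, 2]   # y = 12 - 2x, perfectly negative

r_pos = pearson(x, rising)
r_neg = pearson(x, falling)
print(r_pos, r_neg)
```

The two results sit at the two ends of the range: the rising series gives a coefficient of 1 and the falling series gives -1, up to floating-point rounding.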

However, correlation of the following type cannot be exposed by simple coefficient computation. CMSR employs advanced techniques to identify such non-linear correlation.

(Non-linear correlation)
An example of non-linear correlation
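To make the limitation concrete, the sketch below builds a U-shaped relationship (y = x squared) for which the linear coefficient is essentially zero, then measures the association with a binned correlation ratio (eta). The correlation ratio is a generic textbook measure swapped in for illustration; the page does not say which techniques CMSR uses, and the function names, bin count, and data here are all assumptions.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient (linear association only)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm = math.sqrt(sum((x - mean_x) ** 2 for x in xs) *
                     sum((y - mean_y) ** 2 for y in ys))
    return cov / norm

def correlation_ratio(xs, ys, n_bins=7):
    """Correlation ratio (eta) of ys given xs cut into equal-width bins.

    Returns a value in [0, 1]: 0 when every bin has the same mean y,
    1 when y is fully determined by the bin that x falls into.
    """
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    if width == 0:
        raise ValueError("xs must not be constant")
    bins = defaultdict(list)
    for x, y in zip(xs, ys):
        idx = min(int((x - lo) / width), n_bins - 1)  # clamp the top edge
        bins[idx].append(y)
    mean_y = sum(ys) / len(ys)
    total = sum((y - mean_y) ** 2 for y in ys)
    if total == 0:
        raise ValueError("ys must not be constant")
    # Between-bin sum of squares: how much of y's spread the bins explain.
    between = sum(len(g) * (sum(g) / len(g) - mean_y) ** 2
                  for g in bins.values())
    return math.sqrt(between / total)

# Illustrative U-shaped data: y depends entirely on x, but not linearly.
x = list(range(-10, 11))
y = [v * v for v in x]

r_linear = pearson(x, y)
eta = correlation_ratio(x, y)
print(r_linear, eta)
```

On this data the linear coefficient is approximately 0 while eta is close to 1, which is the gap the paragraph above describes. The result depends on the chosen number of bins, so a binned measure like this is best read as a screening aid rather than an exact statistic.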

In addition, correlation coefficients cannot be computed directly from categorical variables. Normally, linearization techniques are used.

(Non-linear Categorical Correlation)
An example of categorical correlation
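One possible reading of "linearization" for a categorical variable is mean encoding: replace each category with the average of the numerical variable within that category, then compute an ordinary correlation coefficient on the encoded values. The sketch below illustrates that idea under this assumption. It is not taken from CMSR, and the names `mean_encode`, `region`, and `sales`, as well as the data, are hypothetical.

```python
import math
from collections import defaultdict

def mean_encode(categories, values):
    """Replace each category label with the mean of `values` in that category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, v in zip(categories, values):
        sums[c] += v
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm = math.sqrt(sum((x - mean_x) ** 2 for x in xs) *
                     sum((y - mean_y) ** 2 for y in ys))
    return cov / norm

# Hypothetical data: a categorical variable and a numerical one.
region = ["north", "north", "south", "south", "east", "east"]
sales = [10.0, 12.0, 30.0, 34.0, 20.0, 20.0]

encoded = mean_encode(region, sales)   # one number per row instead of a label
r = pearson(encoded, sales)
print(encoded, r)
```

Because the encoding is derived from the same rows it is then correlated with, the coefficient tends to be optimistic when there are many categories with few rows each. In practice the category means would usually be estimated on separate data.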

Case study: retail store sales and trend analysis

Correlation analysis is primarily used as a pre-analysis tool for predictive modeling: it helps determine which variables may be used in predictive models. By the same rationale, it can be used to determine which variables may have a bearing on the sales of retail stores. For example, assume retail stores have a variety of information available, such as the geographic and demographic composition of their business areas, drawn from census and GIS data. Correlation link analysis can then determine which variables correlate closely with the stores' sales revenues. This information is then used in developing segmentation for Sales Trend Analysis and Forecasting. A superb tool for sales management!
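The variable-screening step described above can be sketched as ranking candidate variables by the absolute value of their correlation with sales revenue. This is a simplified illustration using the plain linear coefficient only. The variable names (`population`, `median_income`, `distance_to_highway_km`) and all figures are invented for the example and do not come from any real store data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm = math.sqrt(sum((x - mean_x) ** 2 for x in xs) *
                     sum((y - mean_y) ** 2 for y in ys))
    return cov / norm

# Hypothetical per-store data: one value per store in each list.
sales = [100, 150, 200, 250, 300, 350]
candidates = {
    "population": [10, 16, 19, 26, 30, 36],
    "median_income": [50, 40, 55, 45, 52, 48],
    "distance_to_highway_km": [9, 8, 6, 5, 3, 1],
}

# Rank by strength of association; the sign is kept so that negative
# relationships (e.g. sales falling with distance) remain visible.
ranked = sorted(
    ((name, pearson(column, sales)) for name, column in candidates.items()),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)

for name, r in ranked:
    print(f"{name:24s} {r:+.3f}")
```

Sorting on the absolute value matters here: a strongly negative variable is as informative for segmentation as a strongly positive one. A screen like this only detects linear relationships, so variables with non-linear or categorical links to sales would need the other kinds of analysis discussed earlier.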


Correlation analysis is a feature of CMSR Data Miner. Download is available from Data Mining Software.