In this article, I’m going to talk about how to use the SQL Server Data Mining Excel Add-in to detect odd patterns in records or datasets.
Anomaly detection (or outlier detection) is the process of identifying unexpected patterns in a dataset. Typically anomalous problems appear in Banks, errors in text or sensor detectors and many other domains. Anomalies may also be referred to as outliers, noise or exceptions.
A practical application of using outlier detection is in Insurance companies. Imagine a salesman in an insurance company who reviews and audits application requests from customers. An experienced salesman can spot any suspicious or incorrect information in applications. For example, say a 20 year old male who earns $150K salary requests insurance for his $1000 car. But another 50 year old applicant requests a plan for an expensive car and also earns $150K salary. The earlier application will be rejected due to data inconsistency.
In this example, it is easy to find correlations for three attributes:
Analyzing and processing a table that contains ten columns with thousands of records is considerably difficult. Machine learning plays a significant role to solve this kind of problem. Machine Learning has been around for decades and many companies, enterprises and governments use algorithms to detect outliers. Data mining algorithms can tremendously help companies understand the needs of their customers and predict the profit for the future using different algorithms. These algorithms were released in SSAS since 2005. The good news is that you don’t need to be an advanced mathematician or a software engineer to use these algorithms.
Let’s see how we can find outliers in a dataset using Excel’s Data Mining Add-on.
A row with dark yellow highlight means that this record is an outlier. Cells highlighted with bright yellow colour means that the value caused the whole row to be an outlier.
For example, Ian Garcia at Row 875 is highlighted. The cell that caused the row to be an outlier is the Birth Date. Take a moment and think of it. Ian is born in 1923 and earns 10,000 per anum but other younger people earn more than Ian’s income, so there must be something wrong with the birth date. Perhaps, for some reason, someone has incorrectly entered the birth date.
Please note that Adventure Works is fictional data and it is not 100% correct. If you have a real dataset, grab it and analyze it. Perhaps, you may see new patterns that no one has ever spotted before.
What you have witnessed here is a simple version of famous algorithm called Clustering algorithm.
This algorithm iterates through all data points to find clusters or groupings in the dataset. The clusters are created in multiple dimensions based on the selected columns. After clusters are created, the algorithm computes the centroids of these clusters.
The algorithm finds the distance between each data point and the nearest cluster centroid, if it is close enough then it is ok, if not then the data point or the record is considered to be an anomaly. Looking at the diagram below, you’ll see an example of four clusters. A data point that does not belong to a cluster or far from the nearest cluster is considered as an exception.
Behind the scenes, the add-on builds a query called Data Mining Extensions or DMX and submits it to SSAS. Then SSAS does the heavy work and creates Mining Structure, trains a new Mining Model, creates clusters and then returns back the results to Excel to render the results in the selected workbook as we saw above.