Anomaly/Outlier Detection using Excel and SQL Server Analysis Services

Ahmed Anwar, 16 December 2015

In this article, I’m going to talk about how to use the SQL Server Data Mining Excel Add-in to detect odd patterns in records or datasets.

Anomaly detection (or outlier detection) is the process of identifying unexpected patterns in a dataset. Typically anomalous problems appear in Banks, errors in text or sensor detectors and many other domains. Anomalies may also be referred to as outliers, noise or exceptions.

A practical application of using outlier detection is in Insurance companies. Imagine a salesman in an insurance company who reviews and audits application requests from customers. An experienced salesman can spot any suspicious or incorrect information in applications. For example, say a 20 year old male who earns $150K salary requests insurance for his $1000 car. But another 50 year old applicant requests a plan for an expensive car and also earns $150K salary. The earlier application will be rejected due to data inconsistency.

In this example, it is easy to find correlations for three attributes:

  1. Car market value
  2. Age
  3. Salary

Analyzing and processing a table that contains ten columns with thousands of records is considerably difficult. Machine learning plays a significant role to solve this kind of problem. Machine Learning has been around for decades and many companies, enterprises and governments use algorithms to detect outliers. Data mining algorithms can tremendously help companies understand the needs of their customers and predict the profit for the future using different algorithms. These algorithms were released in SSAS since 2005. The good news is that you don’t need to be an advanced mathematician or a software engineer to use these algorithms.

Let’s see how we can find outliers in a dataset using Excel’s Data Mining Add-on.

Prerequisites

  1. Make sure to install SQL Server Analysis Services on your machine or on your network and you have permission to access to it.
  2. Download the SQL Server Data Mining Excel Add-in

Demo

  1. Go to Codeplex and download AdventureWorksDW2012 Data File.
  2. Attach the database in your SQL Server Engine.
  3. Open Excel, select a blank workbook.
  4. Click on Data tab and select From SQL Server.
    image
  5. Enter the server name and select the Authentication Type and click Next.
  6. Select AdventureWorksDW2012 from the list and select ProspectiveBuyer below. Excel will connect to the Database server and load around 2000 records into the current sheet.
  7. If you installed the Data Mining Excel Add-on, you should see a new tab called Analyze.
  8. Click on the Analyze tab and create a new connection to the SQL Service Analysis Services
    image
  9. Enter the server name that hosts the SSAS instance and select (Default) if you have no Catalogs created
    image
  10. Click OK and Close.
  11. In the same tab click on Highlight Exceptions button.
    image
  12. A dialog appears with a list of all columns added to the current sheet and a few of them are already selected.
    Excel tries to figure out which columns can influence the outcome of the resulting data by quickly scanning the columns and finding patterns. For example, unique keys and names are automatically excluded from the selection. However, Excel has not excluded the Salutation column or the Phone column because it thinks that these columns contain useful information that can be used for predictions. Please note that selecting irrelevant columns might pollute the resulting outcomes, so make sure you select relevant columns that can give correct results. For more information about accurately selecting columns, you can read more about Feature Engineering in Quora. To keep it simple, select the columns as the screen shot below.
    image
  13. Click Run. It should take a few seconds to detect outliers.
  14. After processing is finished, Excel will create a new sheet that contains the selected columns and the number of outliers detected by the algorithm.
    image
  15. Looking at this generated report gives a brief summary about the distribution of the outliers in all records.
  16. Click on the original sheet and scroll down until you find dark yellow highlights.
    clip_image002

A row with dark yellow highlight means that this record is an outlier. Cells highlighted with bright yellow colour means that the value caused the whole row to be an outlier.
For example, Ian Garcia at Row 875 is highlighted. The cell that caused the row to be an outlier is the Birth Date. Take a moment and think of it. Ian is born in 1923 and earns 10,000 per anum but other younger people earn more than Ian’s income, so there must be something wrong with the birth date. Perhaps, for some reason, someone has incorrectly entered the birth date.

Please note that Adventure Works is fictional data and it is not 100% correct. If you have a real dataset, grab it and analyze it. Perhaps, you may see new patterns that no one has ever spotted before.

How does it work?

What you have witnessed here is a simple version of famous algorithm called Clustering algorithm.

This algorithm iterates through all data points to find clusters or groupings in the dataset. The clusters are created in multiple dimensions based on the selected columns. After clusters are created, the algorithm computes the centroids of these clusters.

The algorithm finds the distance between each data point and the nearest cluster centroid, if it is close enough then it is ok, if not then the data point or the record is considered to be an anomaly. Looking at the diagram below, you’ll see an example of four clusters. A data point that does not belong to a cluster or far from the nearest cluster is considered as an exception.

imageBehind the scenes, the add-on builds a query called Data Mining Extensions or DMX and submits it to SSAS. Then SSAS does the heavy work and creates Mining Structure, trains a new Mining Model, creates clusters and then returns back the results to Excel to render the results in the selected workbook as we saw above.