Outliers in Data Mining: A Data Analyst's Guide to Identifying and Handling Aberrant Observations

Hello there! If you've ever felt bewildered dealing with outliers in your data science projects, this guide is for you. We'll explore what constitutes an outlier, methods to detect them, best practices for analyzing them, and the overall impact of handling these deviant observations properly.

Ready? Let's dive in!

What Are Outliers and Why Do They Matter?

As a data analyst, you collect, clean, transform, and apply analytical models to data to drive insights. However, real-world data often contains outliers – observations that lie abnormally far from the norm.

For example, here are browser usage statistics for a website:

Browser    Percentage Use
Chrome     64%
Firefox    28%
Safari     7%
Opera      1%

Now imagine one visitor who still uses the outdated Blackberry browser. Their usage would likely show up as an outlier point in the data.

Such outliers can heavily influence data science initiatives:

  • Skew averages, correlations, and model predictions, leading to poor results
  • Conceal patterns by appearing as part of sparse regions
  • Signal data errors or abnormal events needing investigation

In short, if left undetected, outliers short-circuit the promise of data-driven decision making through inaccurate, misleading results.

This guide equips you with a practical understanding of outliers – including their types, detection methods, and principled ways to analyze them after identification. Let's get started!

Types of Outliers

Broadly, outliers manifest in three ways – global, collective and contextual:

Global Outliers

Global outliers represent single extreme points relative to the overall data. For example, here is purchase data for an e-commerce store:

Histogram with a single global outlier

The $512 outlier sits well apart from the main purchase distribution – it could signal an abnormal event like fraud.

Such outliers arise due to:

  • Data errors – typos, measurement glitches
  • Intentional anomalies – cyber attacks, fraud
  • Natural deviations – genuine yet rare events

They require analyzing related metadata to determine underlying causes.

Collective Outliers

Unlike individual points, collective outliers manifest as groups of related instances behaving differently from the broader data.

For example, this scatter plot shows ratings of songs released by a musician over time:

Scatter plot depicting collective outliers

The two clusters of low-rated songs represent collective outliers. The musician likely experimented with a different genre on those albums, which fans didn't enjoy!

Specialized techniques like clustering algorithms can isolate such collective anomalies. Investigating the shared characteristics, contexts and potential causes for outlier groups poses an intriguing analytical challenge.
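As a minimal sketch of that idea, a density-based clustering algorithm such as scikit-learn's DBSCAN can surface small groups that sit apart from the main mass. The release-year/rating pairs below are made up to mirror the example above, and the eps and min_samples values are illustrative rather than tuned choices:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical (release_year, rating) pairs; the two low-rated groups
# mimic the collective outliers described above.
songs = np.array([
    [2010, 4.5], [2011, 4.4], [2012, 4.6], [2013, 4.5], [2014, 4.7],
    [2015, 2.1], [2015, 2.0], [2015, 2.2],   # experimental album #1
    [2016, 4.6], [2017, 4.5],
    [2018, 1.9], [2018, 2.0], [2018, 1.8],   # experimental album #2
])

# Scale both features so year and rating contribute comparably to distances.
X = StandardScaler().fit_transform(songs)

# DBSCAN groups dense regions; small clusters separated from the main one
# are candidate collective outliers (label -1 marks isolated noise points).
labels = DBSCAN(eps=0.9, min_samples=3).fit_predict(X)
for cluster_id in sorted(set(labels)):
    members = songs[labels == cluster_id]
    print(f"cluster {cluster_id}: {len(members)} songs, "
          f"mean rating {members[:, 1].mean():.2f}")
```

On this toy data the two low-rated groups come out as their own small clusters, which an analyst would then investigate for shared context.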

Contextual Outliers

Contextual outliers represent observations that deviate strongly from a specific subgroup or time window, even if they seem normal relative to the complete dataset.

For example, the chart below depicts website session durations segmented by visitor geography:

Bar chart showing a contextual outlier

Visitors from Australia and India exhibit similar behavior except for the outlying 1,051-second session. Analysis must happen in context – an Australian user spent far longer on the site than typical Australian traffic.
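A minimal pandas sketch of that idea scores each session against its own country's distribution rather than the global one. The data is invented, and the loose 1.5 threshold is only there because the sample is tiny:

```python
import pandas as pd

# Hypothetical session durations (seconds) by visitor geography.
sessions = pd.DataFrame({
    "country":  ["Australia"] * 5 + ["India"] * 5,
    "duration": [120, 135, 110, 128, 1051, 118, 125, 140, 122, 131],
})

# Score each session against its own country's statistics, not the global ones.
grouped = sessions.groupby("country")["duration"]
sessions["z_in_context"] = (
    (sessions["duration"] - grouped.transform("mean")) / grouped.transform("std")
)

# Flag sessions that are extreme within their own context.
print(sessions[sessions["z_in_context"].abs() > 1.5])
```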

With outlier flavors explained, let's now see how to detect them.

Statistical and Machine Learning Detection Methods

Choosing the right detection method depends on data size, dimensions and distributions. Here are some popular techniques:

Z-score Method

The z-score calculates how many standard deviations an observation is from the mean. Extreme values like z > 3 or z < -3 indicate outliers:

z = (x − μ) / σ

where μ is the mean and σ is the standard deviation of the data.

Pros:

  • Simple and fast to compute
  • No dependencies between variables

Cons:

  • Assumes normal distribution

Use the z-score when the dataset is small and the data approximately follows a normal distribution.
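Here is a minimal NumPy sketch of the rule; the purchase amounts are simulated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical purchase amounts: mostly routine values plus one extreme point.
purchases = np.append(rng.normal(loc=28, scale=5, size=200), 512.0)

# z-score: how many standard deviations each point lies from the mean.
z_scores = (purchases - purchases.mean()) / purchases.std()

# Flag anything beyond +/-3 standard deviations as an outlier.
outliers = purchases[np.abs(z_scores) > 3]
print(outliers)   # should contain the 512 purchase
```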

Interquartile Range Rule (IQR)

The IQR rule first computes the Q1 (25th) and Q3 (75th) percentiles. Observations outside [Q1 − 1.5×IQR, Q3 + 1.5×IQR] qualify as outliers.

IQR = Q3 − Q1

Outlier if x < Q1 − 1.5×IQR or x > Q3 + 1.5×IQR

Pros:

  • Distribution independent
  • Resilient against outliers

Cons:

  • Less statistically efficient than the z-score when the data truly is normal

Use for any data distribution with sufficiently large samples.
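A minimal NumPy sketch of the rule, using made-up purchase amounts:

```python
import numpy as np

def iqr_outliers(values):
    """Return values falling outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

purchases = np.array([23, 31, 18, 27, 35, 22, 29, 512, 26, 30], dtype=float)
print(iqr_outliers(purchases))   # [512.]
```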

Additionally, machine learning approaches like isolation forests, local outlier factor and autoencoders provide automated outlier scoring. We won't dive into them here, but I'd be happy to explain how they work in a future post – let me know in the comments!
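For readers who want a head start before that post, here is a quick, hedged taste of what automated scoring can look like with scikit-learn's IsolationForest; the synthetic data and the contamination value are illustrative assumptions, not tuned choices:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Mostly routine two-feature observations plus a few injected anomalies.
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
anomalies = rng.uniform(low=6.0, high=8.0, size=(5, 2))
X = np.vstack([normal, anomalies])

# IsolationForest scores points by how easily random splits isolate them;
# predict() returns -1 for suspected outliers and 1 for inliers.
model = IsolationForest(contamination=0.02, random_state=0).fit(X)
labels = model.predict(X)
print("flagged as outliers:", (labels == -1).sum())
```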

Okay, you now have your outliers detected. But the job doesn't end here…

Analyzing and Handling Outliers Post Identification

Simply eliminating every outlier without investigation risks losing crucial insights. Consider the following prudent practices instead:

Understand the causes

Analyze metrics related to the flagged instances – what conditions or data attributes relate to the outliers? Domain expertise helps greatly in unraveling the reasons.

Quantify impact

Run controlled experiments – how do results change when retaining outliers, removing them, or transforming the variables? Understanding their influence can guide handling.
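As a toy illustration of such an experiment (the purchase amounts and the $100 cap are invented), compare a summary statistic with the outlier retained, removed, and clipped:

```python
import numpy as np

purchases = np.array([23, 31, 18, 27, 35, 22, 29, 512, 26, 30], dtype=float)

retained = purchases.mean()                      # heavily pulled up by 512
removed = purchases[purchases < 100].mean()      # outlier dropped entirely
clipped = np.clip(purchases, None, 100).mean()   # outlier capped at 100

print(f"mean retained: {retained:.1f}")
print(f"mean removed:  {removed:.1f}")
print(f"mean clipped:  {clipped:.1f}")
```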

Apply data transformations

Moderating outlier effects by clipping extreme values or applying logarithmic scaling retains their signal without letting them dominate the analysis.
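A minimal pandas sketch of both transformations; the column name and percentile caps are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"purchase": [23, 31, 18, 27, 35, 22, 29, 512, 26, 30]})

# Clip (winsorize) extreme values at the 1st/99th percentiles so they stay
# in the data but can no longer dominate downstream statistics.
low, high = df["purchase"].quantile([0.01, 0.99])
df["purchase_clipped"] = df["purchase"].clip(low, high)

# Log scaling compresses large values while preserving their ordering.
df["purchase_log"] = np.log1p(df["purchase"])

print(df)
```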

Build robust models

Algorithms like RANSAC and Theil-Sen handle outliers by being less sensitive to extremes. Similarly, tune model hyperparameters appropriately.
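Here is a hedged sketch comparing ordinary least squares against scikit-learn's RANSACRegressor and TheilSenRegressor on synthetic data with a few corrupted points; the robust fits should stay much closer to the true slope of 2:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor, TheilSenRegressor

rng = np.random.default_rng(1)

# Synthetic line y = 2x + 1 with noise, plus a handful of corrupted points.
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X.ravel() + 1 + rng.normal(scale=0.5, size=100)
y[:5] += 60   # corrupt five observations

for model in (LinearRegression(),
              RANSACRegressor(random_state=1),
              TheilSenRegressor(random_state=1)):
    model.fit(X, y)
    # RANSAC exposes its fitted line through estimator_; the others directly.
    fitted = getattr(model, "estimator_", model)
    print(type(model).__name__, "slope:", round(float(fitted.coef_[0]), 2))
```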

Continuously monitor

Tracking outlier metrics over time, rather than relying on one-off detection, helps identify developing issues early enough for course correction.
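One way to operationalize this (the daily batching and the 2% alert threshold are assumptions) is to track the share of flagged points per time window and alert when it drifts:

```python
import numpy as np
import pandas as pd

def outlier_rate(values):
    """Share of values outside the IQR fences for one time window."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return mask.mean()

# Hypothetical daily session durations, keyed by date; the last day
# contains a burst of unusually long sessions.
rng = np.random.default_rng(2)
daily = {
    "2024-01-01": rng.normal(300, 40, 500),
    "2024-01-02": rng.normal(300, 40, 500),
    "2024-01-03": np.append(rng.normal(300, 40, 480), rng.normal(1050, 30, 20)),
}

rates = pd.Series({day: outlier_rate(vals) for day, vals in daily.items()})
print(rates)
print("alert on:", rates[rates > 0.02].index.tolist())   # assumed 2% threshold
```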

In summary, don't rush to expunge or ignore every aberrant observation without attempting to extract useful intelligence. Paired with contextual understanding of the data and business, outliers contain valuable signals about uncertainties that most standard tools gloss over!

Why Care About Outliers At All?

If outliers cause such analytical headaches, why care about handling them carefully versus just removing them? Valid question!

Proper outlier detection and analysis upholds the integrity of data science results in the following ways:

No more misleading averages

Dropping outliers may artificially lower or raise averages, giving misleading perceptions of typical behavior. Capturing their niche existence instead paints an unbiased picture.

Understand natural data variability

Genuine outliers represent tail events and extremely low-probability occurrences that no amount of modeling can reliably predict. Understanding their impact informs buffer planning for such uncertainties.

Improved model generalization

Machine learning models trained on clean, outlier-screened data learn the actual signal better and generalize more robustly to unseen data.

Earlier anomaly detection

Analyzing known classes of outliers in streaming data can help detect new anomalies like fraud sooner, before their impacts compound.

In conclusion, outlier detection and analysis serves a crucial role in ensuring data science lives up to its promise!

So there you have it – a comprehensive walk-through on identifying and deriving value from outliers, whether they appear as global points, collective groups, or contextual deviations. I hope you're now well equipped to handle those pesky outliers hindering your data science success!

Did you find this guide helpful? I welcome your thoughts and feedback in comments below!
