Exploratory Analysis, Data Visualization, and Summary Statistics

Exploratory Analysis

Welcome to our mini-course on data science and applied machine learning!

In the previous overview, we saw a bird's eye view of the entire machine learning workflow. We saw how the "80/20" of data science includes 5 core steps.

In this overview, we will dive into the first of those core steps: exploratory analysis.

This step should not be confused with data visualization or summary statistics. Those are merely tools... means to an end.

Proper exploratory analysis is about answering questions. It's about extracting enough insights from your dataset to course correct before you get lost in the weeds.

In this guide, we explain which insights to look for. Let's get started.

Why explore your dataset upfront?

The purpose of exploratory analysis is to "get to know" the dataset. Doing so upfront will make the rest of the project much smoother, in 3 main ways:

  1. You’ll gain valuable hints for Data Cleaning (which can make or break your models).
  2. You’ll think of ideas for Feature Engineering (which can take your models from good to great).
  3. You’ll get a "feel" for the dataset, which will help you communicate results and deliver greater impact.

However, exploratory analysis for machine learning should be quick, efficient, and decisive... not long and drawn out!

Don’t skip this step, but don’t get stuck on it either.

You see, there are infinite possible plots, charts, and tables, but you only need a handful to "get to know" the data well enough to work with it.

In this lesson, we'll show you the visualizations that provide the biggest bang for your buck.

Start with Basics

First, you'll want to answer a set of basic questions about the dataset:

  • How many observations do I have?
  • How many features?
  • What are the data types of my features? Are they numeric? Categorical?
  • Do I have a target variable?

[Image: Basic information — know what you're working with.]
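In pandas, these basic questions take only a couple of lines to answer. Here's a minimal sketch using a small toy DataFrame in place of the real-estate dataset (the filename and toy columns are illustrative, not the actual data):

```python
import pandas as pd

# Toy stand-in for the real-estate dataset; in practice you'd load it, e.g.
# df = pd.read_csv('real_estate.csv')
df = pd.DataFrame({
    "tx_price": [295850, 216500, 279900],
    "beds": [1, 1, 1],
    "property_type": ["Apartment / Condo / Townhouse"] * 3,
})

print(df.shape)   # (number of observations, number of features)
print(df.dtypes)  # numeric columns vs. 'object' (categorical) columns
```

`df.shape` answers the first two questions at once, and `df.dtypes` flags which features are numeric and which are categorical.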

Example observations

Then, you'll want to display example observations from the dataset. This will give you a "feel" for the values of each feature, and it's a good way to check if everything makes sense.
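A sketch of how you might display example observations with pandas (again on a toy frame standing in for the real data):

```python
import pandas as pd

# Toy stand-in for the real-estate dataset
df = pd.DataFrame({
    "tx_price": [295850, 216500, 279900, 379900, 340000],
    "sqft": [584, 612, 615, 618, 634],
    "roof": [None, "Composition Shingle", None, None, None],
})

print(df.head())                     # first 5 rows
print(df.sample(5, random_state=0))  # random rows can be a better "feel" check
```

`head()` shows the first rows; `sample()` is worth a look too, since the first rows of a file are sometimes unrepresentative (e.g. sorted data).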

Here's an example from the real-estate dataset:

   tx_price  beds  baths  sqft  year_built  lot_size                  property_type
0    295850     1      1   584        2013         0  Apartment / Condo / Townhouse
1    216500     1      1   612        1965         0  Apartment / Condo / Townhouse
2    279900     1      1   615        1963         0  Apartment / Condo / Townhouse
3    379900     1      1   618        2000     33541  Apartment / Condo / Townhouse
4    340000     1      1   634        1992         0  Apartment / Condo / Townhouse

  exterior_walls                 roof  basement  restaurants  groceries  nightlife  cafes
0    Wood Siding                  NaN       NaN          107          9         30     19
1          Brick  Composition Shingle       1.0          105         15          6     13
2    Wood Siding                  NaN       NaN          183         13         31     30
3    Wood Siding                  NaN       NaN          198          9         38     25
4          Brick                  NaN       NaN          149          7         22     20

   shopping  arts_entertainment  beauty_spas  active_life  median_age  married  college_grad
0        89                   6           47           58        33.0     65.0          84.0
1        87                   2           26           14        39.0     73.0          69.0
2       101                  10           74           62        28.0     15.0          86.0
3       127                  11           72           83        36.0     25.0          91.0
4        83                  10           50           73        37.0     20.0          75.0

   property_tax  insurance  median_school  num_schools  tx_year
0         234.0       81.0            9.0          3.0     2013
1         169.0       51.0            3.0          3.0     2006
2         216.0       74.0            8.0          3.0     2012
3         265.0       92.0            9.0          3.0     2005
4          88.0       30.0            9.0          3.0     2002

The purpose of displaying examples from the dataset is not to perform rigorous analysis. Instead, it's to get a qualitative "feel" for the dataset.

  • Do the columns make sense?
  • Do the values in those columns make sense?
  • Are the values on the right scale?
  • Is missing data going to be a big problem based on a quick eyeball test?

Plot Numerical Distributions

Next, it can be very enlightening to plot the distributions of your numeric features.

Often, a quick and dirty grid of histograms is enough to understand the distributions.
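A single pandas call produces the whole grid. Here's a sketch on synthetic data (the column names mimic the real-estate dataset, but the values are randomly generated for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "tx_price": rng.normal(450_000, 120_000, 500),
    "sqft": rng.normal(1_800, 600, 500),
    "lot_size": rng.exponential(5_000, 500),
    "year_built": rng.integers(1950, 2016, 500),
})

# One histogram per numeric column -- enough to spot skew, outliers,
# and "wannabe indicator variables" at a glance.
axes = df.hist(figsize=(10, 7), bins=30)
plt.tight_layout()
```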

Here are a few things to look out for:

  • Distributions that are unexpected
  • Potential outliers that don't make sense
  • Features that should be binary (i.e. "wannabe indicator variables")
  • Boundaries that don't make sense
  • Potential measurement errors

At this point, you should start making notes about potential fixes you'd like to make. If something looks out of place, such as a potential outlier in one of your features, now's a good time to ask the client/key stakeholder, or to dig a bit deeper.

However, we'll wait until Data Cleaning to make fixes so that we can keep our steps organized.

[Image: Histogram grid]

Plot Categorical Distributions

Categorical features cannot be visualized through histograms. Instead, you can use bar plots.
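A sketch of one common approach: count the classes with `value_counts()` and draw a horizontal bar plot (the toy data here stands in for the real 'exterior_walls' column):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Toy stand-in for the 'exterior_walls' feature
walls = pd.Series(
    ["Wood Siding"] * 80 + ["Brick"] * 55 + ["Stucco"] * 25
    + ["Concrete Block"] * 3 + ["Asbestos shingle"] * 2,
    name="exterior_walls",
)

counts = walls.value_counts()          # observations per class, descending
ax = counts.plot(kind="barh")          # one bar per class
ax.set_xlabel("Number of observations")
```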

In particular, you'll want to look out for sparse classes, which are classes that have a very small number of observations.

By the way, a "class" is simply a unique value for a categorical feature. For example, the following bar plot shows the distribution for a feature called 'exterior_walls'. So Wood Siding, Brick, and Stucco are each classes for that feature.

[Image: Bar plots]

Anyway, back to sparse classes... as you can see, some of the classes for 'exterior_walls' have very short bars. Those are sparse classes.
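Rather than eyeballing bar lengths, you can also list sparse classes programmatically by thresholding the value counts; a sketch (the cutoff of 10 observations is an arbitrary choice for illustration):

```python
import pandas as pd

# Toy stand-in for the 'exterior_walls' feature
walls = pd.Series(
    ["Wood Siding"] * 80 + ["Brick"] * 55 + ["Stucco"] * 25
    + ["Concrete Block"] * 3 + ["Asbestos shingle"] * 2
)

counts = walls.value_counts()
sparse = counts[counts < 10]  # classes with fewer than 10 observations
print(sparse)
```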

They tend to be problematic when building models.

  • In the best case, they don't influence the model much.
  • In the worst case, they can cause the model to overfit.

Therefore, we recommend making a note to combine or reassign some of these classes later. We prefer saving this until Feature Engineering (Lesson 4).

Plot Segmentations

Segmentations are powerful ways to observe the relationship between categorical features and numeric features.

Box plots allow you to do so.
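pandas can draw a segmented box plot directly. A sketch on toy data (column names mimic the lesson's dataset; the values are synthetic):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "tx_price": np.concatenate([
        rng.normal(300_000, 60_000, 200),   # condos/townhomes
        rng.normal(500_000, 90_000, 200),   # single-family homes
    ]),
    "property_type": ["Apartment / Condo / Townhouse"] * 200
                     + ["Single-Family Home"] * 200,
})

# One box per class of 'property_type'; vert=False draws horizontal boxes,
# so the median shows up as the middle vertical bar.
ax = df.boxplot(column="tx_price", by="property_type", vert=False)
```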

Here are a few insights you could draw from the following chart.

  • The median transaction price (middle vertical bar in the box) for Single-Family homes was much higher than that for Apartments / Condos / Townhomes.
  • The min and max transaction prices are comparable between the two classes.
  • In fact, the round-number min ($200k) and max ($800k) suggest possible data truncation...
  • ...which is very important to remember when assessing the generalizability of your models later!

[Image: Box plots]

Study Correlations

Finally, correlations allow you to look at the relationships between pairs of numeric features.

Correlation is a value between -1 and 1 that represents how closely two features move in unison. You don't need to remember the math to calculate them. Just know the following intuition:

  • Positive correlation means that as one feature increases, the other increases. E.g. a child’s age and her height.
  • Negative correlation means that as one feature increases, the other decreases. E.g. hours spent studying and number of parties attended.
  • Correlations near -1 or 1 indicate a strong relationship.
  • Those closer to 0 indicate a weak relationship.
  • 0 indicates no linear relationship.

Correlation heatmaps help you visualize this information. Here's an example (note: all correlations were multiplied by 100):
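Computing the matrix is one call in pandas, and plain matplotlib can render the heatmap (seaborn's `heatmap` is another common choice). A sketch on synthetic data, including the "multiplied by 100" display trick from the note above:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sqft = rng.normal(1_800, 600, 300)
df = pd.DataFrame({
    "sqft": sqft,
    "tx_price": sqft * 250 + rng.normal(0, 40_000, 300),  # correlated with sqft
    "year_built": rng.integers(1950, 2016, 300),          # roughly independent
})

corr = df.corr()
display = (corr * 100).round()  # easier to scan than long decimals

fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
```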

[Image: Correlations heatmap]

In general, you should look out for:

  • Which features are strongly correlated with the target variable?
  • Are there interesting or unexpected strong correlations between other features?

Again, your aim is to gain intuition about the data, which will help you throughout the rest of the workflow.

By the end of your Exploratory Analysis step, you'll have a pretty good understanding of the dataset, some notes for data cleaning, and possibly some ideas for feature engineering.

Additional Resources

Ready to roll up your sleeves and take your skills to the next level? Join the Machine Learning Accelerator today.

Land Ho!

No one had the heart to tell Jerry that all he discovered was the "Bahamas Bashed Potatoes" weekly special...