Python Friday #181: Making Sense of Your Data With Seaborn

Before we can visualise data well, we first need to make sense of it. In this post we continue with Seaborn and explore the various ways it can help us to better understand what is going on in our data.

This post is part of my journey to learn Python. You can find the other parts of this series here. You can find the code for this post in my PythonFriday repository on GitHub.

Prerequisites

Please make sure that you have a working installation of Seaborn. We continue to use the built-in datasets for our graphs and change between them to highlight different ways to better understand our data.
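If you are not sure whether everything is installed, a quick check like the following can help. It is only a minimal sketch: the helper installed_version() is a name I made up for this example, and it relies on nothing but Python's standard library, so it runs even when Seaborn is missing.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing.

    Hypothetical helper for this post, not part of Seaborn.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the libraries this post relies on.
for name in ("seaborn", "matplotlib", "pandas"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If one of the lines reports "not installed", install that package before you continue.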

What creates these clusters?

The penguins data set offers us various measurements of penguins. If we run the data through the pairplot() method, we get a graph that looks like this one:

import seaborn as sns
penguins = sns.load_dataset("penguins")
sns.pairplot(penguins)

The pairplot for the penguins contains clusters.

In multiple plots we get two clusters, but why? Let us explore the measurements for flipper_length_mm and bill_depth_mm in a scatterplot(). As a first guess we could think that the gender of the penguins is responsible for the two clusters. We can use the sex property as the value for hue (which sets the colour of our points):

sns.scatterplot(x='flipper_length_mm', 
                y='bill_depth_mm', 
                hue='sex', 
                data=penguins)

The two colours for gender are mixed within both clusters

The graph shows that our first idea is wrong. The gender is mixed throughout the two clusters. Let us try the species attribute next:

sns.scatterplot(x='flipper_length_mm', 
                y='bill_depth_mm', 
                hue='species', 
                data=penguins)

One cluster is now exclusively in green for the Gentoo species.

That looks much better. We can see that the Gentoo species explains one of our two clusters perfectly.
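We do not have to trust our eyes alone. A group-wise summary can back up what the colours suggest. The sketch below uses a handful of made-up rows (they are not the real penguins data) and a helper called summarise_by_group() that I invented for this example. It only shows the idea of comparing the mean measurements per species.

```python
import pandas as pd


def summarise_by_group(df, group_col, value_cols):
    """Return the mean of each value column per group, rounded to one decimal."""
    return df.groupby(group_col)[value_cols].mean().round(1)


# Made-up toy rows for illustration only; they are NOT the real penguins data.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "flipper_length_mm": [185.0, 195.0, 190.0, 200.0, 215.0, 225.0],
    "bill_depth_mm": [18.0, 19.0, 18.0, 19.0, 14.0, 16.0],
})

summary = summarise_by_group(toy, "species", ["flipper_length_mm", "bill_depth_mm"])
print(summary)

# With the real data the call would look like this:
# summarise_by_group(penguins, "species", ["flipper_length_mm", "bill_depth_mm"])
```

If one group stands apart in the summary in the same way it does in the plot, we have a second piece of evidence for our explanation.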

With this newly gained insight, we can revisit the pairplot() method and set the hue based on the species:

sns.pairplot(penguins, hue='species')

We now see that the Gentoo species is usually responsible for a cluster.

The different species are clustered together and explain the pattern in our graphs. While we could set the hue parameter right at the beginning, it is often much faster to start with a smaller plot and iterate there until we find a good match.
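The iteration on the smaller plot can also be automated. The following is a minimal sketch under a few assumptions: plot_hue_candidates() is a helper I made up, the data frame contains invented toy rows instead of the real penguins, and the Agg backend is only selected so that the code runs without a display (leave that line out in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; leave this out in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_hue_candidates(df, x, y, candidates):
    """Draw one scatterplot per candidate column, coloured by that column."""
    fig, axes = plt.subplots(1, len(candidates),
                             figsize=(5 * len(candidates), 4),
                             squeeze=False)
    for ax, column in zip(axes[0], candidates):
        sns.scatterplot(x=x, y=y, hue=column, data=df, ax=ax)
        ax.set_title(f"hue='{column}'")
    fig.tight_layout()
    return fig


# Made-up toy rows for illustration only; they are NOT the real penguins data.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "sex": ["Male", "Female", "Male", "Female"],
    "flipper_length_mm": [185.0, 195.0, 215.0, 225.0],
    "bill_depth_mm": [19.0, 18.0, 16.0, 14.0],
})

fig = plot_hue_candidates(toy, "flipper_length_mm", "bill_depth_mm",
                          ["sex", "species"])
```

With the real data we would pass penguins instead of toy and compare the small plots side by side before we run the much slower pairplot().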

Get rid of overlapping data points

If we have many overlapping data points, we cannot see the true amount of data for a given value. We find this problem with the tips data set, which contains the data of tips given to waiters in a restaurant.

tips = sns.load_dataset("tips")

sns.scatterplot(x="day", 
                y="tip", 
                hue="sex", 
                data=tips)

The values overlap each other, and we cannot see much

This chart shows us a line of points per day, but we cannot see which tip amounts were given most often. If we change from the scatterplot() to the swarmplot() method, we can see a lot more detail:

sns.swarmplot(x="day", 
              y="tip", 
              hue="sex", 
              data=tips)

We now have little trees that show us most points without an overlap.

It now looks a bit like a tree, and we can see most of the data points. This graph allows us to understand much better what is going on and what amounts of tips were given, as long as there are not too many points.
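To get a feeling for how much a scatterplot hides, we can count how many rows end up on exactly the same position. This is only a sketch: count_overlaps() is a helper name I invented, and the rows are made-up toy values, not the real tips data.

```python
import pandas as pd


def count_overlaps(df, x, y):
    """Count how many rows share each (x, y) position, largest groups first."""
    counts = df.groupby([x, y], observed=True).size()
    # Keep only the positions that are used by more than one row.
    return counts[counts > 1].sort_values(ascending=False)


# Made-up toy rows for illustration only; they are NOT the real tips data.
toy = pd.DataFrame({
    "day": ["Sat", "Sat", "Sat", "Sun", "Sun", "Sun"],
    "tip": [2.0, 2.0, 2.0, 3.0, 3.0, 5.0],
})

overlaps = count_overlaps(toy, "day", "tip")
print(overlaps)

# With the real data the call would look like this:
# count_overlaps(tips, "day", "tip")
```

Every position in the result is drawn as a single point by scatterplot(), no matter how many rows sit behind it.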

Visualising a table full of values

If we have a table full of values, it is not easy to quickly spot where the highest values are. Seaborn offers us a heatmap() method to solve this problem. We can use the flights data set and create a pivot that gives us the number of passengers per month and year:

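import matplotlib.pyplot as plt

flights_monthly = sns.load_dataset("flights")

# Reshape the long table into a grid with one row per month and one column per year
flights = flights_monthly.pivot(index="month", 
                                columns="year", 
                                values="passengers")

# Create a larger figure so that the annotated numbers stay readable
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(flights, 
            annot=True, 
            fmt="d", 
            linewidths=.25, 
            ax=ax)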

We can see that the most passengers flew in July 1960.

This gives us all the numerical values, and the background colour changes with the number of passengers. With this heatmap it is easy to spot July 1960 as the month with the most passengers.
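What we spot in the heatmap can be cross-checked directly on the pivot table. The sketch below uses a tiny made-up table (not the real flights data) to show how pivot() reshapes the rows and how idxmax() finds the cell with the highest value.

```python
import pandas as pd

# Made-up toy rows for illustration only; they are NOT the real flights data.
toy_monthly = pd.DataFrame({
    "year": [1959, 1959, 1960, 1960],
    "month": ["Jun", "Jul", "Jun", "Jul"],
    "passengers": [10, 20, 15, 40],
})

# Same reshaping as for the heatmap: months as rows, years as columns
table = toy_monthly.pivot(index="month", columns="year", values="passengers")

year = table.max().idxmax()      # column (year) that contains the overall maximum
month = table[year].idxmax()     # row (month) of that maximum within the column
peak = table.loc[month, year]

print(f"Peak: {month} {year} with {peak} passengers")
```

Applied to the flights pivot from above, the same three lines should point to the cell that stands out in the heatmap.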

By changing a few parameters or switching to a more useful visualisation we can leverage Seaborn to figure out what is going on in our data. The examples in this post are by no means a complete list of things you can do. Use these ideas as your starting point and explore the built-in data sets on your own.

Next week we look at the various ways we can style our plots in Seaborn.
