Python Friday #180: Explore Your Data With Seaborn

Seaborn is a statistical data visualization library based on Matplotlib. We will use Seaborn to find interesting relationships in our data and turn them into informative graphs. As you will see shortly, this involves less code than if we were to use Matplotlib directly.

This post is part of my journey to learn Python. You can find the other parts of this series here. You can find the code for this post in my PythonFriday repository on GitHub.

 

Installation

We can install Seaborn with pip by running pip install seaborn on the command line.

If you want to deep-dive and get all the optional dependencies for the advanced features in one go, you can run pip install seaborn[stats] instead.
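Once either installation has finished, a quick check like the following sketch can confirm that Seaborn is importable. The alias sns is the usual convention, and the version number it prints depends on your environment.

```python
import seaborn as sns

# Print the installed version to confirm that the import works.
print(sns.__version__)
```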

 

Included data sets

Seaborn comes with a list of built-in data sets. We can find that list with the get_dataset_names() function:
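A minimal sketch of that call could look like this. The wrapper function list_example_datasets() is a hypothetical helper for illustration; only get_dataset_names() comes from Seaborn, and it needs internet access because the list is fetched from the online data repository.

```python
import seaborn as sns


def list_example_datasets():
    """Return the names of Seaborn's example data sets."""
    # get_dataset_names() queries the online seaborn-data repository,
    # so this call needs internet access.
    return sns.get_dataset_names()


# Usage:
# print(list_example_datasets())
```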

['anagrams',
'anscombe',
'attention',
'brain_networks',
'car_crashes',
'diamonds',
'dots',
'dowjones',
'exercise',
'flights',
'fmri',
'geyser',
'glue',
'healthexp',
'iris',
'mpg',
'penguins',
'planets',
'seaice',
'taxis',
'tips',
'titanic']

You can find the matching *.csv files for the data sets on GitHub, should you want to use them outside of Seaborn.

 

Load a data set

We can load the sample data with the load_dataset() function:
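As a sketch, and assuming we want the diamonds data set that the rest of this post works with, loading could look like this. The helper name load_diamonds() is made up for this example.

```python
import pandas as pd
import seaborn as sns


def load_diamonds() -> pd.DataFrame:
    """Load the diamonds example data set as a Pandas data frame."""
    # load_dataset() downloads the CSV file on first use and caches it
    # locally, so the first call needs internet access.
    return sns.load_dataset("diamonds")


# Usage:
# diamonds = load_diamonds()
# print(diamonds.shape)   # (rows, columns)
# print(diamonds.head())  # first five rows
```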

This gives us a Pandas data frame with the same data as we used in the last post.

 

Pair plots

One of the most helpful visualisations in Seaborn is the pair plot. We can use the pairplot() function to create a pairwise plot of all numerical columns in our data frame. For the diamonds data set, it takes the 7 numerical attributes and creates 49 plots (= 7×7) in one figure:

This gives us a graphical overview of the different columns and how they relate to each other:

Each numerical column is plotted against all other numerical columns.

I changed the marker and the line width to better handle the overlap of the data. You can try the default look by skipping the plot_kws parameter.

This kind of plot allows us to quickly confirm or reject assumptions we have about the data. If we expect that a higher carat (the unit of weight for diamonds) means a higher price, we can see that this is only partially true. The highest prices were paid for diamonds in the range of 1-3 carats and not for the diamonds with 4 or more carats:

The highest prices are not paid for the diamonds with the most carats.
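If you prefer numbers over a visual impression, a small Pandas sketch like this one can compare the highest price per carat range. The function name and the range edges are arbitrary choices for this illustration and not part of the original analysis.

```python
import pandas as pd


def max_price_by_carat_range(df: pd.DataFrame, edges=(0, 1, 2, 3, 4, 6)) -> pd.Series:
    """Return the highest price found in each carat range."""
    # pd.cut() assigns every diamond to a half-open range such as (1, 2].
    carat_range = pd.cut(df["carat"], bins=list(edges))
    # observed=True drops ranges that contain no diamonds at all.
    return df.groupby(carat_range, observed=True)["price"].max()
```

Calling max_price_by_carat_range(diamonds) on the loaded data frame gives one maximum price per range, which makes it easy to compare the heaviest diamonds with the mid-range ones.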

We can also use the pair plot to spot data errors. The x value should stand for the length of a diamond. How can it be that there are diamonds with a length of 0 that fetch such high prices?

There are multiple diamonds with a length of 0 that are among the most expensive ones.
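To look at those suspicious rows directly, we could filter the data frame with Pandas. This is only a sketch; the helper name zero_length_diamonds() is made up, and it assumes the column names x and price from the diamonds data set.

```python
import pandas as pd


def zero_length_diamonds(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows with a length (x) of 0, most expensive first."""
    suspicious = df[df["x"] == 0]
    return suspicious.sort_values("price", ascending=False)
```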

 

Scatter plots with 4 attributes

Especially with complex data, it is often helpful to use more than two dimensions in a plot. Since 2-dimensional plots limit us to 2 axes, we can use the colour and the size of our points to convey additional information. We can use the scatterplot() function and not only define the x and y axes, but also the size and the hue (to influence the colour):
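A minimal sketch of such a plot, assuming carat on the x axis and price on the y axis, could look like this. The function name and the figure size are arbitrary choices for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_price_by_carat(df: pd.DataFrame) -> plt.Axes:
    """Scatter plot of price over carat, coloured by cut and sized by clarity."""
    fig, ax = plt.subplots(figsize=(12, 8))  # figure size is an arbitrary choice
    # hue controls the colour of each point, size controls its diameter.
    sns.scatterplot(data=df, x="carat", y="price", hue="cut", size="clarity", ax=ax)
    return ax


# Usage, with the diamonds data frame loaded via sns.load_dataset("diamonds"):
# ax = plot_price_by_carat(diamonds)
# ax.figure.savefig("diamonds_scatter.png")
```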

This creates a colourful picture that gives us a better understanding of how the price of a diamond comes about:

A scatterplot for price and carat with the colour based on the cut and the size of the dots based on the clarity.

The clarity value IF is the best grade, while I1 is the worst (you can find the explanation here). If we check the plot for blue dots (ideal cut) and big dots (high clarity), we see that those attributes are more important for the price than the carat weight.

If you want to explore more details of the diamonds data set, you should read the examples from Kartikay Bagla on Kaggle.

 

Next

With Seaborn we can get a quick overview of large data sets like the diamonds one with its roughly 54,000 rows. Next week we will look at how Seaborn can help us better understand our data.
