So far, I simply printed the Pandas data frame in Jupyter to see what data it contains. That works for small data frames but is rather useless for large ones. Let us find better ways to explore our data in Pandas.
This post is part of my journey to learn Python. You can find the other parts of this series here. You can find the code for this post in my PythonFriday repository on GitHub.
Our data set
For this post we use the diamonds data set from the ggplot2 project. ggplot2 is a data visualisation package for R and comes with some interesting data sets.
We can read the diamonds.csv file into a data frame with this code:
```python
import pandas as pd
import os

data_file = os.path.abspath(os.path.join(os.getcwd(), 'data', 'diamonds.csv'))
df = pd.read_csv(data_file)
```
What columns are in our data frame?
With the info() method we get a first idea about our data frame:
```
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  object
 2   color    53940 non-null  object
 3   clarity  53940 non-null  object
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB
```
We see that our data frame has 53940 entries and 10 columns.
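The "+" in "memory usage: 4.1+ MB" signals that info() only estimates the memory used by the object columns. If we want the exact number, we can ask memory_usage() to inspect the actual string contents with deep=True. A minimal sketch, using a small stand-in frame instead of the diamonds data:

```python
import pandas as pd

# A tiny stand-in frame; the diamonds data frame works the same way
df = pd.DataFrame({
    'cut': ['Ideal', 'Premium', 'Ideal'],
    'price': [326, 327, 334],
})

# Without deep=True, object columns are counted as pointer-sized values;
# deep=True measures the actual strings and is therefore larger
shallow = df.memory_usage()
deep = df.memory_usage(deep=True)
print(deep)
```

The per-column numbers depend on your pandas version and platform, but the deep measurement of a string column is always at least as large as the shallow one.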
Show the shape of our data
If we only care about the numbers for the columns and rows, we can use the shape property:
```
df.shape

(53940, 10)
```
This also tells us that there are 53940 rows and 10 columns.
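Since shape is a plain (rows, columns) tuple, we can unpack it directly into two variables. A small sketch with a stand-in frame:

```python
import pandas as pd

# A tiny stand-in frame with 3 rows and 2 columns
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# shape is a (rows, columns) tuple, so tuple unpacking works
rows, cols = df.shape
print(f"{rows} rows, {cols} columns")  # 3 rows, 2 columns
```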
Show the beginning of our data
With the head() method we get the first 5 rows of our data frame:
```python
df.head()
```
Show the end of our data
With the tail() method we get the last 5 rows of our data frame:
```python
df.tail()
```
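Both head() and tail() accept an optional row count if 5 rows are too many or too few. A sketch with a small stand-in frame:

```python
import pandas as pd

# A small stand-in frame; the diamonds data frame works the same way
df = pd.DataFrame({'price': [326, 327, 334, 335, 336]})

# Pass the number of rows we want instead of the default 5
first_three = df.head(3)
last_two = df.tail(2)
print(first_three)
print(last_two)
```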
Get a random sample of our data
We can use the sample() method to get a single random row of our data frame. If we prefer a larger sample, we can specify how many rows we want:
```python
df.sample(6)
```
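By default sample() draws different rows on every call. If we want a reproducible sample, for example in a notebook we share with others, we can pin the random number generator with the random_state parameter. A sketch with a stand-in frame:

```python
import pandas as pd

# A stand-in frame with 100 rows
df = pd.DataFrame({'price': range(100)})

# The same random_state returns the same rows on every call
first = df.sample(6, random_state=42)
second = df.sample(6, random_state=42)
print(first.index.equals(second.index))  # True
```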
Show the distribution of the data
For the numerical columns we can use the describe() method to learn about the statistical details of their values (like min, max, mean, etc.):
```python
df.describe()
```
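By default describe() skips the text columns. If we pass include='object', it summarises those instead, reporting the count, the number of unique values, the most frequent value, and how often it appears. A sketch with a stand-in frame:

```python
import pandas as pd

# A tiny stand-in frame with one text and one numeric column
df = pd.DataFrame({
    'cut': ['Ideal', 'Premium', 'Ideal', 'Good'],
    'price': [326, 327, 334, 335],
})

# include='object' summarises the text columns:
# count, unique, top (most frequent value) and freq
summary = df.describe(include='object')
print(summary)
```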
Count the values in a column
For the categorical columns, the describe() method is not that useful. We get much better results with the value_counts() method for the categorical columns we are interested in:
```
df['cut'].value_counts()

Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
Name: cut, dtype: int64

df['color'].value_counts()

G    11292
E     9797
F     9542
H     8304
D     6775
I     5422
J     2808
Name: color, dtype: int64

df['clarity'].value_counts()

SI1     13065
VS2     12258
SI2      9194
VS1      8171
VVS2     5066
VVS1     3655
IF       1790
I1        741
Name: clarity, dtype: int64
```
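When the absolute counts are less interesting than the relative shares, value_counts() can normalise the result for us. A sketch with a stand-in column:

```python
import pandas as pd

# A stand-in column: 3 of 4 rows are 'Ideal'
df = pd.DataFrame({'cut': ['Ideal', 'Ideal', 'Ideal', 'Premium']})

# normalize=True turns the counts into fractions of the total
shares = df['cut'].value_counts(normalize=True)
print(shares)
# Ideal      0.75
# Premium    0.25
```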
Next
With the methods described in this post we can get a first understanding of our data set. As you will notice when you run the commands on your machine, the results come back nearly instantly. That speed is a great help when you have been experimenting in your Jupyter notebook and are not sure whether you still have the right data in your df variable.
Next week we look at how Seaborn can give us a visual aid to understand the connections between the different attributes of our data.