So far, I simply printed the Pandas data frame in Jupyter to see what data it contains. That works for small data frames but is rather useless for large ones. Let us find better ways to explore our data in Pandas.
This post is part of my journey to learn Python. You can find the other parts of this series here. You can find the code for this post in my PythonFriday repository on GitHub.
Our data set
For this post we use the diamonds data set from the ggplot2 project. ggplot2 is a data visualisation package for R and comes with some interesting data sets.
We can read the diamonds.csv file into a data frame with this code:
```python
import pandas as pd
import os

data_file = os.path.abspath(os.path.join(os.getcwd(), 'data', 'diamonds.csv'))
df = pd.read_csv(data_file)
```
What columns are in our data frame?
With the info() method we get a first idea about our data frame:
```
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  object
 2   color    53940 non-null  object
 3   clarity  53940 non-null  object
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB
```
We see that our data frame has 53940 entries and 10 columns.
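The "+" in "memory usage: 4.1+ MB" signals that info() only estimates the memory used by the object columns. If we want the exact number, we can ask memory_usage() to inspect the actual string contents with deep=True. A minimal sketch, using a small stand-in frame instead of the diamonds data:

```python
import pandas as pd

# A tiny stand-in frame; the diamonds data frame works the same way
df = pd.DataFrame({
    'cut': ['Ideal', 'Premium', 'Ideal'],
    'price': [326, 327, 334],
})

# Without deep=True, object columns are counted as pointer-sized values;
# deep=True measures the actual strings and is therefore larger
shallow = df.memory_usage()
deep = df.memory_usage(deep=True)
print(deep)
```

The per-column numbers depend on your pandas version and platform, but the deep measurement of a string column is always at least as large as the shallow one.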
Show the shape of our data
If we only care about the numbers for the columns and rows, we can use the shape property:
```
df.shape

(53940, 10)
```
This also tells us that there are 53940 rows and 10 columns.
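Since shape is a plain (rows, columns) tuple, we can unpack it directly into two variables. A small sketch with a stand-in frame:

```python
import pandas as pd

# A tiny stand-in frame with 3 rows and 2 columns
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# shape is a (rows, columns) tuple, so tuple unpacking works
rows, cols = df.shape
print(f"{rows} rows, {cols} columns")  # 3 rows, 2 columns
```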
Show the beginning of our data
With the head() method we get the first 5 rows of our data frame:
```python
df.head()
```
Show the end of our data
With the tail() method we get the last 5 rows of our data frame:
```python
df.tail()
```
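Both head() and tail() accept an optional row count if 5 rows are too many or too few. A sketch with a small stand-in frame:

```python
import pandas as pd

# A small stand-in frame; the diamonds data frame works the same way
df = pd.DataFrame({'price': [326, 327, 334, 335, 336]})

# Pass the number of rows we want instead of the default 5
first_three = df.head(3)
last_two = df.tail(2)
print(first_three)
print(last_two)
```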
Get a random sample of our data
We can use the sample() method to get a single random row of our data frame. If we prefer a larger sample, we can specify how many rows we want:
```python
df.sample(6)
```
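By default sample() draws different rows on every call. If we want a reproducible sample, for example in a notebook we share with others, we can pin the random number generator with the random_state parameter. A sketch with a stand-in frame:

```python
import pandas as pd

# A stand-in frame with 100 rows
df = pd.DataFrame({'price': range(100)})

# The same random_state returns the same rows on every call
first = df.sample(6, random_state=42)
second = df.sample(6, random_state=42)
print(first.index.equals(second.index))  # True
```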
Show the distribution of the data
For the numerical columns we can use the describe() method to learn about the statistical details of their values (like min, max, mean, etc.):
```python
df.describe()
```
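By default describe() skips the text columns. If we pass include='object', it summarises those instead, reporting the count, the number of unique values, the most frequent value, and how often it appears. A sketch with a stand-in frame:

```python
import pandas as pd

# A tiny stand-in frame with one text and one numeric column
df = pd.DataFrame({
    'cut': ['Ideal', 'Premium', 'Ideal', 'Good'],
    'price': [326, 327, 334, 335],
})

# include='object' summarises the text columns:
# count, unique, top (most frequent value) and freq
summary = df.describe(include='object')
print(summary)
```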
Count the values in a column
For the categorical columns, the describe() method is not that useful. We get much better results with the value_counts() method for the categorical columns we are interested in:
```
df['cut'].value_counts()

Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
Name: cut, dtype: int64

df['color'].value_counts()

G    11292
E     9797
F     9542
H     8304
D     6775
I     5422
J     2808
Name: color, dtype: int64

df['clarity'].value_counts()

SI1     13065
VS2     12258
SI2      9194
VS1      8171
VVS2     5066
VVS1     3655
IF       1790
I1        741
Name: clarity, dtype: int64
```
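When the absolute counts are less interesting than the relative shares, value_counts() can normalise the result for us. A sketch with a stand-in column:

```python
import pandas as pd

# A stand-in column: 3 of 4 rows are 'Ideal'
df = pd.DataFrame({'cut': ['Ideal', 'Ideal', 'Ideal', 'Premium']})

# normalize=True turns the counts into fractions of the total
shares = df['cut'].value_counts(normalize=True)
print(shares)
# Ideal      0.75
# Premium    0.25
```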
Next
With the methods described in this post we can get a first understanding of our data set. As you will notice when you run the commands on your machine, the results come back nearly instantly. That speed is a great help when you have been experimenting in your Jupyter notebook and are not sure whether you still have the right data in your df variable.
Next week we look at how Seaborn can give us a visual aid to understand the connections between the different attributes of our data.