Python Friday #186: Filter Data in Pandas

Sometimes we only want to work with part of the data in a DataFrame. In this post, we explore the different ways that Pandas gives us to filter the data we want.

This post is part of my journey to learn Python. You find the code for this post in my PythonFriday repository on GitHub.

Sample data

For this post we create a dictionary with our sample data that we turn into a DataFrame:

import pandas as pd

projects = {
    'title': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
    'technology': ['C#', 'C#', 'Python', 'Python', 'Java', 'JavaScript', 'Python', 'Java'],
    'lines': [1000, 4000, 500, 750, 15000, 2500, 6000, 4500]
}

df = pd.DataFrame(projects)

import pandas as pd

projects = {

'title': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],

'technology': ['C#', 'C#', 'Python', 'Python', 'Java', 'JavaScript', 'Python', 'Java'],

'lines': [1000, 4000, 500, 750, 15000, 2500, 6000, 4500]

}

df = pd.DataFrame(projects)

This gives us a DataFrame with these values:
This gives us projects a to f and shows their title, technology, and number of lines.

Filter rows by column value

We can ask Pandas for a single column of the DataFrame with this syntax:

df['lines']

1	df['lines']

We only get the column lines back from our DataFrame.

On this list we can run Boolean operations, like checking if the lines are larger than 500:

df['lines'] > 500

1	df['lines'] > 500

We get a list with True or False depending on the number of lines being larger than 500

We can now take this filter and put it between the [] of the DataFrame variable to get all the rows who match our filter criteria:

df[df['lines'] > 500]

1	df[df['lines'] > 500]

We only get the rows back that are larger than 500 lines.

Column name shortcut

As long as our column names do not collide with a method on the DataFrame instance, we can write df.lines instead of df[‘lines’]:

df[df.lines > 500]

1	df[df.lines > 500]

We again only get the rows back that are larger than 500 lines.

Filter by text

When we want to find rows by searching for a specific text, we can use the .str accessor and its methods. To find rows where a cell contains a specific text, we can use the .str.contains() method:

df[df.technology.str.contains('Java')]

1	df[df.technology.str.contains('Java')]

This gives us the rows containing the word Java in the technology column:
We get all rows that have Java or JavaScript as the value for technology.

To search for an exact match, we can use the == operator:

df[df.technology == 'Java']

1	df[df.technology == 'Java']

This gives us only the projects that use Java, but not the ones that use JavaScript:
Only Java projects are found, but not the ones that use JavaScript

Should we use a column that conflicts with a method on the DataFrame, we can use the longer version to get the same result:

df[df['technology'] == 'Java']

1	df[df['technology'] == 'Java']

Filter with multiple criteria

We can combine filters with & (and) or with | (or) to filter by multiple criteria at once. While the () are often optional, it may be a great help to understand what parts you compare:

df[(
    (df.technology.str.contains('Java')) 
    & 
    (df.lines <= 5000)
)]

df[(

(df.technology.str.contains('Java'))

(df.lines <= 5000)

)]

We get the two projects for Java and JavaScript with less than 5000 lines

We can skip the () for the part of the .str accessor, but we must keep the () around the lines filter:

df[df.technology.str.contains('Java') & (df.lines < 5000)]

1	df[df.technology.str.contains('Java') & (df.lines < 5000)]

We get the two projects for Java and JavaScript with less than 5000 lines

If we remove the () around the lines filter, the filter no longer compares what we expect and we get everything back:

# no () => not the result you expect
df[df.technology.str.contains('Java') & df.lines < 5000]

1 2	# no () => not the result you expect df[df.technology.str.contains('Java') & df.lines < 5000]

We now get everything back instead the expected two projects.

This behaviour is especially tricky, because on its own the filter for less than 5000 lines works as expected:

df[df.lines < 5000]

1	df[df.lines < 5000]

We only get projects with less than 5000 lines back.

Therefore, better use some additional () and get the expected result as saving two characters and end up with the wrong answer.

The query() method

Another flexible way to filter data is the query() method. Here we can directly use the column names and combine our filters as we like:

df.query('technology == "Python" and lines > 500')

1	df.query('technology == "Python" and lines > 500')

This gives us the Python projects with more than 500 lines:
The two Python projects with more than 500 lines

We can also use or to combine the matches of two filters:

df.query('technology == "Java" or lines > 5000')

1	df.query('technology == "Java" or lines > 5000')

This gives us the projects that have more than 5000 lines or are Java projects:
The result contains 3 projects (2x Java, 1x Python with more than 5000 lines).

The filter() method

The filter() method comes last because it offers the least functionality. With filter() we can only search in the index column or the header row:

df.filter(items=['title', 'lines'], axis=1)

1	df.filter(items=['title', 'lines'], axis=1)

This gives us the two columns title and lines:

The result contains all rows of the DataFrame, but only the two columns title and lines.

If we set axis = 0, we can search for values in the index column and get the lines 1, 2, and 3:

df.filter(items=[1, 2, 3], axis=0)

1	df.filter(items=[1, 2, 3], axis=0)

We get the whole rows but only for rows with the index 1, 2, and 3.

A great help is the parameter like, that we can use to filter for columns (or indexes) that contain a certain value:

df.filter(like='t')

1	df.filter(like='t')

This gives us a DataFrame with the two columns title and technology, both containing a t.

With the different filter options, we can get a subset of our DataFrame exactly matching the data we need. That gives us a lot less work to bring data in shape before we can plot them.

Next week it is time to go on the practical side and create a whole workflow from extracting data, to cleaning it up so that we can create meaningful plots to test a hypothesis.

2 thoughts on “Python Friday #186: Filter Data in Pandas”

Pingback: Python Friday #184: Performance-Related Magic Commands in Jupyter - Improve & Repeat
Pingback: Python Friday #185: Creating DataFrames in Pandas - Improve & Repeat

Python Friday #186: Filter Data in Pandas

Sample data

Filter rows by column value

Column name shortcut

Filter by text

Filter with multiple criteria

The query() method

The filter() method

Next

Like this:

Related

2 thoughts on “Python Friday #186: Filter Data in Pandas”

Leave a Comment Cancel reply

Sample data

Filter rows by column value

Column name shortcut

Filter by text

Filter with multiple criteria

The query() method

The filter() method

Next

Share this:

Like this:

Related

2 thoughts on “Python Friday #186: Filter Data in Pandas”

Leave a Comment Cancel reply