Sometimes we only want to work with part of the data in a DataFrame. In this post, we explore the different ways that Pandas gives us to filter the data we want.
This post is part of my journey to learn Python. You find the code for this post in my PythonFriday repository on GitHub.
Sample data
For this post we create a dictionary with our sample data that we turn into a DataFrame:
1 2 3 4 5 6 7 8 9 |
import pandas as pd projects = { 'title': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'], 'technology': ['C#', 'C#', 'Python', 'Python', 'Java', 'JavaScript', 'Python', 'Java'], 'lines': [1000, 4000, 500, 750, 15000, 2500, 6000, 4500] } df = pd.DataFrame(projects) |
This gives us a DataFrame with these values:
Filter rows by column value
We can ask Pandas for a single column of the DataFrame with this syntax:
1 |
df['lines'] |
On this list we can run Boolean operations, like checking if the lines are larger than 500:
1 |
df['lines'] > 500 |
We can now take this filter and put it between the [] of the DataFrame variable to get all the rows who match our filter criteria:
1 |
df[df['lines'] > 500] |
Column name shortcut
As long as our column names do not collide with a method on the DataFrame instance, we can write df.lines instead of df[‘lines’]:
1 |
df[df.lines > 500] |
Filter by text
When we want to find rows by searching for a specific text, we can use the .str accessor and its methods. To find rows where a cell contains a specific text, we can use the .str.contains() method:
1 |
df[df.technology.str.contains('Java')] |
This gives us the rows containing the word Java in the technology column:
To search for an exact match, we can use the == operator:
1 |
df[df.technology == 'Java'] |
This gives us only the projects that use Java, but not the ones that use JavaScript:
Should we use a column that conflicts with a method on the DataFrame, we can use the longer version to get the same result:
1 |
df[df['technology'] == 'Java'] |
Filter with multiple criteria
We can combine filters with & (and) or with | (or) to filter by multiple criteria at once. While the () are often optional, it may be a great help to understand what parts you compare:
1 2 3 4 5 |
df[( (df.technology.str.contains('Java')) & (df.lines <= 5000) )] |
We can skip the () for the part of the .str accessor, but we must keep the () around the lines filter:
1 |
df[df.technology.str.contains('Java') & (df.lines < 5000)] |
If we remove the () around the lines filter, the filter no longer compares what we expect and we get everything back:
1 2 |
# no () => not the result you expect df[df.technology.str.contains('Java') & df.lines < 5000] |
This behaviour is especially tricky, because on its own the filter for less than 5000 lines works as expected:
1 |
df[df.lines < 5000] |
Therefore, better use some additional ()
and get the expected result as saving two characters and end up with the wrong answer.
The query() method
Another flexible way to filter data is the query() method. Here we can directly use the column names and combine our filters as we like:
1 |
df.query('technology == "Python" and lines > 500') |
This gives us the Python projects with more than 500 lines:
We can also use or to combine the matches of two filters:
1 |
df.query('technology == "Java" or lines > 5000') |
This gives us the projects that have more than 5000 lines or are Java projects:
The filter() method
The filter() method comes last because it offers the least functionality. With filter() we can only search in the index column or the header row:
1 |
df.filter(items=['title', 'lines'], axis=1) |
This gives us the two columns title and lines:
If we set axis = 0, we can search for values in the index column and get the lines 1, 2, and 3:
1 |
df.filter(items=[1, 2, 3], axis=0) |
A great help is the parameter like, that we can use to filter for columns (or indexes) that contain a certain value:
1 |
df.filter(like='t') |
Next
With the different filter options, we can get a subset of our DataFrame exactly matching the data we need. That gives us a lot less work to bring data in shape before we can plot them.
Next week it is time to go on the practical side and create a whole workflow from extracting data, to cleaning it up so that we can create meaningful plots to test a hypothesis.
2 thoughts on “Python Friday #186: Filter Data in Pandas”