Python Friday #96: Access Your Git Repository With PyDriller

Your Git repository is a treasure trove of data about your project. Every commit you make tells a part of the story of your project. Would it not be great to run some analysis on it to see hidden patterns or learn about the dynamics at work?

This post is part of my journey to learn Python. You can find the other parts of this series here. You can find the code for this post in my PythonFriday repository on GitHub.

 

PyDriller

With PyDriller we can access our Git repositories with Python code. We do not need to know the implementation details of Git; instead, we use the high-level API of PyDriller to find the interesting parts.

You can install PyDriller with pip:

    pip install pydriller

Traverse commits

PyDriller allows us to traverse all commits in our repository with the traverse_commits() method of the Repository class. This is a good starting point to learn PyDriller and explore your repository in code.

The Repository class can point to a local copy or a remote address. The Commit object we get back has fields for all the attributes of a Git commit (like author, commit date, hash, and so forth).
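The traversal described above can be sketched like this. It is a minimal sketch, assuming PyDriller 2.x is installed. The repo_path argument, the choice of committer_date for the timestamp, and the helper names format_commit and print_commits are assumptions made for illustration and may differ from the code that produced the output in this post.

```python
from datetime import datetime


def format_commit(date: datetime, commit_hash: str, author_name: str) -> str:
    """Build one output line: date, abbreviated hash, author name."""
    return f"{date} - {commit_hash[:8]} - {author_name}"


def print_commits(repo_path: str) -> None:
    """Print one line for every commit PyDriller finds in the repository."""
    # Imported here so format_commit() stays usable without PyDriller installed.
    from pydriller import Repository

    # repo_path is a placeholder: a path to a local clone or a remote URL.
    for commit in Repository(repo_path).traverse_commits():
        print(format_commit(commit.committer_date, commit.hash, commit.author.name))
```

Calling print_commits(".") from inside a local clone walks the history of that repository.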

This produces an output like this:

2021-10-22 23:30:00+02:00 - c816a116 - Johnny Graber
2021-10-23 23:15:00+02:00 - 4f379c7a - Johnny Graber
2021-10-24 11:18:10+02:00 - 64ec912d - Johnny Graber
2021-10-25 15:37:05+02:00 - 0a9b1109 - Johnny Graber

More interesting is the modified_files field, which gives us a list of the files changed in a commit.
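A loop over modified_files could look like the following sketch. It assumes PyDriller 2.x, where each entry has a filename and a change_type. The helper names describe_change and print_modified_files and the repo_path argument are hypothetical and only serve the example.

```python
def describe_change(filename: str, change_type: str) -> str:
    """Build one output line for a changed file."""
    return f"- {filename} has changed ({change_type})"


def print_modified_files(repo_path: str) -> None:
    """Print every commit hash followed by the files changed in that commit."""
    # Imported here so describe_change() stays usable without PyDriller installed.
    from pydriller import Repository

    # repo_path is a placeholder for the path to your local clone.
    for commit in Repository(repo_path).traverse_commits():
        print(commit.hash[:8])
        for modified_file in commit.modified_files:
            # change_type is an enum member; .name yields text such as "ADD".
            print(describe_change(modified_file.filename,
                                  modified_file.change_type.name))
```

The filename attribute holds only the name of the file, not its full path inside the repository.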

The output lists the changed files and how they changed:

- 5867629bf749_add_created_date_with_default_value_to_.py has changed (ADD)
- cd496010379c_remove_constraints_for_examples.py has changed (ADD)
- dff2171f8b62_create_index_for_employee_last_name_and_.py has changed (ADD)
- e4887a8bedc7_add_index_to_publisher_name.py has changed (ADD)
- book.py has changed (MODIFY)
- employee.py has changed (MODIFY)
- publisher.py has changed (MODIFY)

 

How often did a file change?

The CommitsCount class takes a range of commits and counts how often each file changed in that range.
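One way to call it is sketched below, assuming PyDriller 2.x and its process metrics module. The function names count_file_changes and most_changed are hypothetical, and the path and the two commit hashes are placeholders you need to replace with values from your own repository. The small sorting helper is an addition to show how the result can be ranked.

```python
def count_file_changes(repo_path: str, from_commit: str, to_commit: str) -> dict:
    """Return a dictionary that maps each file path to its number of commits."""
    # Imported here so most_changed() stays usable without PyDriller installed.
    from pydriller.metrics.process.commits_count import CommitsCount

    # from_commit and to_commit are the hashes that delimit the range.
    metric = CommitsCount(path_to_repo=repo_path,
                          from_commit=from_commit,
                          to_commit=to_commit)
    return metric.count()


def most_changed(counts: dict, top: int = 3) -> list:
    """Return the (path, count) pairs with the highest counts first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:top]
```

Passing the result of count_file_changes() to most_changed() puts the most frequently changed files at the top of the list.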

This gives us back a dictionary that maps each file to its number of changes:

{'SQLAlchemy\\Core\\sqlalchemy_core_crud.py': 2,
'SQLAlchemy\\Core\\sqlalchemy_core_filters.py': 3,
'SQLAlchemy\\Core\\sqlalchemy_core_relationships.py': 1,
'SQLAlchemy\\ORM\\sqlalchemy_orm_relationships.py': 1}

We can use this information to plan the next refactoring session. Files that change often tend to require more changes in the future. Therefore, cleaning up those files makes the next changes a lot simpler.

 

How much changed in a file?

With the LinesCount() class we can use a range of commits and get a counter for all added and removed lines per file.
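A possible call is sketched below, again assuming PyDriller 2.x. The function names count_line_changes and net_change are hypothetical, and the path and commit hashes are placeholders for your own values. The net_change helper is an extra that combines both dictionaries into a single number per file.

```python
def count_line_changes(repo_path: str, from_commit: str, to_commit: str) -> tuple:
    """Return two dictionaries: lines added and lines removed per file."""
    # Imported here so net_change() stays usable without PyDriller installed.
    from pydriller.metrics.process.lines_count import LinesCount

    metric = LinesCount(path_to_repo=repo_path,
                        from_commit=from_commit,
                        to_commit=to_commit)
    return metric.count_added(), metric.count_removed()


def net_change(added: dict, removed: dict) -> dict:
    """Return added minus removed lines for every file in either dictionary."""
    paths = set(added) | set(removed)
    return {path: added.get(path, 0) - removed.get(path, 0) for path in paths}
```

A negative value from net_change() means the file shrank over the range of commits.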

As with CommitsCount() we get a dictionary with the file path as key and the number of lines as the value:

Total lines added per file:
{'SQLAlchemy\\Core\\sqlalchemy_core_crud.py': 6,
'SQLAlchemy\\Core\\sqlalchemy_core_filters.py': 223,
'SQLAlchemy\\Core\\sqlalchemy_core_relationships.py': 200,
'SQLAlchemy\\ORM\\sqlalchemy_orm_relationships.py': 65}
Total lines removed per file:
{'SQLAlchemy\\Core\\sqlalchemy_core_crud.py': 136,
'SQLAlchemy\\Core\\sqlalchemy_core_filters.py': 7,
'SQLAlchemy\\Core\\sqlalchemy_core_relationships.py': 0,
'SQLAlchemy\\ORM\\sqlalchemy_orm_relationships.py': 0}

 

More to explore

The official PyDriller documentation describes many more methods for digging deeper into your repository. If you have a question about your repository, you will most likely find a way to answer it there.

 

Conclusion

PyDriller is a great help when you want to look at your Git repository and extract interesting data. Try it if you want to analyse your repository.
