Scanning The Top 1000 Python Packages Using GuardDog

Ahmed Musaad
Photo by Alex Chumak / Unsplash

Supply chain attacks are a nightmare, both in logistics and software development. As we depend more and more on third-party libraries, frameworks, and packages, the risk of a data breach or system compromise grows.

There are no easy solutions for this problem, as we can't just say no to third-party packages. Such a decision would negatively impact development speed and could potentially put a business at a disadvantage. We can, however, work on preventing such attacks. This post talks about one of the tools that can be used to find malicious indicators in Python packages.

Worldwide software supply chain attacks tracker (updated daily) - Comparitech
Up until the end of 2020, software supply chain attacks might not have been on your radar. But the attack on SolarWinds, as well as those on Log4j, Codecov, and Kaseya (to name a few), have placed these emerging threats at the forefront of many cybersecurity strategies. Designed to cause mass disrup…

Python is a very popular programming language. This year, it was second on GitHub's list of Most Used Programming Languages and fourth on Stack Overflow's list of Most Popular Technologies. The Python Package Index (PyPI), the official package repository for Python, hosts over 400,000 packages and served 19,097,015,960 package downloads last month alone. However, with popularity comes attention, and it's no surprise that Python has had its fair share of supply chain attacks (one example).

There are, of course, many steps that can be taken to prevent, detect, and handle such attacks. Limiting the number of third-party packages used is a good start. Another measure is to ensure that the packages are checked and that security advisories are followed. Moreover, we can scan packages before using them to see if they include any malicious indicators. Here, Datadog's new open-source tool GuardDog comes to the rescue.

GuardDog is a CLI tool that allows you to identify malicious PyPI packages. It runs a set of heuristics on the package source code (through Semgrep rules) and on the package metadata.

I previously wrote a post about Semgrep, so when I saw the blog post about this new tool, I was rather excited to give it a try. I decided to spend some of my time scanning the top 1000 Python packages to see if there is anything interesting.

Install The Tool

The installation is easy. Just run the installation command listed in the repo, and once it is done, you are ready to go. Remember that the tool requires Python 3.10, so you may need to upgrade your Python version if you intend to install it using pip.

pip install guarddog
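Before running the command above, it can help to confirm that the interpreter meets the version requirement. The snippet below is only a small sketch: the 3.10 minimum is the figure mentioned in this post, and the helper name meets_minimum is made up for illustration, so check the repo for the current requirement.

```python
import sys

# Minimum version as stated in this post (an assumption; check the repo).
MIN_VERSION = (3, 10)


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the major.minor version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if meets_minimum():
    print("Python version is new enough for a pip install.")
else:
    print("Python is older than 3.10; upgrade or use the Docker image.")
```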

Find A List of Packages

Next, we need to find the list of the top 1000 packages. Lucky for us, there is a fantastic website that can help (link). I simply selected 'Show 1000' and copied the data. I then did a bit of cleaning and voilà, we have a list of the top 1000 Python packages ready to be analysed by GuardDog.
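The cleaning step could look something like the sketch below. It assumes the copied rows are whitespace-separated and contain a rank, a package name, and a download count; that layout, the sample rows, and the function name extract_package_names are all assumptions for illustration, not the exact data I copied.

```python
import re

# PyPI project names consist of letters, digits, '.', '_' and '-'.
NAME_PATTERN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*$")


def extract_package_names(raw_text):
    """Pull package names out of rows copied from a ranking table.

    Assumes the package name is the first field in each row that is not a
    plain number (rank or download count). Header rows are not handled and
    should be removed beforehand.
    """
    names = []
    seen = set()
    for row in raw_text.splitlines():
        for field in row.split():
            if field.replace(",", "").isdigit():
                continue  # rank or download count, skip it
            if NAME_PATTERN.match(field) and field.lower() not in seen:
                seen.add(field.lower())
                names.append(field)
            break  # only the first non-numeric field is considered
    return names


# Made-up sample rows; the numbers are placeholders, not real statistics.
sample = "1\tboto3\t1,000\n2\turllib3\t900\n\n3 botocore 800\n2 urllib3 900"
print(extract_package_names(sample))
```

Writing the resulting names to libs.txt, one per line, gives the file used in the next step.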

Analyse The Packages

I performed this analysis on my Windows computer, and I encountered a couple of issues when trying to use the tool after installing it with pip. While I could have explored these and sorted them out, I decided to take the easy way and use the Docker image instead. Once the Docker image is downloaded, we can use the following PowerShell loop, which expects libs.txt to contain one package name per line:

 # Scan every package listed in libs.txt (one package name per line)
 ForEach ($line in Get-Content libs.txt) {
 	docker run --rm ghcr.io/datadog/guarddog scan $line
 }
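If you are not on PowerShell, the same loop can be sketched in Python. This version also saves each package's raw output to its own file, which makes the results easier to review later. The Docker command is copied from the loop above; the function names, the results directory, and the one-file-per-package layout are my own assumptions.

```python
import subprocess
from pathlib import Path

IMAGE = "ghcr.io/datadog/guarddog"  # image name taken from the command above


def build_command(package):
    """Build the same docker invocation used in the PowerShell loop."""
    return ["docker", "run", "--rm", IMAGE, "scan", package]


def read_packages(path):
    """Return the non-empty, stripped lines of the package list file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def scan_all(packages, out_dir, run=subprocess.run):
    """Scan each package and save its raw output to <out_dir>/<package>.txt.

    `run` is injectable so the loop can be exercised without Docker.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for package in packages:
        result = run(build_command(package), capture_output=True, text=True)
        (out_dir / f"{package}.txt").write_text(result.stdout, encoding="utf-8")


# Example (requires Docker and a libs.txt file in the current directory):
# scan_all(read_packages("libs.txt"), "results")
```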

The Results

ℹ️
It's important to remember that just because GuardDog finds a suspicious indicator, it doesn't mean that the package is bad. Given the possibility of false positives, you should use the output more as a starting point for your investigation and less as a final, confirmed result.

Out of the 1,000 packages scanned, the tool detected suspicious indicators in 100 packages (10%). The detected indicators were distributed as follows:

Many of the detected indicators appear harmless (lack of package description on PyPI, etc.). I found the 'potentially compromised email domain' detections to be rather interesting, as they tie to a rather nasty supply chain attack.

If the domain behind a maintainer's email address expires and the maintainer doesn't renew it in time, a malicious actor can register the domain and take control of that email address. With the mailbox under their control, they may be able to take over the maintainer's account, which could lead to a compromise of one or more of the packages published by the maintainer.
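To make the idea concrete, here is a minimal sketch of how such a check could be expressed. It is not GuardDog's implementation, which may work differently. It assumes you have already obtained two dates elsewhere (the domain's current registration date from a WHOIS lookup and the package's latest release date from PyPI metadata); both lookups are left out, and the function names are hypothetical.

```python
from datetime import date


def email_domain(address):
    """Return the lower-cased domain part of an email address, or None."""
    if not address or "@" not in address:
        return None
    # Handles plain addresses and the "Name <user@host>" form.
    domain = address.rsplit("@", 1)[1].strip().rstrip(">").lower()
    return domain or None


def domain_looks_reregistered(domain_created, last_release):
    """Flag a domain whose current registration is newer than the last release.

    If the domain was (re)registered after the maintainer last published,
    it may have lapsed and been picked up by someone else. This is only a
    hint for manual review, not proof of compromise.
    """
    return domain_created > last_release


# Made-up dates for illustration only.
print(email_domain("Jane Doe <jane@Example.ORG>"))
print(domain_looks_reregistered(date(2022, 1, 1), date(2015, 1, 1)))
```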

The results for the top 1,000 packages look okay, but that's only about 0.25% of the more than 400,000 packages available on PyPI. I doubt the rest would all produce similar results. There are some research and analysis opportunities, but I'll leave that for someone else.
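As a quick sanity check on the percentages, the arithmetic can be reproduced from the figures quoted in this post: 1,000 packages scanned, 100 flagged, and over 400,000 packages on PyPI. Because 400,000 is a lower bound, the coverage figure below is an upper bound.

```python
SCANNED = 1_000       # packages scanned in this post
FLAGGED = 100         # packages with at least one suspicious indicator
PYPI_TOTAL = 400_000  # "over 400,000" per the post, so a lower bound


def percentage(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100 * part / whole


print(f"Flagged: {percentage(FLAGGED, SCANNED):.1f}% of scanned packages")
print(f"Coverage: at most {percentage(SCANNED, PYPI_TOTAL):.2f}% of PyPI")
```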

Supply chain attacks are a serious threat to most software written nowadays, including software used for critical infrastructure. Having more open-source tools that simplify the process of scanning packages for malicious indicators will undoubtedly help us detect and prevent such attacks. I would like to applaud Datadog for their work on GuardDog and for making it available to everyone.


