guarddog Repo Integrity Mismatch

This check closely resembles what we attempted a few years ago: comparing the (Python) files in a PyPI package with the corresponding files in the source code repo. In more detail, we tried to identify the individual lines that differ and checked whether they contain suspicious Python calls.

However, my takeaway from our experiments was that there are many differences, which makes such checks very noisy.

From the paper LastPyMile: identifying the discrepancy between sources and packages: "Figure 5 shows that 65% of artifacts and 22% of files present in PyPI have changes with respect to the source code repository."

Would it be possible to share your feedback on the check's precision?

Cheers, Henrik

PS: You can find the PDF also on Google Scholar.

May 11 '23 09:05 henrikplate

Hello, thanks for the great question!

We did find the check noisy at first, which is why we only take into account more opinionated use-cases:

- Exclude some file extensions: https://github.com/DataDog/guarddog/blob/main/guarddog/analyzer/metadata/pypi/repository_integrity_mismatch.py#L133
- Only flag files that are on GitHub and in the package tarball but don't have the same hash
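
Roughly, those two filters can be sketched as follows. This is a minimal illustration, not guarddog's actual code: `EXCLUDED_EXTENSIONS`, `mismatched_files`, and the in-memory `dict` inputs are hypothetical names, the extension list is a placeholder for the real one in the linked file, and SHA-256 is an assumed hash function.

```python
import hashlib
from pathlib import PurePosixPath

# Hypothetical placeholder; the real exclusion list is in the linked guarddog file.
EXCLUDED_EXTENSIONS = {".md", ".rst", ".txt"}


def sha256(data: bytes) -> str:
    """Hex digest of the given bytes (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(data).hexdigest()


def mismatched_files(package_files: dict, repo_files: dict) -> list:
    """Return paths present on BOTH sides whose contents hash differently.

    Both arguments map a relative path (str) to the file contents (bytes).
    Files that exist on only one side are ignored on purpose.
    """
    mismatches = []
    # Only consider paths that exist in the package and in the repo.
    for path in sorted(package_files.keys() & repo_files.keys()):
        # Skip extensions that are known to differ for benign reasons.
        if PurePosixPath(path).suffix.lower() in EXCLUDED_EXTENSIONS:
            continue
        if sha256(package_files[path]) != sha256(repo_files[path]):
            mismatches.append(path)
    return mismatches
```

With a package containing `a.py`, `b.py`, `README.md`, and `only_pkg.py`, and a repo where only `b.py` and `README.md` differ, this sketch would report just `b.py`: the README is excluded by extension and `only_pkg.py` has no counterpart in the repo.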

@vdeturckheim was the original implementer, in case he wants to give more context. Overall we acknowledge that this check is a heuristic and by no means perfect, but your feedback/thoughts are welcome!

May 11 '23 10:05 christophetd

It would make sense to combine the checks you already have to further reduce noise. For example, you could run your semgrep rules (maybe even more relaxed ones) only on those files that differ between package and repo. Using the line number info from the semgrep results, you could then keep only those findings that concern code existing only in the package.
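
To make the idea concrete, here is a rough sketch using Python's standard `difflib`. It is only an illustration under assumptions: `package_only_lines` and `filter_findings` are hypothetical helpers, and a finding is simplified to a dict with `start_line` and `end_line` keys, which is not necessarily the shape of the actual semgrep output.

```python
import difflib


def package_only_lines(repo_src: str, package_src: str) -> set:
    """1-based line numbers of package lines that were added or changed
    relative to the repo version of the same file."""
    repo_lines = repo_src.splitlines()
    pkg_lines = package_src.splitlines()
    matcher = difflib.SequenceMatcher(a=repo_lines, b=pkg_lines, autojunk=False)
    only_in_package = set()
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" cover package lines with no equal repo line.
        if tag in ("insert", "replace"):
            only_in_package.update(range(j1 + 1, j2 + 1))
    return only_in_package


def filter_findings(findings: list, repo_src: str, package_src: str) -> list:
    """Keep only findings that overlap lines existing only in the package.

    Each finding is a dict with hypothetical 'start_line' and 'end_line'
    keys (1-based, inclusive).
    """
    changed = package_only_lines(repo_src, package_src)
    return [
        f for f in findings
        if any(n in changed for n in range(f["start_line"], f["end_line"] + 1))
    ]
```

For example, if the package version of a file adds a single `os.system(...)` call on line 2, a finding on line 2 is kept while a finding on an unchanged line 3 is dropped.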

May 11 '23 15:05 henrikplate