feat(security): Add package name typosquatting detection
Implement typosquatting detection for package names during analysis. Compares package names against a list of popular packages using the Jaro-Winkler similarity algorithm. Packages exceeding a defined threshold of similarity to a popular package are flagged.
Summary
Adds typosquatting detection for package names during analysis using Jaro-Winkler similarity.
Description of changes
This PR introduces a new security analysis feature to detect potential typosquatting in package names. The implementation compares the name of a package being analyzed against a list of popular package names. By default, it uses a predefined list stored in a dedicated file, but it also offers an option to use a custom list provided via a configuration path.
The comparison utilizes the Jaro-Winkler similarity algorithm to calculate a similarity score between the package name and each name in the popular packages list. If the calculated similarity score exceeds a configurable threshold, the package is flagged as a potential typosquat.
This feature helps identify malicious packages that imitate legitimate, popular ones through slight spelling variations, and warns users when a package name looks like a potential typosquat.
The changes include:
- Integration of the Jaro-Winkler similarity algorithm.
- Inclusion of a default file containing a list of popular package names for comparison.
- Addition of a configuration option to provide a custom file path for the popular packages list, overriding the default.
- Implementation of the comparison logic and threshold-based flagging.
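To make the comparison and flagging steps listed above concrete, here is a minimal, self-contained sketch of Jaro-Winkler similarity with threshold-based flagging. It is not the code in this PR: the function names, the `0.95` threshold, the `>=` comparison, and the choice to skip exact matches are assumptions made for illustration only.

```python
from typing import Iterable, Optional


def jaro(s1: str, s2: str) -> float:
    """Return the Jaro similarity of two strings, between 0.0 and 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return score + prefix * scaling * (1 - score)


def find_typosquat_target(
    name: str, popular: Iterable[str], threshold: float = 0.95
) -> Optional[str]:
    """Return the popular package that `name` resembles, or None.

    Assumptions: the threshold value is hypothetical, scores at or above it
    are flagged, and a name that is itself popular is never flagged.
    """
    popular = list(popular)
    if name in popular:
        return None
    for candidate in popular:
        if jaro_winkler(name, candidate) >= threshold:
            return candidate
    return None
```

With these assumptions, `find_typosquat_target("reqeusts", ["numpy", "requests"])` returns `"requests"`, while an unrelated name such as `"flask"` returns `None`.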
Related issues
Checklist
- [x] I have reviewed the contribution guide.
- [x] My PR title and commits follow the Conventional Commits convention.
- [x] My commits include the "Signed-off-by" line.
- [x] I have signed my commits following the instructions provided by GitHub. Note that we run GitHub's commit verification tool to check the commit signatures. A green `verified` label should appear next to all of your commits on GitHub.
- [x] I have tested my changes and verified they work as expected.
Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA). The following contributors of this PR have not signed the OCA:
- PR author: AmineRaouane
- [email protected]
To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application.
When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.
If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.
@AmineRaouane Please add unit tests following the instructions here.
Take a look at the unit tests for other malware heuristics at tests/malware_analyzer/pypi/ and add a similar one for this new heuristic.
For small and standalone functions, you can add test cases to the docstring itself. You can find an example here.
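As an illustration of the docstring-test style, here is a sketch using a hypothetical helper. `common_prefix_length` is not a function from this PR; it only shows how `doctest` examples can sit next to a small standalone function.

```python
import doctest


def common_prefix_length(first: str, second: str, limit: int = 4) -> int:
    """Return the length of the prefix shared by two names, capped at ``limit``.

    >>> common_prefix_length("requests", "reqeusts")
    3
    >>> common_prefix_length("numpy", "pandas")
    0
    >>> common_prefix_length("django", "django")
    4
    """
    length = 0
    for a, b in zip(first, second):
        if a != b or length == limit:
            break
        length += 1
    return length


if __name__ == "__main__":
    # Runs every example in the docstrings of this module.
    doctest.testmod()
```

Running the module directly, or invoking `python -m doctest` on the file, checks the examples embedded in the docstring.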
Would it be possible to make the path to the custom list of packages configurable through defaults.ini? Our configurations for heuristic analyzers live under the [heuristic.pypi] section of that file. Check out some of the heuristics that use it to get an idea (anomalous_version.py, high_release_frequency.py). I do something similar with paths in the semgrep PR.
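A rough sketch of what this suggestion could look like, using the standard-library `configparser` as a stand-in for the project's own configuration loading. The key name `popular_packages_path`, the function name, and the convention that an empty value means "use the bundled list" are all assumptions for illustration.

```python
import configparser
from pathlib import Path

# Hypothetical excerpt of defaults.ini; the key name is an assumption.
DEFAULTS_INI = """
[heuristic.pypi]
# Leave empty to use the bundled list of popular packages.
popular_packages_path =
"""


def resolve_popular_packages_path(
    config: configparser.ConfigParser, default_path: Path
) -> Path:
    """Return the configured custom path, or the default when none is set."""
    custom = config.get(
        "heuristic.pypi", "popular_packages_path", fallback=""
    ).strip()
    return Path(custom) if custom else default_path


config = configparser.ConfigParser()
config.read_string(DEFAULTS_INI)
path = resolve_popular_packages_path(config, Path("popular_packages.txt"))
```

Keeping the fallback in one place means the heuristic does not need to know whether the list came from the default file or from a user override.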
@AmineRaouane As a general comment: Please wait for reviewers to mark the issues they have raised as resolved. This way we can more easily ensure that an appropriate resolution has been reached.