Reorganize predefined recognizers into logical subfolders

Open krikera opened this issue 6 months ago • 4 comments

  • Move country-specific recognizers to country_specific/{country}/ folders
  • Move generic recognizers to generic/ folder
  • Move NER recognizers to ner/ folder
  • Maintain full backward compatibility for all existing imports
  • Update test imports to use new file locations
  • Add comprehensive documentation for new structure

Fixes #1638

Change Description

This PR reorganizes the predefined recognizers in presidio-analyzer into logical subfolders to improve code maintainability and make it easier for contributors to add new recognizers.

What Changed:

Directory Structure:

  • country_specific/ - Organizes recognizers by country (9 countries: US, UK, India, Italy, Australia, Spain, Finland, Poland, Singapore)
  • generic/ - Contains globally applicable recognizers (Credit Card, Crypto, Date, Email, IBAN, IP, Medical License, Phone, URL)
  • ner/ - Contains Named Entity Recognition based recognizers (SpaCy, Stanza, Transformers, GLiNER, Azure AI Language)
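As a rough illustration of how the three top-level folders partition recognizers, the following sketch maps a recognizer's kind to its destination folder. The helper `target_folder`, the `BASE` path, and the lowercase country folder names are assumptions made for this example; they are not taken from the PR itself.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical base folder, used for illustration only.
BASE = PurePosixPath("predefined_recognizers")


def target_folder(kind: str, country: Optional[str] = None) -> PurePosixPath:
    """Return the folder a recognizer of the given kind would live in.

    `kind` is one of "country_specific", "generic" or "ner".
    Lowercasing the country code is an assumption of this sketch.
    """
    if kind == "country_specific":
        if not country:
            raise ValueError("country-specific recognizers need a country")
        return BASE / "country_specific" / country.lower()
    if kind in ("generic", "ner"):
        return BASE / kind
    raise ValueError(f"unknown recognizer kind: {kind!r}")
```

For example, `target_folder("country_specific", "US")` points at a per-country folder, while `target_folder("ner")` points at the shared NER folder.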

Files Moved:

  • 28 country-specific recognizers moved to appropriate country folders
  • 10 generic recognizers moved to generic/ folder
  • 5 NER recognizers moved to ner/ folder

Backward Compatibility:

  • All existing imports continue to work unchanged (e.g., from presidio_analyzer.predefined_recognizers import CreditCardRecognizer)
  • Main __init__.py updated to import from new locations and re-export all classes
  • No breaking changes for existing users
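The re-export approach described above can be sketched with a throwaway package. The example below builds a miniature package in a temporary directory and checks that the old top-level import and the new nested import resolve to the same class. The package name `demo_recognizers`, the module file names, and the empty class bodies are hypothetical stand-ins, not the actual presidio-analyzer files.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical miniature package; all names here are illustrative only.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_recognizers"
files = {
    # The top-level __init__.py imports from the new locations and
    # re-exports the classes, so old import paths keep working.
    "__init__.py": (
        "from .generic.credit_card_recognizer import CreditCardRecognizer\n"
        "from .country_specific.us.us_ssn_recognizer import UsSsnRecognizer\n"
        "__all__ = ['CreditCardRecognizer', 'UsSsnRecognizer']\n"
    ),
    "generic/__init__.py": "",
    "generic/credit_card_recognizer.py": "class CreditCardRecognizer:\n    pass\n",
    "country_specific/__init__.py": "",
    "country_specific/us/__init__.py": "",
    "country_specific/us/us_ssn_recognizer.py": "class UsSsnRecognizer:\n    pass\n",
}
for rel, body in files.items():
    path = pkg / rel
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body)

# Make the temporary package importable.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# Old-style import from the package root.
from demo_recognizers import CreditCardRecognizer as old_path

# New-style import from the reorganized subfolder.
from demo_recognizers.generic.credit_card_recognizer import (
    CreditCardRecognizer as new_path,
)

# Both names refer to the same class object.
same_class = old_path is new_path
```

Because the class is defined once and only re-exported, `isinstance` checks and subclassing behave identically regardless of which import path a caller uses.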

Documentation:

  • Added comprehensive README.md explaining new structure and contribution guidelines
  • Updated CHANGELOG.md under Unreleased section as required
  • Added __init__.py files to all new directories

Test Updates:

  • Fixed 3 test files that imported directly from recognizer module files
  • All tests now use correct import paths for new structure

Benefits:

  • Better Organization: Clear separation makes codebase more maintainable
  • Easier Contributions: Contributors can easily find where to add new recognizers
  • Scalable: Simple to add new countries or recognizer types
  • Well Documented: Clear guidelines for future development

Issue reference

This PR fixes issue #1638

Checklist

  • [x] I have reviewed the contribution guidelines
  • [ ] I have signed the CLA (if required)
  • [x] My code includes unit tests (updated existing tests affected by reorganization)
  • [ ] All unit tests and lint checks pass locally
  • [x] My PR contains documentation updates / additions if required

krikera avatar Jun 22 '25 10:06 krikera

@microsoft-github-policy-service agree

krikera avatar Jun 22 '25 10:06 krikera

/azp run

navalev avatar Jun 23 '25 07:06 navalev

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Jun 23 '25 07:06 azure-pipelines[bot]

Thanks! There are some linting errors here and there. Please check the CI output.

omri374 avatar Jun 23 '25 11:06 omri374

Hi @krikera, we left a few comments. Would you be interested in continuing the work on this? Can we help in any way?

omri374 avatar Jun 30 '25 11:06 omri374

Hi @krikera, would you mind allowing me to push to your branch? I can make the changes there to update the PR.

omri374 avatar Jul 21 '25 10:07 omri374

Closing to continue the work in #1670.

omri374 avatar Jul 23 '25 19:07 omri374