refactor: split _markitdown.py into modular components

Open • t3tra-dev opened this issue 1 year ago • 1 comment

Description

This PR addresses the growing complexity of _markitdown.py by splitting it into smaller, more focused modules. The changes improve code organization and maintainability while preserving all existing functionality.

Changes

  • Created a new converters/ package to house different converter implementations
  • Split converters into logical groups (document, web, media, text, archive)
  • Moved core MarkItDown class functionality to core.py
  • Separated exception classes into exceptions.py
  • Updated imports and tests to reflect new structure

Testing

  • All existing tests pass without modification
  • Verified no functionality changes

Implementation Details

The refactoring follows these principles:

  1. Single Responsibility: Each module handles a specific type of conversion
  2. Open/Closed: New converters can be added without modifying existing code
  3. Interface Segregation: Clear base class and consistent converter interface
  4. Dependency Inversion: Core MarkItDown class depends on abstractions
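As a rough illustration of how these principles could fit together, here is a minimal sketch. It is not the code from this PR: `BaseConverter`, `TxtConverter`, `CsvConverter`, and `Dispatcher` are hypothetical names chosen for the example, and the real converter interface may differ.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class BaseConverter(ABC):
    """Hypothetical converter interface (not the project's actual base class)."""

    @abstractmethod
    def convert(self, filename: str, content: str) -> Optional[str]:
        """Return Markdown text, or None if this converter does not apply."""


class TxtConverter(BaseConverter):
    """Single responsibility: only handles .txt input."""

    def convert(self, filename: str, content: str) -> Optional[str]:
        if not filename.endswith(".txt"):
            return None
        return content


class CsvConverter(BaseConverter):
    """Single responsibility: only handles .csv input (naive comma split)."""

    def convert(self, filename: str, content: str) -> Optional[str]:
        if not filename.endswith(".csv"):
            return None
        rows = [line.split(",") for line in content.strip().splitlines()]
        if not rows:
            return ""
        header, body = rows[0], rows[1:]
        lines = [
            "| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return "\n".join(lines)


class Dispatcher:
    """Depends only on the BaseConverter abstraction, never on concrete classes."""

    def __init__(self) -> None:
        self._converters: List[BaseConverter] = []

    def register(self, converter: BaseConverter) -> None:
        # Open/closed: new formats are added by registering, not by editing convert().
        self._converters.append(converter)

    def convert(self, filename: str, content: str) -> str:
        for converter in self._converters:
            result = converter.convert(filename, content)
            if result is not None:
                return result
        raise ValueError(f"No converter accepted {filename!r}")


dispatcher = Dispatcher()
dispatcher.register(TxtConverter())
dispatcher.register(CsvConverter())
print(dispatcher.convert("table.csv", "x,y\n1,2"))
```

In this sketch, supporting a new format means writing one more `BaseConverter` subclass and registering it; the dispatcher itself stays untouched.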

Migration Notes

This is a non-breaking change as all public APIs remain unchanged. Internal imports are updated to reflect the new structure.
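One common way to keep a public API stable after splitting a module is to re-export the moved names from the package's `__init__.py`. The sketch below demonstrates that pattern on a throwaway package built in a temporary directory. Everything in it (`demo_pkg`, `Engine`, `UnsupportedFormat`) is hypothetical and only stands in for the real package layout, which this example does not reproduce.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package layout; names are illustrative, not taken from the PR.
FILES = {
    "demo_pkg/__init__.py": (
        "# Re-export so callers can keep importing from the package root.\n"
        "from .core import Engine\n"
        "from .exceptions import UnsupportedFormat\n"
        "__all__ = ['Engine', 'UnsupportedFormat']\n"
    ),
    "demo_pkg/exceptions.py": (
        "class UnsupportedFormat(Exception):\n"
        "    pass\n"
    ),
    "demo_pkg/core.py": (
        "from .exceptions import UnsupportedFormat\n"
        "\n"
        "class Engine:\n"
        "    def convert(self, name):\n"
        "        if not name.endswith('.txt'):\n"
        "            raise UnsupportedFormat(name)\n"
        "        return 'ok'\n"
    ),
}


def build_package(root: Path) -> None:
    """Write the demo package files under root."""
    for relative_path, source in FILES.items():
        path = root / relative_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(source)


tmp_dir = tempfile.mkdtemp()
build_package(Path(tmp_dir))
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

# Callers import from the package root, unaware of the internal split.
demo_pkg = importlib.import_module("demo_pkg")
engine = demo_pkg.Engine()
print(engine.convert("notes.txt"))
```

Because the root package re-exports `Engine` and `UnsupportedFormat`, code written against the pre-split import path keeps working, while internal modules import from `core` and `exceptions` directly.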

t3tra-dev, Jan 03 '25 11:01

@microsoft-github-policy-service agree

t3tra-dev, Jan 03 '25 12:01

Thanks for the work on this. It was included in a recent refactor for 0.1.0a1.

afourney, Mar 06 '25 21:03