feat: Add batch processing capability for directory conversion
This PR is related to #1371
Changes Made
This PR adds batch processing functionality to the MarkItDown CLI, allowing users to convert multiple files in a directory to Markdown format in a single operation.
New CLI Options
- -b, --batch: Enable batch processing mode
- -r, --recursive: Process subdirectories recursively
- --types: Filter by specific file extensions (e.g., pdf,docx,pptx)
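For reference, a minimal sketch of how these three options could be declared with argparse. It is illustrative only and is not the code added to __main__.py: build_parser, parse_types, and the positional path argument are hypothetical names, and normalizing extensions to a lowercase, dot-prefixed set is an assumption.

```python
import argparse


def parse_types(value: str) -> set[str]:
    """Turn 'pdf,docx' into {'.pdf', '.docx'} (hypothetical helper)."""
    return {
        "." + item.strip().lower().lstrip(".")
        for item in value.split(",")
        if item.strip()
    }


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the options listed above."""
    parser = argparse.ArgumentParser(prog="markitdown")
    parser.add_argument("path", help="input file, or a directory in batch mode")
    parser.add_argument("-o", "--output", help="output file or directory")
    parser.add_argument("-b", "--batch", action="store_true",
                        help="enable batch processing mode")
    parser.add_argument("-r", "--recursive", action="store_true",
                        help="process subdirectories recursively")
    parser.add_argument("--types", type=parse_types, default=None,
                        help="comma-separated extensions, e.g. pdf,docx,pptx")
    return parser


args = build_parser().parse_args(
    ["--batch", "./documents", "--recursive",
     "--types", "pdf,docx,pptx", "--output", "./converted"]
)
print(args.batch, args.recursive, sorted(args.types))
```

Parsing --types into a set up front keeps the later per-file check to a single membership test on the file suffix.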
Implementation Details
- Added batch processing logic to __main__.py
- Maintains directory structure in output
- Supports all existing MarkItDown file formats
- Integrates seamlessly with existing options (--use-plugins, --use-docintel, etc.)
- Provides progress reporting and error handling
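As a rough illustration of the file discovery and directory mirroring described above, here is a minimal sketch using pathlib. It is not the code in __main__.py: collect_files and output_path_for are hypothetical names, and appending .md to the full file name is an assumed naming choice.

```python
from pathlib import Path
from typing import Optional


def collect_files(root: Path, recursive: bool = False,
                  types: Optional[set[str]] = None) -> list[Path]:
    """List files under root, optionally recursing and filtering by suffix.

    types is assumed to hold lowercase, dot-prefixed extensions such as '.pdf'.
    """
    pattern = "**/*" if recursive else "*"
    files = [p for p in root.glob(pattern) if p.is_file()]
    if types is not None:
        files = [p for p in files if p.suffix.lower() in types]
    return sorted(files)


def output_path_for(src: Path, input_root: Path, output_root: Path) -> Path:
    """Mirror src's location relative to input_root under output_root."""
    relative = src.relative_to(input_root)
    # Assumed naming: keep the original extension and append '.md'.
    return output_root / relative.with_name(relative.name + ".md")


print(output_path_for(Path("documents/reports/q1.pdf"),
                      Path("documents"), Path("converted")))
```

A caller would create each target's parent directory (for example with Path.mkdir(parents=True, exist_ok=True)) before writing the converted Markdown.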
User Pain Points Solved
- Efficiency: Eliminates the need to run individual commands for each file
- Consistency: Ensures all files are processed with the same settings
- Scalability: Handles large document collections efficiently
- Workflow Integration: Better integration with automated processing pipelines
Usage Examples
# Basic batch processing
markitdown --batch ./documents --output ./converted
# Recursive processing with file type filter
markitdown --batch ./documents --recursive --types pdf,docx,pptx --output ./converted
# With existing options
markitdown --batch ./documents --use-plugins --output ./converted
Testing
All tests pass successfully:
- ✅ Existing functionality tests (single file conversion, stdin processing, etc.)
- ✅ New batch processing tests
- ✅ Error handling tests
- ✅ Integration tests with existing options
- ✅ Backward compatibility verified
Test Coverage
- Added comprehensive CLI tests in test_cli_misc.py
- Verified existing functionality remains intact
- Tested error cases and edge conditions
- Confirmed proper integration with existing options
Backward Compatibility
This change is fully backward compatible:
- All existing CLI commands continue to work as before
- No breaking changes to the API
- Existing options (--use-plugins, --use-docintel, etc.) work seamlessly with batch mode
Files Modified
- packages/markitdown/src/markitdown/__main__.py: Added batch processing logic
- packages/markitdown/tests/test_cli_misc.py: Added comprehensive tests for new functionality
@microsoft-github-policy-service agree
@tifilipebr Thank you for reviewing.
Addressed the feedback on reusing existing extension references and separating file validation: the hardcoded extension list is removed, and batch mode now relies on the existing validation system.
Hey, really looking forward to this getting merged!
I ran a quick test and found a potential issue with the current use of with_suffix('.md') in _handle_batch_processing. It replaces the original file suffix, which causes files with the same name but different extensions to overwrite each other.
Here’s the test I ran:
~/markitdown pr-1372* python-3.12.3 ❯ mkdir test
~/markitdown pr-1372* python-3.12.3 ❯ touch test/test.md test/test.txt test/test.py
~/markitdown pr-1372* python-3.12.3 ❯ markitdown -b test
Found 3 files to process
[1/3] Processing: test.md
✓ Success: test.md
[2/3] Processing: test.py
✓ Success: test.py
[3/3] Processing: test.txt
✓ Success: test.txt
Batch processing complete!
Success: 3 files
Failed: 0 files
Unsupported: 0 files
Output directory: test/converted
~/markitdown pr-1372* python-3.12.3 ❯ ls test/converted
test.md
Because with_suffix('.md') replaces the suffix, all files end up saved as test.md in the output directory, overwriting each other.
I think it would be better to append .md instead of replacing the suffix, or at least provide an option to control this behavior with proper error handling.
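To make the difference concrete, here is a small standalone sketch comparing the two naming strategies on the same three file names. The helper names replace_suffix and append_suffix are hypothetical and this is not the PR's code; only the pathlib calls are real.

```python
from pathlib import Path


def replace_suffix(src: Path) -> Path:
    """Current behavior: the original extension is replaced by '.md'."""
    return src.with_suffix(".md")


def append_suffix(src: Path) -> Path:
    """Suggested behavior: '.md' is appended to the full file name."""
    return src.with_name(src.name + ".md")


sources = [Path("test.md"), Path("test.py"), Path("test.txt")]

replaced = {replace_suffix(p).name for p in sources}
appended = {append_suffix(p).name for p in sources}

print(sorted(replaced))  # ['test.md']
print(sorted(appended))  # ['test.md.md', 'test.py.md', 'test.txt.md']
```

Appending keeps every output name unique whenever the input names are unique, at the cost of slightly longer names such as test.md.md.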
@janthmueller Thank you for reviewing! I've fixed it.
Curious if this will be merged soon as it would be a great feature to have out of the box!
Bump