feat: Add batch processing capability for directory conversion

Open HossyWorlds opened this issue 5 months ago • 6 comments

Add batch processing capability for directory conversion

This PR is related to #1371.

Changes Made

This PR adds batch processing functionality to the MarkItDown CLI, allowing users to convert multiple files in a directory to Markdown format in a single operation.

New CLI Options

  • -b, --batch: Enable batch processing mode
  • -r, --recursive: Process subdirectories recursively
  • --types: Filter by specific file extensions (e.g., pdf,docx,pptx)
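For readers who want to picture how these flags fit together, here is a minimal argparse sketch. It is not the code from this PR: build_parser and parse_types are hypothetical names, and only the flag spellings (-b/--batch, -r/--recursive, --types, --output) are taken from the description and usage examples.

```python
import argparse
from typing import Optional, Set


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the flag names mirror the PR description.
    parser = argparse.ArgumentParser(prog="markitdown")
    parser.add_argument("path", nargs="?", help="input file or directory")
    parser.add_argument("-o", "--output", help="output file or directory")
    parser.add_argument("-b", "--batch", action="store_true",
                        help="enable batch processing mode")
    parser.add_argument("-r", "--recursive", action="store_true",
                        help="process subdirectories recursively")
    parser.add_argument("--types",
                        help="comma-separated extensions, e.g. pdf,docx,pptx")
    return parser


def parse_types(raw: Optional[str]) -> Optional[Set[str]]:
    """Turn 'pdf, .DOCX' into {'.pdf', '.docx'}; None means no filter."""
    if not raw:
        return None
    cleaned = (part.strip().lstrip(".").lower() for part in raw.split(","))
    return {"." + ext for ext in cleaned if ext}


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.batch, args.recursive, parse_types(args.types))
```

Normalizing the extensions to lowercase with a leading dot makes them directly comparable to pathlib's Path.suffix.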

Implementation Details

  • Added batch processing logic to __main__.py
  • Maintains directory structure in output
  • Supports all existing MarkItDown file formats
  • Integrates seamlessly with existing options (--use-plugins, --use-docintel, etc.)
  • Provides progress reporting and error handling
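The file discovery and "maintains directory structure" points above can be sketched roughly as follows. This is an illustration under assumptions, not the logic added to __main__.py: iter_input_files and mirrored_output_path are hypothetical helpers, and naming each output by appending .md to the input name is one possible choice.

```python
from pathlib import Path
from typing import Iterable, Iterator, Optional


def iter_input_files(
    root: Path,
    recursive: bool = False,
    types: Optional[Iterable[str]] = None,
) -> Iterator[Path]:
    """Yield files under root in sorted order, optionally filtered by extension.

    types holds dotted extensions such as {".pdf", ".docx"} (an assumed format).
    """
    wanted = {t.lower() for t in types} if types else None
    candidates = root.rglob("*") if recursive else root.glob("*")
    for path in sorted(candidates):
        if path.is_file() and (wanted is None or path.suffix.lower() in wanted):
            yield path


def mirrored_output_path(src: Path, input_root: Path, output_root: Path) -> Path:
    """Map input_root/a/b.pdf to output_root/a/b.pdf.md, keeping subdirectories."""
    relative = src.relative_to(input_root)
    return output_root / relative.parent / (relative.name + ".md")
```

A caller would create mirrored_output_path(...).parent with mkdir(parents=True, exist_ok=True) before writing each converted file.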

User Pain Points Solved

  • Efficiency: Eliminates the need to run individual commands for each file
  • Consistency: Ensures all files are processed with the same settings
  • Scalability: Handles large document collections efficiently
  • Workflow Integration: Better integration with automated processing pipelines

Usage Examples

# Basic batch processing
markitdown --batch ./documents --output ./converted

# Recursive processing with file type filter
markitdown --batch ./documents --recursive --types pdf,docx,pptx --output ./converted

# With existing options
markitdown --batch ./documents --use-plugins --output ./converted

Testing

All tests pass successfully:

  • ✅ Existing functionality tests (single file conversion, stdin processing, etc.)
  • ✅ New batch processing tests
  • ✅ Error handling tests
  • ✅ Integration tests with existing options
  • ✅ Backward compatibility verified

Test Coverage

  • Added comprehensive CLI tests in test_cli_misc.py
  • Verified existing functionality remains intact
  • Tested error cases and edge conditions
  • Confirmed proper integration with existing options

Backward Compatibility

This change is fully backward compatible:

  • All existing CLI commands continue to work as before
  • No breaking changes to the API
  • Existing options (--use-plugins, --use-docintel, etc.) work seamlessly with batch mode

Files Modified

  • packages/markitdown/src/markitdown/__main__.py: Added batch processing logic
  • packages/markitdown/tests/test_cli_misc.py: Added comprehensive tests for new functionality

HossyWorlds avatar Jul 19 '25 12:07 HossyWorlds

@microsoft-github-policy-service agree

HossyWorlds avatar Jul 19 '25 12:07 HossyWorlds

@tifilipebr Thank you for reviewing.

I addressed the feedback about reusing the existing extension references and separating file validation: the hardcoded extension list is removed, and the batch code now relies on the existing validation system.

HossyWorlds avatar Jul 20 '25 03:07 HossyWorlds

Hey, really looking forward to this getting merged!

I ran a quick test and found a potential issue with the current use of with_suffix('.md') in _handle_batch_processing. It replaces the original file suffix, so files that share a name but have different extensions overwrite each other.

Here’s the test I ran:

~/markitdown pr-1372* python-3.12.3 ❯ mkdir test
~/markitdown pr-1372* python-3.12.3 ❯ touch test/test.md test/test.txt test/test.py
~/markitdown pr-1372* python-3.12.3 ❯ markitdown -b test
Found 3 files to process
[1/3] Processing: test.md
✓ Success: test.md
[2/3] Processing: test.py
✓ Success: test.py
[3/3] Processing: test.txt
✓ Success: test.txt

Batch processing complete!
Success: 3 files
Failed: 0 files
Unsupported: 0 files
Output directory: test/converted
~/markitdown pr-1372* python-3.12.3 ❯ ls test/converted
test.md

Because with_suffix('.md') replaces the suffix, all files end up saved as test.md in the output directory, overwriting each other.

I think it would be better to append .md instead of replacing the suffix, or at least provide an option to control this behavior with proper error handling.
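The difference between replacing and appending the suffix can be reproduced with pathlib alone. The snippet below is a standalone sketch, not the body of _handle_batch_processing; replaced_name and appended_name are hypothetical helpers used only to show the two naming schemes on the three files from the test above.

```python
from pathlib import Path


def replaced_name(src: Path) -> str:
    """Naming reported above: the original suffix is replaced by .md."""
    return src.with_suffix(".md").name


def appended_name(src: Path) -> str:
    """Suggested alternative: keep the original suffix and append .md."""
    return src.name + ".md"


sources = [Path("test/test.md"), Path("test/test.py"), Path("test/test.txt")]

# All three inputs collapse to one output name, so later writes clobber earlier ones.
print({replaced_name(p) for p in sources})
# → {'test.md'}

# Appending keeps the names distinct.
print(sorted(appended_name(p) for p in sources))
# → ['test.md.md', 'test.py.md', 'test.txt.md']
```

Appending yields names like test.md.md for inputs that are already Markdown, which is unattractive but unambiguous; an option to choose between the two schemes would need a collision check on the replacing path.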

janthmueller avatar Jul 31 '25 23:07 janthmueller

@janthmueller Thank you for reviewing!! I've fixed it!!

HossyWorlds avatar Aug 01 '25 09:08 HossyWorlds

Curious if this will be merged soon as it would be a great feature to have out of the box!

tomtom215 avatar Sep 02 '25 08:09 tomtom215

Bump

anoblet avatar Nov 13 '25 01:11 anoblet