feat: Add DOC file support
Summary
Adds support for legacy Microsoft Word DOC files (.doc) to MarkItDown.
Implementation Details
I could not find an out-of-the-box library for DOC-to-Markdown conversion, so I went with a two-step approach: convert the DOC file to DOCX, then convert the DOCX to Markdown using the existing converter module. The one minor issue is dependencies: all the libraries require some external dependency (usually LibreOffice). To avoid installing external dependencies as much as possible, I implemented an OS-specific approach: on Linux it uses the LibreOffice CLI tool, and on Windows it uses MS Word's COM interface.
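For reviewers, the OS-specific first step (DOC to DOCX) can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the names `pick_backend`, `soffice_command`, and `doc_to_docx` are hypothetical, the Windows branch assumes `pywin32` and an installed copy of Word, and treating every non-Windows system as a LibreOffice system is an assumption of the sketch.

```python
import platform
import subprocess
from pathlib import Path

# Word's WdSaveFormat value for the default .docx format (wdFormatDocumentDefault).
WD_FORMAT_DOCX = 16


def pick_backend(system=None):
    """Return 'word_com' on Windows, otherwise 'libreoffice' (hypothetical helper)."""
    system = system or platform.system()
    return "word_com" if system == "Windows" else "libreoffice"


def soffice_command(src, out_dir):
    """Build the headless LibreOffice command that converts src to DOCX."""
    return [
        "soffice", "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(src),
    ]


def doc_to_docx(src, out_dir):
    """Convert a .doc file to .docx and return the path of the new file."""
    src, out_dir = Path(src).resolve(), Path(out_dir).resolve()
    target = out_dir / (src.stem + ".docx")
    if pick_backend() == "word_com":
        # Imported lazily so the module still loads where pywin32 is absent.
        import win32com.client
        word = win32com.client.Dispatch("Word.Application")
        try:
            doc = word.Documents.Open(str(src))
            doc.SaveAs2(str(target), FileFormat=WD_FORMAT_DOCX)
            doc.Close()
        finally:
            word.Quit()
    else:
        # Assumes the `soffice` binary is on PATH.
        subprocess.run(soffice_command(src, out_dir), check=True)
    return target
```

The resulting DOCX path would then go through the existing DOCX converter for the second step, for example via `MarkItDown().convert(str(target))`, assuming the usual public entry point.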
Testing
- All existing tests pass
- DocConverter is properly registered, accepts DOC files, and correctly parses their content (tested on Linux and Windows)
Fixes #23, #1220
really need it
I really need this doc conversion!!!
+1 we need it
+1 on needing this doc change, it will resolve a lot of problems :pray:
+1 this PR has been open for a long time. Can't we merge it?
+1 really needed
+1 would be great if this could get merged.
+1 Please, we yearn for this
+1 please, would make it one step easier for us to use this package