Added chm file support to markitdown

Open DSCmatter opened this issue 5 months ago • 5 comments

Installation: Users can install with: pip install markitdown[chm] or pip install markitdown[all]

Usage: Works successfully.

I could not run the test suite because of https://github.com/microsoft/markitdown/issues/1364

However, I added the test file(s) and URL for test.chm.

Any suggestions or improvements are welcome.

This addresses issue #14


Can anyone review this?

Thanks

pre-commit tests

DSCmatter avatar Jul 18 '25 21:07 DSCmatter

@microsoft-github-policy-service agree

DSCmatter avatar Jul 18 '25 21:07 DSCmatter

Update: I was able to run the tests, but they reported 20 errors, in code I didn't even touch.

Anyhow, I had to add this to pyproject.toml (line 77):

[tool.pytest.ini_options]
pythonpath = "src"

to make markitdown detectable by pytest.

DSCmatter avatar Jul 19 '25 01:07 DSCmatter

How to fix the issue: To solve the problem where pytest can't detect the markitdown package (which led to 20 errors), you just need to tell pytest where your source code lives.

Add this to your pyproject.toml file (around line 77 or under the existing [tool...] settings):

[tool.pytest.ini_options]
pythonpath = "src"

Why this matters: The project uses a src/ directory structure, so without this setting, pytest doesn't know where to look for the code. That's why you're seeing errors, even if everything is actually working fine.

After adding the lines, run your tests again using:

pytest

Or, if you're using pre-commit hooks:

pre-commit run --all-files
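
To illustrate why the pythonpath setting matters, here is a minimal sketch that uses only the standard library. It assumes that the setting works by putting the listed directory at the front of sys.path, which is how pytest documents the option. The package name demo_src_pkg and both helper functions are hypothetical and are not part of markitdown.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def can_import(module_name: str) -> bool:
    """Return True if module_name is importable with the current sys.path."""
    importlib.invalidate_caches()
    sys.modules.pop(module_name, None)  # force a fresh lookup
    try:
        importlib.import_module(module_name)
        return True
    except ModuleNotFoundError:
        return False


def demo_src_layout() -> tuple[bool, bool]:
    """Try importing a package from a throwaway src/ layout.

    Returns (importable_before, importable_after), where "after" means
    after src/ has been put on sys.path.
    """
    with tempfile.TemporaryDirectory() as project_root:
        src_dir = Path(project_root) / "src"
        package_dir = src_dir / "demo_src_pkg"  # hypothetical package name
        package_dir.mkdir(parents=True)
        (package_dir / "__init__.py").write_text("VALUE = 1\n", encoding="utf-8")

        before = can_import("demo_src_pkg")

        # Assumption: this mirrors the effect of pythonpath = "src"
        sys.path.insert(0, str(src_dir))
        try:
            after = can_import("demo_src_pkg")
        finally:
            # Leave the interpreter state as we found it
            sys.path.remove(str(src_dir))
            sys.modules.pop("demo_src_pkg", None)
    return before, after


if __name__ == "__main__":
    print(demo_src_layout())
```

The import should fail while src/ is missing from sys.path and succeed once it has been added. Installing the package in editable mode is another common way to get the same effect.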

yossefelnggar avatar Jul 19 '25 05:07 yossefelnggar

@yossefelnggar I already did that (check the files changed), and I also mentioned it in my comment above.

Hence, closing issue #1364.

DSCmatter avatar Jul 19 '25 10:07 DSCmatter

Summary of failures after running hatch test:

================================================================= short test summary info ==================================================================
FAILED tests/test_cli_vectors.py::test_output_to_stdout[test_vector2] - AssertionError: CLI exited with error: Traceback (most recent call last):
FAILED tests/test_cli_vectors.py::test_output_to_stdout[test_vector4] - AssertionError: CLI exited with error: Traceback (most recent call last):
FAILED tests/test_cli_vectors.py::test_output_to_stdout[test_vector6] - AssertionError: assert '| 名前 | 年齢 | 住所 |' in '| ?? | ?? | ?? |\n| --- | --- | --- |\n| ???? | 30 | ?? |\n| ???? | 25 | ?? |\n| ??? | 35 | ??? |\n'
FAILED tests/test_cli_vectors.py::test_output_to_file[test_vector0] - UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4634: character maps to <undefined>
FAILED tests/test_cli_vectors.py::test_output_to_file[test_vector2] - AssertionError: CLI exited with error: Traceback (most recent call last):
FAILED tests/test_cli_vectors.py::test_output_to_file[test_vector4] - AssertionError: CLI exited with error: Traceback (most recent call last):
FAILED tests/test_cli_vectors.py::test_output_to_file[test_vector6] - UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 3: character maps to <undefined>
FAILED tests/test_cli_vectors.py::test_output_to_file[test_vector10] - UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 94302: character maps to <undefined>
FAILED tests/test_cli_vectors.py::test_input_from_stdin_without_hints[test_vector2] - AssertionError: CLI exited with error: Traceback (most recent call last):
FAILED tests/test_cli_vectors.py::test_input_from_stdin_without_hints[test_vector4] - AssertionError: CLI exited with error: Traceback (most recent call last):
FAILED tests/test_cli_vectors.py::test_input_from_stdin_without_hints[test_vector6] - AssertionError: assert '| 名前 | 年齢 | 住所 |' in '| ?? | ?? | ?? |\r\n| --- | --- | --- |\r\n| ???? | 30 | ?? |\r\n| ???? | 25 | ?? |\r\n| ??? | 35 | ...
FAILED tests/test_cli_vectors.py::test_convert_url[test_vector2] - AssertionError: CLI exited with error: b'Traceback (most recent call last):\r\n  File "<frozen runpy>", line 198, in _run_module_as_main\r\n  File "<fro...
FAILED tests/test_cli_vectors.py::test_convert_url[test_vector4] - AssertionError: CLI exited with error: b'Traceback (most recent call last):\r\n  File "<frozen runpy>", line 198, in _run_module_as_main\r\n  File "<fro...
FAILED tests/test_cli_vectors.py::test_convert_url[test_vector6] - AssertionError: assert '| 名前 | 年齢 | 住所 |' in '| ?? | ?? | ?? |\r\n| --- | --- | --- |\r\n| ???? | 30 | ?? |\r\n| ???? | 25 | ?? |\r\n| ??? | 35 | ...
FAILED tests/test_cli_vectors.py::test_output_to_file_with_data_uris[test_vector0] - UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4634: character maps to <undefined>
FAILED tests/test_module_misc.py::test_file_uris - AssertionError: assert 'D:\\path\\to\\file.txt' == '/path/to/file.txt'
FAILED tests/test_module_misc.py::test_markitdown_remote - AssertionError: assert '## AutoGen FULL Tutorial with Python (Step-By-Step)' in '[About](https://www.youtube.com/about/)[Press](https://www.youtube.com/...
FAILED tests/test_module_misc.py::test_speech_transcription - AssertionError: assert (('1' in '### audio transcript:\n1 2 3 4') and ('2' in '### audio transcript:\n1 2 3 4') and ('3' in '### audio transcript:\n1 2 ...
FAILED tests/test_module_vectors.py::test_guess_stream_info[test_vector7] - AssertionError: assert None == 'application/vnd.ms-htmlhelp'
FAILED tests/test_module_vectors.py::test_convert_http_uri[test_vector7] - requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://raw.githubusercontent.com/microsoft/markitdown/refs/heads/main/packages/mark...
============================================= 20 failed, 157 passed, 2 skipped, 1 warning in 184.26s (0:03:04) ============================================= 
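
One possible reading of the UnicodeDecodeError and '??' entries above is that UTF-8 text is passing through a legacy Windows code page somewhere in the CLI tests. The sketch below reproduces both symptoms with the table header from the failing assertion. It assumes the code page is cp1252, which the log does not confirm, and the helper names are hypothetical.

```python
from typing import Optional

# Table header taken from the failing assertion in the log
HEADER = "| 名前 | 年齢 | 住所 |"


def decode_utf8_bytes_as_cp1252(text: str) -> Optional[UnicodeDecodeError]:
    """Encode text as UTF-8, then try to decode the bytes as cp1252.

    Returns the UnicodeDecodeError if decoding fails, otherwise None.
    """
    raw = text.encode("utf-8")
    try:
        raw.decode("cp1252")  # assumption: the platform default is cp1252
    except UnicodeDecodeError as exc:
        return exc
    return None


def encode_with_replacement(text: str) -> str:
    """Round-trip text through cp1252, replacing unencodable characters with '?'."""
    return text.encode("cp1252", errors="replace").decode("cp1252")


if __name__ == "__main__":
    error = decode_utf8_bytes_as_cp1252(HEADER)
    if error is not None:
        # Report the offending byte and where it was found
        print(hex(error.object[error.start]), "at position", error.start)
    print(encode_with_replacement(HEADER))
```

If that reading is right, running the tests with the PYTHONUTF8=1 environment variable set, or passing encoding="utf-8" wherever files and subprocess output are read, may be worth trying. Neither has been verified against this test suite.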

DSCmatter avatar Jul 19 '25 10:07 DSCmatter