pymarkdown
BadTokenizationError when a code block uses tab indenting and is embedded in a list
With md010 disabled, it is possible to lint this document (where line 3 starts with a tab character):
Consider this code:
code block here
The following document is not supported:
- Consider this code:
code block here
It results in the following:
$ python3 main.py scan-stdin </tmp/tab-issue-example
Unexpected Error(BadTokenizationError): A project assertion failed on line 3 of the current document.
I had expected that tabbed blocks can be handled consistently whether they are in list context or not.
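Because tab characters are easy to lose when copying the examples above, here is a minimal sketch that writes both documents with an explicit `\t`. It assumes a blank line separates the first line from the code line, which is what makes the tab-indented line "line 3"; the names `WORKING_DOC`, `FAILING_DOC`, and `write_example` are made up for this illustration.

```python
from pathlib import Path

# Assumption: line 2 is blank, so the tab-indented code line is line 3,
# matching "line 3 starts with a tab character" in the report.
WORKING_DOC = "Consider this code:\n\n\tcode block here\n"
FAILING_DOC = "- Consider this code:\n\n\tcode block here\n"


def write_example(path: str, text: str) -> Path:
    """Write the example text to `path`, keeping the literal tab intact."""
    target = Path(path)
    target.write_text(text, encoding="utf-8")
    return target


if __name__ == "__main__":
    # Hypothetical file names, chosen only for this sketch.
    write_example("tab-working.md", WORKING_DOC)
    write_example("tab-failing.md", FAILING_DOC)
```

The resulting files can then be passed to the scan command shown above.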
This specific instance is fixed. We have trouble finding users who actively use tabs in their Markdown documents, so please let us know if you find any other issues. Note that we have an extensive suite of tests that do use tab characters, but it always helps to have users who are using that feature.
This has been released. Looking for verification.
Thank you. It seems I can still reproduce the issue.
felix@jammy:~/mgmt$ pipenv run pymarkdownlnt scan basic.md
Unexpected Error(BadTokenizationError): An unhandled error occurred processing the document.
felix@jammy:~/mgmt$ pipenv run pip show pymarkdownlnt
Name: pymarkdownlnt
Version: 0.9.18
Summary: A GitHub Flavored Markdown compliant Markdown linter.
Home-page: https://github.com/jackdewinter/pymarkdown
Author: Jack De Winter
Author-email: [email protected]
License:
Location: /home/felix/.local/share/virtualenvs/mgmt-kvFocuVj/lib/python3.10/site-packages
Requires: application-properties, columnar, typing-extensions
Required-by:
felix@jammy:~/mgmt$
(basic.md holds the example from the original report)
Trying to figure out how to reproduce this. Based on your initial report, I added two tests, test_extra_042x and test_extra_042a, for the parsing itself. I also added two more tests, issue-1015-positive and issue-1015-negative, to the core tests, and this pull request shows them being executed on Windows, Linux, and macOS without issues.
Can I get you to execute pipenv run pymarkdownlnt --stack-trace scan basic.md instead of the command line you used above? The results from that command may give me more to go on.
Hey, thanks for your consideration. To be extra sure, I created a fresh pipenv directory, with the same result. Here is the requested output:
felix@jammy:~/fresh-env$ pipenv run pymarkdownlnt --stack-trace scan basic.md
Application logging set to 'DEBUG'.
DEBUG:pymarkdown.application_configuration_helper:Looking for any standard python configuration files.
DEBUG:pymarkdown.application_configuration_helper:Looking for application specific configuration files.
DEBUG:pymarkdown.application_configuration_helper:Attempting to find/load '/home/felix/fresh-env/.pymarkdown' as a default JSON configuration file.
DEBUG:pymarkdown.application_configuration_helper:Attempting to find/load '/home/felix/fresh-env/.pymarkdown.yaml' as a default YAML configuration file.
DEBUG:pymarkdown.application_configuration_helper:Attempting to find/load '/home/felix/fresh-env/.pymarkdown.yml' as a default YAML configuration file.
DEBUG:pymarkdown.application_configuration_helper:No default configuration files were loaded.
DEBUG:application_properties.application_properties:property_name=mode.strict-config
DEBUG:application_properties.application_properties:property_name=mode.return_code_scheme
INFO:pymarkdown.main:Configuration loaded and applied. Initial state setup completed.
DEBUG:application_properties.application_properties:property_name=log.file
DEBUG:application_properties.application_properties:property_name=log.level
Unexpected Error(BadTokenizationError): An unhandled error occurred processing the document.
Traceback (most recent call last):
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 194, in __parse_blocks_pass
) = self.__parse_blocks_pass_next_line(
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 255, in __parse_blocks_pass_next_line
) = self.__main_pass_did_not_start_close(
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 365, in __main_pass_did_not_start_close
) = ContainerBlockProcessor.parse_line_for_container_blocks(
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/container_blocks/container_block_processor.py", line 166, in parse_line_for_container_blocks
ContainerBlockLeafProcessor.handle_leaf_tokens(
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/container_blocks/container_block_leaf_processor.py", line 85, in handle_leaf_tokens
ContainerBlockLeafProcessor.__process_leaf_tokens(
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/container_blocks/container_block_leaf_processor.py", line 182, in __process_leaf_tokens
ContainerBlockLeafProcessor.__parse_line_for_leaf_blocks(
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/container_blocks/container_block_leaf_processor.py", line 265, in __parse_line_for_leaf_blocks
or LeafBlockProcessorParagraph.parse_paragraph(
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/leaf_blocks/leaf_block_processor_paragraph.py", line 89, in parse_paragraph
) = LeafBlockProcessorParagraph.__handle_paragraph_prep(
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/leaf_blocks/leaf_block_processor_paragraph.py", line 447, in __handle_paragraph_prep
LeafBlockProcessorParagraph.__paragraph_prep_whitespace_with_tab(
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/leaf_blocks/leaf_block_processor_paragraph.py", line 506, in __paragraph_prep_whitespace_with_tab
assert split_tab_with_block_quote_suffix
AssertionError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 131, in __transform
first_pass_results = self.__parse_blocks_pass(do_add_end_of_stream_token)
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 204, in __parse_blocks_pass
raise BadTokenizationError(error_message) from this_exception
pymarkdown.general.bad_tokenization_error.BadTokenizationError: A project assertion failed on line 3 of the current document.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/main.py", line 432, in main
scan_result = self.__scan_files_if_no_errors(
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/main.py", line 367, in __scan_files_if_no_errors
did_fix_any_files, did_fail_any_file = fsh.process_files_to_scan(
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/file_scan_helper.py", line 101, in process_files_to_scan
did_succeed = self.__scan_specific_file(next_file, next_file)
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/file_scan_helper.py", line 144, in __scan_specific_file
self.__scan_file(source_provider, next_file_name)
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/file_scan_helper.py", line 169, in __scan_file
actual_tokens = self.__tokenizer.transform_from_provider(
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 99, in transform_from_provider
return self.__transform(do_add_end_of_stream_token)
File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 147, in __transform
raise BadTokenizationError(
pymarkdown.general.bad_tokenization_error.BadTokenizationError: An unhandled error occurred processing the document.
In case it matters:
felix@jammy:~/fresh-env$ pipenv run python --version
Python 3.10.12
FWIW I tried to find the mentioned tests in https://github.com/jackdewinter/pymarkdown/actions/runs/8492183767/job/23264927781?pr=1050 and could not.
@ffrank Greetings. I know this might sound monotonous, but could you try it with the latest release?
If it still fails, can you attach the exact file to this ticket so I can try to reproduce it with that? I am at a loss here on how to repro it on a "not-franks" machine.
Not at all, thank you for following up on this.
And double thank you for fixing the most basic reproduction case.
$ pipenv run pymarkdownlnt scan basic.md
basic.md:1:1: MD041: First line in file should be a top level heading (first-line-heading,first-line-h1)
basic.md:3:1: MD010: Hard tabs [Column: 1] (no-hard-tabs)
However, the issue that brought me here persists. I reproduce it with this new basic.md:
- Consider this code:
```
code block here
```
For a more realistic if more complex reproduction, you can check out this branch: https://github.com/ffrank/mgmt/tree/noruby
In the root of that repo, I can reproduce the issue with
pipenv run pymarkdownlnt scan docs/language-guide.md
If I get some time, I will try to find out if I can build on your existing fix in order to figure this one out.
I have spelunked through the debug output (and enabled some more of it in some of the functions near the broken assertion), but it's hard for me to wrap my head around the logic for handling indentation.
I have a hunch, though: is it possible that the logic assumes indentation levels will either be all soft tabs, or alternate between all tabs (with a width of 4 characters) and "all tabs plus two spaces"?
Trying to visualize my impression of what the parser might assume:
<---content--->
SS<---content--->
SS<---content--->
TTTT<---content--->
TTTTSS<---content--->
TTTT<---content--->
<---content--->
Where TTTT is a tab stop and S is a space.
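To make the tab-stop arithmetic behind that picture concrete, here is a minimal sketch of how leading whitespace can be measured when a tab advances to the next multiple of 4 columns, as CommonMark describes for block structure. This is only an illustration of the rule, not pymarkdown's actual implementation; `TAB_STOP` and `leading_width` are names invented for this example.

```python
TAB_STOP = 4  # CommonMark treats tabs as advancing to tab stops of 4 columns


def leading_width(line: str, start_column: int = 0) -> int:
    """Return the visual width of the leading whitespace of `line`.

    A tab advances to the next multiple of TAB_STOP, so its width depends
    on the column where it starts; a space always advances by one column.
    """
    column = start_column
    for ch in line:
        if ch == "\t":
            column += TAB_STOP - (column % TAB_STOP)
        elif ch == " ":
            column += 1
        else:
            break
    return column - start_column


print(leading_width("\tcode"))    # 4: a tab from column 0 reaches column 4
print(leading_width("  \tcode"))  # 4: two spaces, then the tab only adds 2
print(leading_width("\t  code"))  # 6: a tab plus two spaces
```

For a list item whose content starts at column 2 (after `- `), a continuation line that begins with a single tab has a leading width of 4: two columns cover the list indent and the remaining two belong to the content. That is the kind of "split" tab a parser has to account for when tabs and list indentation mix.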
I should add that this is the assertion that fails:
AssertionError: Adjusted original line must be defined by now.
I just committed a change that seems to fix this issue. I downloaded the sample page that you noted, and it now scans cleanly.
Can you double check?
And I just released a new version a couple of days ago.
Closing due to lack of response.