BadTokenizationError when a code block uses tab indenting and is embedded in a list

Open ffrank opened this issue 1 year ago • 6 comments

With MD010 (no-hard-tabs) disabled, it is possible to lint this document (where line 3 starts with a tab character):

Consider this code:

        code block here

The following document is not supported:

- Consider this code:

	code block here

It results in the following:

$ python3 main.py scan-stdin </tmp/tab-issue-example


Unexpected Error(BadTokenizationError): A project assertion failed on line 3 of the current document.

I had expected that tabbed blocks can be handled consistently whether they are in list context or not.

ffrank avatar Feb 29 '24 20:02 ffrank

This specific instance is fixed. We have trouble finding users who actively use tabs in their Markdown documents, so please let us know if you find any other issues. Note that we have an extensive suite of tests that do use tab characters, but it always helps to have users who are using that feature.

jackdewinter avatar Mar 19 '24 01:03 jackdewinter

This has been released. Looking for verification.

jackdewinter avatar Mar 23 '24 01:03 jackdewinter

Thank you. It seems I can still reproduce the issue.

felix@jammy:~/mgmt$ pipenv run pymarkdownlnt scan basic.md


Unexpected Error(BadTokenizationError): An unhandled error occurred processing the document.
felix@jammy:~/mgmt$ pipenv run pip show pymarkdownlnt
Name: pymarkdownlnt
Version: 0.9.18
Summary: A GitHub Flavored Markdown compliant Markdown linter.
Home-page: https://github.com/jackdewinter/pymarkdown
Author: Jack De Winter
Author-email: [email protected]
License:
Location: /home/felix/.local/share/virtualenvs/mgmt-kvFocuVj/lib/python3.10/site-packages
Requires: application-properties, columnar, typing-extensions
Required-by:
felix@jammy:~/mgmt$

(basic.md holds the example from the original report)
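
Copying the example out of the issue text can silently turn the tab into spaces, which would hide the problem. A minimal sketch for regenerating basic.md with a literal tab on line 3; the helper name `write_repro` is made up for illustration, and the document text is the one from the original report:

```python
from pathlib import Path

# The document from the original report. Line 3 starts with a hard tab
# ("\t"), not with spaces.
REPRO = "- Consider this code:\n\n\tcode block here\n"


def write_repro(path: Path) -> Path:
    """Write the reproduction document to `path` (hypothetical helper)."""
    # Writing bytes keeps the tab and the line endings exactly as given.
    path.write_bytes(REPRO.encode("utf-8"))
    return path


if __name__ == "__main__":
    target = write_repro(Path("basic.md"))
    print(f"wrote {target}; line 3 is {REPRO.splitlines()[2]!r}")
```

The scan command shown above can then be run against the generated file.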

ffrank avatar Mar 23 '24 20:03 ffrank

Trying to figure out how to reproduce this. Based on your initial report, I added two tests, test_extra_042x and test_extra_042a, for the parsing itself. I also added two more tests, issue-1015-positive and issue-1015-negative, to the core tests, and this pull request shows them being executed on Windows, Linux, and macOS without issues.

Can I get you to execute pipenv run pymarkdownlnt --stack-trace scan basic.md instead of the command line you used above? The results from that command may give me more to go on.

jackdewinter avatar Mar 30 '24 17:03 jackdewinter

Hey, thanks for your consideration. To be extra sure, I created a fresh pipenv directory, with the same result. Here is the requested output:

felix@jammy:~/fresh-env$ pipenv run pymarkdownlnt --stack-trace scan basic.md
Application logging set to 'DEBUG'.
DEBUG:pymarkdown.application_configuration_helper:Looking for any standard python configuration files.
DEBUG:pymarkdown.application_configuration_helper:Looking for application specific configuration files.
DEBUG:pymarkdown.application_configuration_helper:Attempting to find/load '/home/felix/fresh-env/.pymarkdown' as a default JSON configuration file.
DEBUG:pymarkdown.application_configuration_helper:Attempting to find/load '/home/felix/fresh-env/.pymarkdown.yaml' as a default YAML configuration file.
DEBUG:pymarkdown.application_configuration_helper:Attempting to find/load '/home/felix/fresh-env/.pymarkdown.yml' as a default YAML configuration file.
DEBUG:pymarkdown.application_configuration_helper:No default configuration files were loaded.
DEBUG:application_properties.application_properties:property_name=mode.strict-config
DEBUG:application_properties.application_properties:property_name=mode.return_code_scheme
INFO:pymarkdown.main:Configuration loaded and applied.  Initial state setup completed.
DEBUG:application_properties.application_properties:property_name=log.file
DEBUG:application_properties.application_properties:property_name=log.level


Unexpected Error(BadTokenizationError): An unhandled error occurred processing the document.
Traceback (most recent call last):
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 194, in __parse_blocks_pass
    ) = self.__parse_blocks_pass_next_line(
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 255, in __parse_blocks_pass_next_line
    ) = self.__main_pass_did_not_start_close(
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 365, in __main_pass_did_not_start_close
    ) = ContainerBlockProcessor.parse_line_for_container_blocks(
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/container_blocks/container_block_processor.py", line 166, in parse_line_for_container_blocks
    ContainerBlockLeafProcessor.handle_leaf_tokens(
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/container_blocks/container_block_leaf_processor.py", line 85, in handle_leaf_tokens
    ContainerBlockLeafProcessor.__process_leaf_tokens(
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/container_blocks/container_block_leaf_processor.py", line 182, in __process_leaf_tokens
    ContainerBlockLeafProcessor.__parse_line_for_leaf_blocks(
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/container_blocks/container_block_leaf_processor.py", line 265, in __parse_line_for_leaf_blocks
    or LeafBlockProcessorParagraph.parse_paragraph(
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/leaf_blocks/leaf_block_processor_paragraph.py", line 89, in parse_paragraph
    ) = LeafBlockProcessorParagraph.__handle_paragraph_prep(
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/leaf_blocks/leaf_block_processor_paragraph.py", line 447, in __handle_paragraph_prep
    LeafBlockProcessorParagraph.__paragraph_prep_whitespace_with_tab(
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/leaf_blocks/leaf_block_processor_paragraph.py", line 506, in __paragraph_prep_whitespace_with_tab
    assert split_tab_with_block_quote_suffix
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 131, in __transform
    first_pass_results = self.__parse_blocks_pass(do_add_end_of_stream_token)
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 204, in __parse_blocks_pass
    raise BadTokenizationError(error_message) from this_exception
pymarkdown.general.bad_tokenization_error.BadTokenizationError: A project assertion failed on line 3 of the current document.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/main.py", line 432, in main
    scan_result = self.__scan_files_if_no_errors(
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/main.py", line 367, in __scan_files_if_no_errors
    did_fix_any_files, did_fail_any_file = fsh.process_files_to_scan(
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/file_scan_helper.py", line 101, in process_files_to_scan
    did_succeed = self.__scan_specific_file(next_file, next_file)
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/file_scan_helper.py", line 144, in __scan_specific_file
    self.__scan_file(source_provider, next_file_name)
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/file_scan_helper.py", line 169, in __scan_file
    actual_tokens = self.__tokenizer.transform_from_provider(
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 99, in transform_from_provider
    return self.__transform(do_add_end_of_stream_token)
  File "/home/felix/.local/share/virtualenvs/fresh-env-kjmZxv7G/lib/python3.10/site-packages/pymarkdown/general/tokenized_markdown.py", line 147, in __transform
    raise BadTokenizationError(
pymarkdown.general.bad_tokenization_error.BadTokenizationError: An unhandled error occurred processing the document.

In case it matters:

felix@jammy:~/fresh-env$ pipenv run python --version
Python 3.10.12

ffrank avatar Apr 01 '24 16:04 ffrank

FWIW I tried to find the mentioned tests in https://github.com/jackdewinter/pymarkdown/actions/runs/8492183767/job/23264927781?pr=1050 and could not.

Trying to figure out how to reproduce this. Based on your initial report, I added two tests, test_extra_042x and test_extra_042a, for the parsing itself. I also added two more tests, issue-1015-positive and issue-1015-negative, to the core tests, and this pull request shows them being executed on Windows, Linux, and macOS without issues.

ffrank avatar Apr 01 '24 16:04 ffrank

@ffrank greetings... I know this might sound monotonous, but could you try it with the latest release?

If it still fails, can you attach the exact file to this ticket so I can try to reproduce it with that? I am at a loss here as to how to repro it on a "not-frank's" machine.

jackdewinter avatar May 08 '24 03:05 jackdewinter

Not at all, thank you for following up on this.

And double thank you for fixing the most basic reproduction case.

$ pipenv run pymarkdownlnt scan basic.md
basic.md:1:1: MD041: First line in file should be a top level heading (first-line-heading,first-line-h1)
basic.md:3:1: MD010: Hard tabs [Column: 1] (no-hard-tabs)

However, the issue that brought me here persists. I reproduce it with this new basic.md:

- Consider this code:

        ```
                code block here
        ```
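
The tabs in this example are easy to lose when copying it. A sketch that rebuilds the document and reports which lines start with a hard tab; it assumes one tab before each fence and two tabs before the code line, which is only a guess based on how the listing above is indented, and `hard_tab_lines` is a hypothetical helper:

```python
# Assumed layout: one tab before each fence, two tabs before the code line.
# The real file may differ; adjust NEW_REPRO if so.
FENCE = "`" * 3
NEW_REPRO = (
    "- Consider this code:\n"
    "\n"
    f"\t{FENCE}\n"
    "\t\tcode block here\n"
    f"\t{FENCE}\n"
)


def hard_tab_lines(text: str) -> list[int]:
    """Return the 1-based numbers of all lines that begin with a tab."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.startswith("\t")
    ]


print(hard_tab_lines(NEW_REPRO))  # [3, 4, 5]
```

Running the same helper over the contents of a file on disk is a quick way to confirm that an editor has not replaced the tabs with spaces.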

For a more realistic if more complex reproduction, you can check out this branch: https://github.com/ffrank/mgmt/tree/noruby

In the root of that repo, I can reproduce the issue with

pipenv run pymarkdownlnt scan docs/language-guide.md

If I get some time, I will try to find out if I can build on your existing fix in order to figure this one out.

ffrank avatar May 10 '24 21:05 ffrank

I have spelunked the debug output (and enabled some more of it in some of the functions near the broken assertion), but it's hard for me to wrap my head around the logic for handling indentation.

I got a hunch though: Is it possible that the logic works on an assumption that indentation levels will either be all soft tabs, or alternating between all tabs (with a width of 4 characters) and "all tabs plus two spaces"?

Trying to visualize my impression of what the parser might assume:

<---content--->
SS<---content--->
SS<---content--->
TTTT<---content--->
TTTTSS<---content--->
TTTT<---content--->
<---content--->

Where TTTT is a tab stop and S is a space.
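
For reference, CommonMark does not treat a tab as four literal spaces: where indentation matters, a tab advances to the next column that is a multiple of 4. A minimal sketch of that rule, which may help when comparing the diagram above with what a parser sees. This is not pymarkdown's actual code, and `indent_column` is a hypothetical helper:

```python
def indent_column(line: str, tab_stop: int = 4) -> int:
    """Return the column reached after the leading spaces and tabs of `line`."""
    column = 0
    for char in line:
        if char == "\t":
            # A tab jumps to the next multiple of tab_stop.
            column += tab_stop - (column % tab_stop)
        elif char == " ":
            column += 1
        else:
            break
    return column


print(indent_column("\tcode"))    # 4
print(indent_column("  \tcode"))  # 4 (the tab only fills the 2 remaining columns)
print(indent_column("\t  code"))  # 6
```

Under that rule, the tab in the original example reaches column 4, which is only 2 columns past the content start of a `- ` list item. That is short of the 4 columns an indented code block needs, which would be consistent with the traceback going through the paragraph handler.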

ffrank avatar May 20 '24 12:05 ffrank

I should add that this is the assertion that fails:

AssertionError: Adjusted original line must be defined by now.

ffrank avatar May 20 '24 12:05 ffrank

I just committed a change that seems to fix this issue. I downloaded the sample page that you noted, and it now scans cleanly.

Can you double check?

jackdewinter avatar May 26 '24 02:05 jackdewinter

And I just released a new version a couple of days ago.

jackdewinter avatar Jun 02 '24 17:06 jackdewinter

Closing due to lack of response.

jackdewinter avatar Jun 08 '24 20:06 jackdewinter