
WRAP_LIST_ITEMS setting is not respected

Open SebCorbin opened this issue 4 years ago • 1 comments

  • Version 2020.1.16
  • Python version 3.8.7
import html2text

body = """
1. Error exercitationem debitis magni tenetur dolorum inventore ex. Voluptatibus possimus voluptas quibusdam vel facere eaque sit. Et et hic totam aliquam et ut numquam. Omnis qui consectetur reiciendis. Deserunt qui aut mollitia qui. Dolores omnis aut facere sint et rerum.
2. Modi excepturi velit ab fuga dignissimos qui. Et dolorem ut quam consequatur. Quia repellat deleniti et aut quae in. Cum quidem maiores sint suscipit nobis ipsam.
3. Et tenetur sapiente velit. Neque culpa perspiciatis et molestias voluptatem officia rem. Dolorem reprehenderit recusandae nostrum voluptatem nihil et modi neque. Libero et tempore odit. Saepe quo dolorum voluptas. Aliquam illo nam eos qui eum.
"""
h = html2text.HTML2Text()
h.body_width = 80
h.wrap_list_items = True
print(h.handle(body))

Should normally render

1. Error exercitationem debitis magni tenetur dolorum inventore ex.
Voluptatibus possimus voluptas quibusdam vel facere eaque sit. Et et hic totam
aliquam et ut numquam. Omnis qui consectetur reiciendis. Deserunt qui aut
mollitia qui. Dolores omnis aut facere sint et rerum.

2. Modi excepturi velit ab fuga dignissimos qui. Et dolorem ut quam
consequatur. Quia repellat deleniti et aut quae in. Cum quidem maiores sint
suscipit nobis ipsam.

3. Et tenetur sapiente velit. Neque culpa perspiciatis et molestias voluptatem
officia rem. Dolorem reprehenderit recusandae nostrum voluptatem nihil et modi
neque. Libero et tempore odit. Saepe quo dolorum voluptas. Aliquam illo nam eos
qui eum.

But instead it returns unwrapped text.

I suggest changing the end of the skipwrap() function to:

    # If the text begins with a single -, *, or +, followed by a space,
    # or an integer, followed by a ., followed by a space (in either
    # case optionally preceded by whitespace), it's a list; don't wrap,
    # unless explicitly specified.
    return bool(
        config.RE_ORDERED_LIST_MATCHER.match(stripped)
        or config.RE_UNORDERED_LIST_MATCHER.match(stripped)
    ) and not wrap_list_items
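
To make the effect of that change concrete, here is a minimal sketch of only the final check, before and after. The two regexes are stand-ins written for illustration (the real patterns live in html2text's config module and may differ), the function names are hypothetical, and the "current" version assumes the existing return simply lacks the `and not wrap_list_items` part.

```python
import re

# Stand-in patterns, assumed for illustration only; not copied from html2text.
ORDERED = re.compile(r"\d+\.\s")
UNORDERED = re.compile(r"[-*+]\s")


def is_list_item(para: str) -> bool:
    """True if the paragraph starts like an ordered or unordered list item."""
    stripped = para.lstrip()
    return bool(ORDERED.match(stripped) or UNORDERED.match(stripped))


def final_check_current(para: str, wrap_list_items: bool) -> bool:
    # Assumed current behaviour: the flag is ignored, so list items
    # reaching this check are always skipped (left unwrapped).
    return is_list_item(para)


def final_check_proposed(para: str, wrap_list_items: bool) -> bool:
    # Proposed behaviour: skip wrapping only when the flag is off.
    return is_list_item(para) and not wrap_list_items


for flag in (False, True):
    print(
        flag,
        final_check_current("1. Error exercitationem", flag),
        final_check_proposed("1. Error exercitationem", flag),
    )
```

With the flag on, only the proposed version lets the ordered item through to wrapping; with the flag off, both versions leave it alone.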

SebCorbin, Apr 08 '21 09:04

This bug is still present: numbered lists don't respect the wrap_list_items and body_width settings. Note that unordered lists are wrapped correctly; only ordered ones stay unwrapped. Looking at skipwrap(), it seems to react to lists in two places. One is the check shown in the description, which uses the RE_ORDERED_LIST_MATCHER and RE_UNORDERED_LIST_MATCHER regexes. The other comes before it and matches on literal list item characters:

    # I'm not sure what this is for; I thought it was to detect lists,
    # but there's a <br>-inside-<span> case in one of the tests that
    # also depends upon it.
    if stripped[0:1] in ("-", "*") and not stripped[0:2] == "**":
        return not wrap_list_items

I would assume this is why it works correctly for unordered lists.
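
A simplified model of those two checks shows the asymmetry. This is only a sketch: `skipwrap_model` is a hypothetical name, the regexes are stand-ins rather than the real config values, the other conditions in skipwrap() are left out, and the final return is assumed to ignore wrap_list_items as it does without the suggested fix.

```python
import re

# Stand-in patterns, assumed for illustration only; not copied from html2text.
ORDERED = re.compile(r"\d+\.\s")
UNORDERED = re.compile(r"[-*+]\s")


def skipwrap_model(para: str, wrap_list_items: bool) -> bool:
    """Return True if the paragraph should be left unwrapped."""
    stripped = para.lstrip()

    # Early literal check quoted above: honours wrap_list_items,
    # but only for paragraphs starting with "-" or "*".
    if stripped[0:1] in ("-", "*") and not stripped[0:2] == "**":
        return not wrap_list_items

    # Final regex check, assumed here to ignore wrap_list_items.
    return bool(ORDERED.match(stripped) or UNORDERED.match(stripped))


for sample in ("* item", "- item", "1. item", "+ item"):
    print(repr(sample), skipwrap_model(sample, wrap_list_items=True))
```

In this model, "* item" and "- item" return early and respect the flag, while "1. item" falls through to the regex check and is always skipped. A paragraph starting with "+ " falls through the same way.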

So @SebCorbin's fix looks like it should do the trick.

TB-effective, Mar 21 '24 16:03