django-htmlmin icon indicating copy to clipboard operation
django-htmlmin copied to clipboard

Excessive whitespace removal

Open bradbeattie opened this issue 13 years ago • 19 comments

The extra whitespace in <a href="asdf">asdf</a>......<a href="asdf">asdf</a> Should probably collapse down to <a href="asdf">asdf</a>.<a href="asdf">asdf</a> But your code turns it into <a href="asdf">asdf</a><a href="asdf">asdf</a>

(spaces represented as periods here as the markdown is collapsing my whitespace)

bradbeattie avatar Nov 13 '11 21:11 bradbeattie

Hi,

the issue happens also inside p tags that span multiple lines. For example:

<p>this paragraph spans
  multiple lines<p> 

shows on the browser as:

this paragraph spansmultiple lines

Note that not only the multiple leading spaces on the second line are removed, the new line is not converted to a white space, so the words spans and multiple get together.

Great tool BTW.

jcampanello avatar Nov 16 '11 00:11 jcampanello

Hi!

I'm working on this, but I had a problem reproducing the space problem between two tags... I wrote a test to test the exactly same link case that you showed up, but I couldn't reproduce it, the test is passing... could you tell me what version you're using...?

The problem reported by @jcampanello I was able to catch on the tests, but I'll not finish the code right now. I just wanted to give an update for you guys that we are working on that. :)

flavianmissi avatar Nov 16 '11 01:11 flavianmissi

Hi Flavia,

the package got installed as "django_htmlmin-0.5-py2.6.egg" (i.e: html-min v 0.5 on python 2.6).

Do you have an idea of when do you plan to release the fix?

Thanks for the effort!

José Luis

jcampanello avatar Nov 16 '11 13:11 jcampanello

Howdy!

I'll probably solve this bug until the end of this week.

;)

flavianmissi avatar Nov 16 '11 13:11 flavianmissi

Super! Thanks again!

jcampanello avatar Nov 16 '11 13:11 jcampanello

Howdy!

Please, update to 0.5.1 :)

flavianmissi avatar Nov 17 '11 17:11 flavianmissi

Flavia,

did the upgrade. Everything seems ok.

Thanks again!

jcampanello avatar Nov 17 '11 19:11 jcampanello

According to @bradbeattie, there are still issues with white spaces removal between two tags. e.g. <button>First</button> <button>Second</button>

I'm gonna take a look at this, and see if makes sense to fix, as soon as I can :)

flavianmissi avatar Nov 19 '11 13:11 flavianmissi

Note that buttons are just one example. Inline divs will also be an issue, but you can't tell whether a div will be online or not without deep inspection of the CSS and JavaScript. And what about elements with white-space: pre?

bradbeattie avatar Nov 19 '11 17:11 bradbeattie

div's are not supposed to be inline, ever. If you need an inline element, don't use a div.

Anyway, this issue is a little tricky, I'm thinking about it...

fsouza avatar Nov 20 '11 14:11 fsouza

http://www.brunildo.org/test/inline-block.html

bradbeattie avatar Nov 20 '11 16:11 bradbeattie

Right, thanks for sharing :)

fsouza avatar Nov 20 '11 16:11 fsouza

i'm still having the issue. Any news fsouza?

ghost avatar Mar 16 '12 13:03 ghost

not yet, sorry for the delay.

I'll take a look in the next weekend.

fsouza avatar Mar 19 '12 18:03 fsouza

Any news on this?

neoascetic avatar Sep 13 '12 06:09 neoascetic

ping. any news on this ? It is messing up my pages.

atodorov avatar Feb 12 '14 22:02 atodorov

Looking at the code and the tests, I can see why the whitespaces are dumped and why the test passes (it is made so it passes but does not cover the real problem).

In htmlmin/tests/resources/line_break.html the line starts with a whitespace which makes the test pass and gives the illusion that line break and whitespace is properly covered.

I think the htmlmin.minify.html_minify() logic is broken by design, as soon as the source is broken on line breaks it makes the proper handling of line breaks to spaces, multiple spaces, etc very difficult. I see several edge cases like \n where this won't be rendered properly. And of course the simple <p>my text\nbroken in two lines</p> is not rendered properly.

I think that a better way (probably slower though) would be to use beautifulsoup to dive in the dom tree and work on a element base. I'll try to implement a solution along those lines.

hrbonz avatar Feb 17 '14 01:02 hrbonz

I'm coming back with some good news, I started by cleaning up a lot of the tests that were incorrect (trailing spaces just before </body> in htmlmin/tests/test_middleware.py for example). Then, I implemented what I think is a better and stronger way to parse the html source. A big problem I saw with the previous code was in the multiline handling and some edge cases. I think the code now is more robust. It's also faster! I couldn't believe it in the beginning but you can check by yourself, it's most of the time ~40% faster:

running 10000 rounds
with_html_content_in_pre.html
    old style: 38.43s
    recurse: 21.97s (57.16%)
with_textarea.html
    old style: 44.63s
    recurse: 25.22s (56.50%)
without_doctype.html
    old style: 34.30s
    recurse: 18.61s (54.27%)
with_entities_in_textarea.html
    old style: 49.32s
    recurse: 29.76s (60.35%)
html5.html
    old style: 58.72s
    recurse: 34.28s (58.39%)
blogpost.html
    old style: 231.60s
    recurse: 143.30s (61.87%)
with_pre.html
    old style: 97.18s
    recurse: 73.11s (75.23%)
with_html_content_in_javascript.html
    old style: 55.58s
    recurse: 32.67s (58.79%)
with_javascript.html
    old style: 54.14s
    recurse: 31.36s (57.92%)
with_html_content_in_textarea.html
    old style: 46.59s
    recurse: 27.07s (58.11%)
with_textarea_with_blank_lines.html
    old style: 44.96s
    recurse: 25.55s (56.82%)
non_ascii_in_excluded_element.html
    old style: 40.09s
    recurse: 22.13s (55.20%)
with_multiple_comments.html
    old style: 41.99s
    recurse: 23.58s (56.16%)
multiple_spaces.html
    old style: 32.43s
    recurse: 16.68s (51.43%)
with_comments_to_exclude.html
    old style: 40.55s
    recurse: 22.58s (55.68%)
with_comments.html
    old style: 40.51s
    recurse: 22.56s (55.68%)
with_menu.html
    old style: 45.32s
    recurse: 26.19s (57.78%)
non_ascii.html
    old style: 36.60s
    recurse: 20.31s (55.49%)
line_break.html
    old style: 28.06s
    recurse: 14.07s (50.15%)
with_blank_lines.html
    old style: 39.96s
    recurse: 22.55s (56.43%)
with_conditional_comments.html
    old style: 42.32s
    recurse: 23.58s (55.73%)
with_multiple_line_comments.html
    old style: 40.68s
    recurse: 22.56s (55.47%)
with_old_doctype.html
    old style: 39.15s
    recurse: 21.03s (53.70%)

Let me know what you guys think about the code in #69

hrbonz avatar Feb 18 '14 19:02 hrbonz

By the way, non-breakable whitespaces (&nbsp;) are removed from output too.

o3bvv avatar Feb 09 '15 09:02 o3bvv