Too much blankspace gets stripped

Open hugopeixoto opened this issue 5 years ago • 1 comments

Coming from https://github.com/forem/forem/issues/8457, I think that ReverseMarkdown strips too much blankspace in some scenarios. Here's a failing test:

  it 'keeps whitespace surrounding links' do
    result = ReverseMarkdown.convert("a\n<a href='1'>link</a>\nis good\nbut blankspace is better")
    expect(result).to eq "a [link](1) is good but blankspace is better\n\n"
  end

The output is a[link](1)is good but blankspace is better\n\n. This happens because the text converter calls remove_border_newlines, and the fact that the middle line is a link means that it will be its own nokogiri node, and the three nodes will be joined with no whitespace. I tried changing remove_border_lines to squeeze instead of removing everything, but this doesn't work: it keeps whitespace in scenarios where it shouldn't:

first<p>second</p>third becomes first\n\nsecond\n\n third\n\n.

I still want to investigate this further, but I decided to post this now to share my findings.

Sep 15 '20 03:09 hugopeixoto

I was curious as to why this didn't happen with <strong> tags. The following test does succeed:

it 'keeps whitespace surrounding bolds' do
  result = ReverseMarkdown.convert("a\n<strong>potato</strong>\nis good\nbut blankspace is better")
  expect(result).to eq "a **potato** is good but blankspace is better\n\n"
end

After some puts debuggering, I see that this one also gets its blankspace trimmed in the first pass, but then cleaner.tidy adds it back again in clean_tag_borders:

converters.convert: "\n\na**potato**is good but blankspace is better\n\n"
cleaner.tidy: "a **potato** is good but blankspace is better\n\n"

So it seems that we're losing information during convert, and assuming that ** would always be preceeded by a blankspace, which gets added in tidy (clean_tag_borders). The same technique wouldn't work for links, since the surrounding blankspace doesn't need to be there for markdown links.

My previous attempt of changing remove_border_lines to use squeeze doesn't work because it would break this test:

it 'squeezes whitespace when using paragraph tags' do
  result = ReverseMarkdown.convert("first paragraph\n<p>second paragraph</p>\nthird\nin separate lines")
  expect(result).to eq "first paragraph\n\nsecond paragraph\n\nthird in separate lines\n\n"
end

with the following diff:

 first paragraph

 second paragraph

-third in separate lines
+ third in separate lines

Sep 15 '20 12:09 hugopeixoto