search does not consider space after colon

Open Thinkpositiv opened this issue 7 years ago • 13 comments

I have a PDF with text opened in Sumatra v3.1.2 (German, Win 10). One part of the text is
another:word
At other parts of the text, it is
another: word
(now with a space after the colon).

In the Sumatra search field, I enter "other: " and it finds both occurrences. For "other: word" it finds both occurrences. For "other: word" it finds both occurrences. (Obviously the space(s) after the colon are not taken into account.)

For "other : word" it does not find any occurrence. (Obviously the space(s) before the colon are taken into account.)

I was not able to find any details about the search. Is this a bug or a feature? Btw: it happened with another PDF file, too.

Thinkpositiv avatar Mar 20 '18 21:03 Thinkpositiv

Just tested with a sample PDF in Sumatra 3.1.2 on English Win10 and I can confirm this. Looks like a bug but let's see what the dev says about it.

Edit: Found other bugs related to word boundaries that might be relevant: https://github.com/sumatrapdfreader/sumatrapdf/issues/26 and https://github.com/sumatrapdfreader/sumatrapdf/issues/410 (esp. see pdgessler's comment).

SumatraPeter avatar Mar 22 '18 04:03 SumatraPeter

@SumatraPeter Without a sample of the original issue above I can't easily tell whether the new build addresses the problem. Are you able to retest the example you used to confirm the problem?

GitHubRulesOK avatar Nov 16 '19 01:11 GitHubRulesOK

I no longer have the sample, but all it was was a Word doc that I inserted a few instances of "another:word", "another: word", "another :word" and "another : word" into at random, then saved as a PDF.

SumatraPeter avatar Nov 16 '19 21:11 SumatraPeter

@SumatraPeter OK, the current behaviour does appear unusual in that spaces are sometimes ignored: " : " is the same as " :", ": " and ":", so on its own (or not) the : is the sole search character. In context, however, "another: word" is only the same as "another:word", while "another :word" is the same as "another : word", which possibly has some logic in Unicode terms but not to this human :-)

GitHubRulesOK avatar Nov 16 '19 21:11 GitHubRulesOK

Note: Quotes are not to be typed.

  1. "another:word" (no space before and after colon)
  2. "another: word" (space only after colon)
  3. "another :word" (space only before colon)
  4. "another : word" (space before and after colon)

"other:wo" finds 1 (correct) and 2 (incorrect)
"other: wo" finds 2 (correct) and 1 (incorrect)
"other :wo" finds 3 (correct) and 4 (incorrect)
"other : wo" finds 4 (correct) and 3 (incorrect)

In all cases you can see that the space before the colon is significant and taken into account, but the space after the colon is simply ignored while searching.

As for why Sumatra is ignoring both spaces in " : ", that's a further mystery! 🙄

SumatraPeter avatar Nov 16 '19 21:11 SumatraPeter

@SumatraPeter My findings are identical: the search reports two valid hits at a time rather than just the one expected. It makes no difference whether match case is set, since the colon and spaces are case-less characters. As you say, why are both spaces ignored in a 3-character string but only one space in a longer string? As I see it, the apparent logic is that a space is always ignored after a punctuation character. You can observe exactly the same behaviour if the character is a comma and, without testing, I guess a full stop.
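The hypothesis above (whitespace after a punctuation character is ignored) can be sketched as a toy model. This is only an illustration of the observed behaviour, not SumatraPDF's actual search code; the names `normalize`, `matching_samples` and `SAMPLES` are made up for the example, and the punctuation set is an assumption. It does not attempt to explain why a bare " : " matches everything.

```python
import re

# Hypothetical model: whitespace directly after punctuation is dropped
# from both the query and the page text before comparing.
# The punctuation set (colon, comma, full stop) is an assumption.
_SPACE_AFTER_PUNCT = re.compile(r"([:,.])\s+")


def normalize(s: str) -> str:
    """Remove any run of whitespace that directly follows punctuation."""
    return _SPACE_AFTER_PUNCT.sub(r"\1", s)


def matching_samples(query: str, samples: dict) -> list:
    """Return the sample numbers whose normalized text contains the query."""
    q = normalize(query)
    return [n for n, text in samples.items() if q in normalize(text)]


# The four sample strings used in this thread.
SAMPLES = {
    1: "another:word",
    2: "another: word",
    3: "another :word",
    4: "another : word",
}

for query in ("other:wo", "other: wo", "other :wo", "other : wo"):
    print(repr(query), "->", matching_samples(query, SAMPLES))
```

Under this model the first two queries match samples 1 and 2 and the last two match samples 3 and 4, which is the pairing reported above.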

I have some niggling recollection there was an issue searching for spaced words in Chinese but that could be a red herring. It could equally be a $ 64, 000, 000. 00 Question

GitHubRulesOK avatar Nov 16 '19 21:11 GitHubRulesOK

Either way, no document to test, no progress can be made.

kjk avatar Nov 17 '19 23:11 kjk

Also, we currently have custom text search logic. MuPDF added search functionality, so I'm likely to switch that code to MuPDF as well.

kjk avatar Nov 17 '19 23:11 kjk

> Either way, no document to test, no progress can be made.

That's very easily solved. 🙂

Perhaps it might be better to leave this open as a reminder till it is (hopefully) fixed by the new MuPDF search code?

SumatraPeter avatar Nov 18 '19 00:11 SumatraPeter

You can also put 2, 20, or more spaces after the colon; all of them are ignored.

Thinkpositiv avatar Nov 19 '19 21:11 Thinkpositiv

Today I posted an issue (#3948) describing in essence the same problem as Thinkpositiv has done here in 995. (My issue was closed. I am sorry for having overlooked 995.)

Reading this thread, I do not really understand what the current situation is. Is it still planned to fix the issue? Something was mentioned about switching the search code to MuPDF. Or would the developer still need a document to test, as he wrote above? If this is the case, I could provide such a document showing that there is no distinction between "other:word" (without space) and "other: word" (with space) in the search results.

Thanks.

Peter-202122 avatar Dec 10 '23 20:12 Peter-202122

@Peter-202122 The core problem with searching text in a PDF is that the text is not what is visible; what you see is just the coloured ink produced from the binary data. It would be difficult for an extractor (reader) to keep writing out the vectors, so a font is used and the numbers are converted into font positions. That is again hard to visualise, so let's turn the binary into human-readable numbers, which can be done in any one of many ways. Here is the sample as text:

q 0 0 0 rg BT 56.8 704.2 Td /F1 10 Tf<0102030405060708090A0B0C0D070B0E04030F060703101105120A03130A14070B0A0305060E030514080A0B03120715070616>Tj
ET Q
q 0 0 0 rg BT 56.8 670.2 Td /F1 10 Tf<1702030405060708090A0B0C030D070B0E04030F101105120A0307061518030514080A0B03120715070616>Tj
ET Q
q 0 0 0 rg BT 56.8 636.1 Td /F1 10 Tf<1902030405060708090A0B030C0D070B0E04030F101105120A030706151803130A14070B0A03120715070616>Tj
ET Q
q 0 0 0 rg BT 56.8 602.1 Td /F1 10 Tf<1A02030405060708090A0B030C030D070B0E04030F101105120A03130A14070B0A0305060E030514080A0B03120715070616>Tj
ET Q Q 
endstream
endobj

So in the above rendering 1 is 01, . is 02 and a space is 03. That is easy to see in the embedded font table, where <01> = <0031>, coincidentally the ANSI code number for the character 1, and <02> = <002E> is equivalent to .

/CMapName/Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
26 beginbfchar
<01> <0031>
<02> <002E>
<03> <0020>
<04> <0022>
<05> <0061>
<06> <006E>
<07> <006F>
<08> <0074>
<09> <0068>
<0A> <0065>
<0B> <0072>
<0C> <003A>
<0D> <0077>
<0E> <0064>
<0F> <0028>
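
As a rough illustration, the hex strings from the content stream can be decoded with the mappings listed above. This is a minimal sketch, not how MuPDF or SumatraPDF does it: `BFCHAR` and `decode_hex_string` are names invented for the example, only the 15 mappings visible above are included, and any code without a listed mapping is shown as `?`.

```python
# Only the mappings visible in the bfchar listing in this thread.
# Codes without a listed mapping are rendered as "?" rather than guessed.
BFCHAR = {
    0x01: "1", 0x02: ".", 0x03: " ", 0x04: '"', 0x05: "a",
    0x06: "n", 0x07: "o", 0x08: "t", 0x09: "h", 0x0A: "e",
    0x0B: "r", 0x0C: ":", 0x0D: "w", 0x0E: "d", 0x0F: "(",
}


def decode_hex_string(hex_string: str, table=BFCHAR, unknown: str = "?") -> str:
    """Decode a PDF hex string of single-byte codes using a code-to-text table."""
    codes = bytes.fromhex(hex_string)  # the codespace is <00>..<FF>, one byte per code
    return "".join(table.get(code, unknown) for code in codes)


# Leading part of each of the four Tj operands from the content stream.
LINE_PREFIXES = [
    "0102030405060708090A0B0C0D070B0E04",
    "1702030405060708090A0B0C030D070B0E04",
    "1902030405060708090A0B030C0D070B0E04",
    "1A02030405060708090A0B030C030D070B0E04",
]

for prefix in LINE_PREFIXES:
    print(decode_hex_string(prefix))
```

The only difference between the four lines is where code 03 (the space) sits relative to code 0C (the colon).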

We can also use an editor to replace those with the raw binary values for the ANSI characters, like this, and it makes more sense (except I got the l wrong by typing I). Now the codes have shifted from hex <##> to \octal, but the map is still the same: every single character is simply a binary number (or letter).

stream
/GS0 gs 1 0 0 1 56.799999 712.52002 cm BT 0 g 0 Tc 0 Tw 100 Tz 0 Tr/F0 10 Tf 0 -8.33 Td
(1. "another:word" \(no space before and after coIon\))Tj
ET /GS0 gs 1 0 0 1 -56.799999 -712.52002 cm BT
0 0 0 rg 0 Tc 0 Tw 100 Tz/F1 1 Tf 10 0 0 10 56.799999 670.200012 Tm
(\027\002\003\004\005\006\007\010\t\n\013\f\003\r\007\013\016\004\003\017\020\021\005\022\n\003\007\006\025\030\003\005\024\010\n\013\003\022\007\025\007\006\026)Tj
0 -3.41 Td
(\031\002\003\004\005\006\007\010\t\n\013\003\f\r\007\013\016\004\003\017\020\021\005\022\n\003\007\006\025\030\003\023\n\024\007\013\n\003\022\007\025\007\006\026)Tj
0 -3.4 Td
(\032\002\003\004\005\006\007\010\t\n\013\003\f\003\r\007\013\016\004\003\017\020\021\005\022\n\003\023\n\024\007\013\n\003\005\006\016\003\005\024\010\n\013\003\022\007\025\007\006\026)Tj
ET
endstream
endobj
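
To show that the \octal form carries the same codes as the hex form, here is a minimal sketch of decoding the escapes in a PDF literal string. The function name `decode_literal_string` is made up for the example; it assumes ASCII input and skips details such as backslash line continuations.

```python
# Single-character escapes allowed inside a PDF literal string "( ... )".
_SIMPLE_ESCAPES = {
    "n": 0x0A, "r": 0x0D, "t": 0x09, "b": 0x08, "f": 0x0C,
    "(": 0x28, ")": 0x29, "\\": 0x5C,
}
_OCTAL_DIGITS = "01234567"


def decode_literal_string(body: str) -> bytes:
    """Turn the text between the parentheses of a literal string into raw bytes.

    Simplified sketch: assumes ASCII input and ignores line continuations.
    """
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ord(ch))
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(body):
            break
        ch = body[i]
        if ch in _OCTAL_DIGITS:
            # An octal escape has one to three digits.
            j = i
            while j < len(body) and j < i + 3 and body[j] in _OCTAL_DIGITS:
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        else:
            # Known escape letter, otherwise the character itself.
            out.append(_SIMPLE_ESCAPES.get(ch, ord(ch)))
            i += 1
    return bytes(out)


# Start of the second sample line as it appears in the octal form above.
sample = r"\027\002\003\004\005\006\007\010\t\n\013\f\003\r"
print(decode_literal_string(sample).hex())
```

The printed hex digits are the same codes as the start of the second hex string in the first listing, so both forms describe the identical sequence of font positions.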

So the problem is that many characters fall outside the normal range of letters. Oddly, the space character is one of those considered a control character within a PDF, so in effect it's discounted, and heuristics need to be used for word spacing.

This is why space: or :space will simply search for just : and thus all :w are equal. However, for whatever reason, MuPDF shows a distinction between r: and r :, leading to the confusion.

GitHubRulesOK avatar Dec 10 '23 23:12 GitHubRulesOK

@GitHubRulesOK Thank you very much for taking the time to explain the background of the issue to me. So I see now that for technical reasons it is not so easy (or even impossible?) to change the current behaviour. That's a pity.

Nevertheless I will stay with Sumatra as my favorite PDF-reader, especially because its speed (in starting, in searching etc.) is excellent.

Peter-202122 avatar Dec 11 '23 06:12 Peter-202122