start/end position slippage
I may be misunderstanding the output of gnfinder online, but it seems that the positions in the Start/End columns sometimes slip. In the example below, Urtica has start-end coordinates 322-328 but should be 322-327; Ceratodon has 1305-1314 instead of 1307-1315, etc.
(Also, there is a � in the output, but that may be a problem on my side.)
Options selected:
- Freeform text
- Output format: TSV
- All occurrences
- Show Ambiguous Uninomials
- Verification data sources: GBIF, Index Fungorum, WFO
Input:
Substrate.Text
Ruderaal terrein.
Op Sambucus.
Naaldhouttak
Ranunculus repens.
Berkenbos op zand.
Op Larixstomp in Larixbos.
Op Larixstomp in Larixbos.
Ranunculus repens.
Op kale turf.
Op houtstrooisel (naaldhout).
in essenhakhoutbosje op grond.
On Crataegus.
Op hout.
Gemengd bos, dode Urtica stengels.
Oude begraafplaats, bodem.
Juniperus-struweel, Noord-mos (Dicranum scoparium, Deschampsia flexuosa).
Boomstronk, eik.
Oud, hoog opgaand, donker Douglassparrenbos, op rottende stronk.
On rather dry, humose sand along road through deciduous forest.
zandgebied ingeplant met loofboompjes.
Jeneverbesstruweel.
In der Laubstreu.
Open plek in een Juniperus-struweel, stuifzand begroeid met Dicranum scoparium.
Gemengd bos, op eikenhout.
Nat bosje, onder wilg in gras en blad samen met Hypholoma myosotis.
Eiken-beukenbos.
Op dode tak op de grond.
Open grasvegetatie op ± lemig zand.
Tussen afgevallen Fagusblad.
Hulstrijk eikenbos.
In pomis putridis et floribus emarcidis.
Loam pits.
Op bladstrooisel.
Regelmatig betreden, kortgrazige, mosrijke, schrale vegetatie op droge zand -op-leembodem.
In schrale, droge, zandige vlakke, soms bereden wegberm op zwak kalkhoudend zand (pH = ± 6.7), tussen Ceratodon en Polytrichum piliferum.
On Prunus.
Pinus sylvestris/nigra plantation.
Output:
Index	Verbatim	Name	Start	End
0	Sambucus.	Sambucus	38	47
1	Ranunculus repens.	Ranunculus repens	63	81
2	Ranunculus repens.	Ranunculus repens	165	183
3	Crataegus.	Crataegus	282	292
4	Urtica	Urtica	322	328
5	(Dicranum scoparium,	Dicranum scoparium	410	430
6	Deschampsia flexuosa).	Deschampsia flexuosa	431	453
7	Dicranum scoparium.	Dicranum scoparium	760	779
8	Nat	Nat	815	818
9	Hypholoma myosotis.	Hypholoma myosotis	863	882
10	Ceratodon	Ceratodon	1305	1314
11	Polytrichum piliferum.	Polytrichum piliferum	1318	1340
12	Prunus.	Prunus	1345	1352
13	Pinus sylvestris/nigra	Pinus sylvestris�nigra	1354	1376
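There are two conventions for taking a slice of an array: the exclusive one, where the end index points one past the last element, so 'cat' is [0:3], and the inclusive one, where the end index points at the last element, so 'cat' is [0:2].
The exclusive convention is the most common, which is why the End offset is one character larger than you expected.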
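To make the two conventions concrete, here is a small Python sketch. The helper names are made up for illustration and are not part of gnfinder; the text is one line from the input above, so the offsets are relative to that line, not to the whole file.

```python
# One line of the input, used as a stand-alone string.
text = "Gemengd bos, dode Urtica stengels."
name = "Urtica"

start = text.index(name)            # 0-based position of the first character
end_exclusive = start + len(name)   # one past the last character
end_inclusive = end_exclusive - 1   # position of the last character


def slice_exclusive(s, start, end):
    """Cut s[start:end], where end is one past the last character."""
    return s[start:end]


def slice_inclusive(s, start, end):
    """Cut a span where end is the position of the last character."""
    return s[start:end + 1]


print(start, end_exclusive, end_inclusive)
print(slice_exclusive(text, start, end_exclusive))
print(slice_inclusive(text, start, end_inclusive))
```

With the exclusive convention, End minus Start equals the length of the verbatim string, which matches the Urtica row in the output (328 - 322 = 6).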
--- the rest is not directly relevant to your issue, but describes some issues I encountered when figuring out offsets
Do you know the encoding of the input? GNfinder converts text to UTF-8 and reports offsets against the converted text rather than the original input. That would be my first guess. The � appears when the conversion of a character to UTF-8 fails, or when the finder encounters a character that should never occur inside a name (in this case sylvestris/nigra was treated as one word). On top of the different encodings, there are also two ways to represent diacritics in UTF-8: as a single precomposed character, or as a base character followed by a combining mark. These are normalized to the single-character form as well.
There is an option in the gnfinder command-line app to return the normalized text that was used for name-finding. I think I need to add this option to the web UI as well.
There is another option where the offset is calculated in bytes instead of UTF-8 runes. It should probably be added to the web UI too.
Thanks for the fast answer!
It is good to be aware of the inclusive and exclusive methods; that explains some cases. Are they mixed, taxon by taxon, or is one method applied to the whole query text?
Hm, diacritic conversion can be an issue... The problem could also be in how text editors calculate positions (I checked in Notepad++). The encoding of my file is UTF-8, but the slippage remains and gets really noticeable on larger queries. For example, for the attached example, when using the start/end coordinates, the last capture becomes nosa and Co. instead of Corylus.
The reason I'm concerned with this at all is that I tried to convert start positions to line numbers and got slightly off results. Now I see that the source of my problem is probably that I counted in bytes, using dd:
cat "$START_POSITIONS" | parallel -j 8 --keep-order --tag '(dd if='$FILE_TO_PARSE' bs=1 count={1} 2>/dev/null; printf "\n") | wc -l' > "$OUTPUT"
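As an alternative to counting bytes with dd, the line number can be derived from character (rune) offsets directly. A minimal Python sketch, assuming the offsets are 0-based character positions in the same text that was submitted; both function names are hypothetical.

```python
from bisect import bisect_right


def line_starts(text):
    """Return the character offset at which each line begins."""
    starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            starts.append(i + 1)
    return starts


def offset_to_line(starts, offset):
    """Map a 0-based character offset to a 1-based line number."""
    return bisect_right(starts, offset)


# First lines of the input, with plain "\n" line endings for the example.
text = "Ruderaal terrein.\nOp Sambucus.\nNaaldhouttak\n"
starts = line_starts(text)
print(offset_to_line(starts, text.index("Sambucus")))
```

When reading a real file, opening it with `open(path, encoding="utf-8", newline="")` keeps any "\r\n" line endings intact, so the character positions stay aligned with the original text.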
If the slippage increases, it probably means that the offset in your editor is based on bytes, not on UTF-8 runes. I will add a byte offset option to the GUI (probably next week); it would help in this case. If you are able to use gnfinder on the command line, it already has a byte offset option:
# to get UTF-8 input and byte offsets
echo "Pardosa moesta is a spider" |gnfinder -b -i -f compact |jq
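The difference between rune and byte offsets can be sketched in Python as follows. The function name is hypothetical and the sample line is shortened from the input above. The '±' character takes two bytes in UTF-8, and the input has two of them before Ceratodon, which would be consistent with the two-position difference reported for that name, though that is only a guess.

```python
def rune_to_byte_offset(text, rune_offset):
    """Convert a character (rune) offset into a UTF-8 byte offset."""
    return len(text[:rune_offset].encode("utf-8"))


text = "Op ± lemig zand, tussen Ceratodon."
data = text.encode("utf-8")

rune_start = text.index("Ceratodon")
byte_start = rune_to_byte_offset(text, rune_start)

# The byte offset runs one ahead because '±' occupies two bytes.
print(rune_start, byte_start)

# Slicing the raw bytes with the byte offset recovers the name.
print(data[byte_start:byte_start + len("Ceratodon")].decode("utf-8"))
```

Every non-ASCII character before a name adds one or more extra bytes, so the gap between the two kinds of offset grows with the length of the text.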
gnfinder always uses the exclusive method for subslices.