formatted names not recognized
Hi again,
As I use your amazing tool more and more, I ran into the following issue:
When names are formatted, they do not get recognized.
Here are the different inputs: input1.txt input2.txt input3.txt
and resulting outputs:
1
{ "metadata": { "date": "2020-06-02T12:34:06.580524+02:00", "gnfinderVersion": "v0.11.0", "withBayes": true, "tokensAround": 0, "language": "eng", "detectLanguage": false, "totalWords": 3, "totalCandidates": 1, "totalNames": 0 }, "names": null }
2
{ "metadata": { "date": "2020-06-02T12:35:54.584728+02:00", "gnfinderVersion": "v0.11.0", "withBayes": true, "tokensAround": 0, "language": "eng", "detectLanguage": false, "totalWords": 3, "totalCandidates": 1, "totalNames": 0 }, "names": null }
3
{
  "metadata": {
    "date": "2020-06-02T12:36:02.972624+02:00",
    "gnfinderVersion": "v0.11.0",
    "withBayes": true,
    "tokensAround": 0,
    "language": "eng",
    "detectLanguage": false,
    "totalWords": 3,
    "totalCandidates": 2,
    "totalNames": 1
  },
  "names": [
    {
      "cardinality": 2,
      "verbatim": "Zea mays",
      "name": "Zea mays",
      "odds": 1.0719384060700208,
      "start": 0,
      "end": 8,
      "annotationNomenType": "NO_ANNOT",
      "annotation": "",
      "verification": {
        "bestResult": {
          "dataSourceId": 1,
          "dataSourceTitle": "Catalogue of Life",
          "taxonId": "42981044",
          "matchedName": "Zea mays L.",
          "matchedCardinality": 2,
          "matchedCanonicalSimple": "Zea mays",
          "matchedCanonicalFull": "Zea mays",
          "classificationPath": "Plantae|Tracheophyta|Liliopsida|Poales|Poaceae|Zea|Zea mays",
          "classificationRank": "kingdom|phylum|class|order|family|genus|species",
          "classificationIds": "54767868|54767869|54770228|54770238|54770244|55061565|42981044",
          "matchType": "ExactCanonicalMatch"
        },
        "preferredResults": [
          {
            "dataSourceId": 1,
            "dataSourceTitle": "Catalogue of Life",
            "taxonId": "42981044",
            "matchedName": "Zea mays L.",
            "matchedCardinality": 2,
            "matchedCanonicalSimple": "Zea mays",
            "matchedCanonicalFull": "Zea mays",
            "classificationPath": "Plantae|Tracheophyta|Liliopsida|Poales|Poaceae|Zea|Zea mays",
            "classificationRank": "kingdom|phylum|class|order|family|genus|species",
            "classificationIds": "54767868|54767869|54770228|54770238|54770244|55061565|42981044",
            "matchType": "ExactCanonicalMatch"
          },
          {
            "dataSourceId": 11,
            "dataSourceTitle": "GBIF Backbone Taxonomy",
            "taxonId": "5290052",
            "matchedName": "Zea mays L.",
            "matchedCardinality": 2,
            "matchedCanonicalSimple": "Zea mays",
            "matchedCanonicalFull": "Zea mays",
            "classificationPath": "Plantae|Tracheophyta|Liliopsida|Poales|Poaceae|Zea|Zea mays",
            "classificationRank": "kingdom|phylum|class|order|family|genus|species",
            "classificationIds": "6|7707728|196|1369|3073|2705049|5290052",
            "matchType": "ExactCanonicalMatch"
          }
        ],
        "dataSourcesNum": 25,
        "dataSourceQuality": "HasCuratedSources",
        "retries": 1
      }
    }
  ]
}
Do you think it is easily doable to recognize them?
Otherwise I'll have to find a way of stripping the <i> </i> tags and so on before submitting the text to gnfinder.
1. [<i>Zea mays</i> Linné]
2. <i>Zea mays</i> Linné
3. Zea mays Linné
1 and 2 are not found, while 3 is found.
Hm, this is a grey area to me. I see gnfinder as a tool that finds names in plain text; other types of text need to be converted to plain text before use.
For example, it definitely does not support PDF, MS Doc, Excel spreadsheets, etc. Following this logic, XML, HTML, and JSON are marked-up texts and need to be converted to plain text first.
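In case it helps, here is a minimal sketch of that conversion step for simple HTML-style markup, using only the Python standard library. It is a preprocessing idea, not something gnfinder does itself; `strip_markup` and `_TextExtractor` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of a marked-up string."""

    def __init__(self):
        # convert_charrefs=True also decodes entities such as &eacute;
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(text: str) -> str:
    """Return `text` with tags such as <i>...</i> removed (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


if __name__ == "__main__":
    for sample in ("[<i>Zea mays</i> Linné]", "<i>Zea mays</i> Linné"):
        print(strip_markup(sample))
```

The stripped string can then be passed to gnfinder as plain text. Note that the "start" and "end" offsets in the output would then refer to the stripped text, not to the original marked-up input.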
Hmmm ok...sad...
I thought rich text would have been ok...my bad then
Thank you for your answer!
On the other hand, <i> tags in biological texts often indicate scientific names, so they might be a good thing to support.
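To illustrate the idea, here is a rough sketch that pulls out the text inside <i> tags so it could be treated as a hint for candidate names. This is only an assumption about how such support might look, not how gnfinder works; `italic_spans` and `_ItalicCollector` are hypothetical names.

```python
from html.parser import HTMLParser


class _ItalicCollector(HTMLParser):
    """Collects the text found inside <i>...</i> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # nesting level of open <i> tags
        self.current = []   # text pieces of the span being read
        self.spans = []     # finished italic spans

    def handle_starttag(self, tag, attrs):
        if tag == "i":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "i" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = "".join(self.current).strip()
                if text:
                    self.spans.append(text)
                self.current = []

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)


def italic_spans(text: str) -> list:
    """Return the text of every <i> element in `text` (hypothetical helper)."""
    collector = _ItalicCollector()
    collector.feed(text)
    collector.close()
    return collector.spans


if __name__ == "__main__":
    print(italic_spans("[<i>Zea mays</i> Linné]"))
```

Italics are also used for emphasis and other purposes, so these spans would at most be candidates that still need to go through the usual name detection.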
Hi @dimus! Trying to clean all my old issues... has this been somehow addressed with all the work you did lately?
Shall I keep it open?
Yes, please keep it open. I did not get to it yet; I was focused on gnverifier for a while. I do want to find a good solution for this.