Name string doesn't parse when subgenus = Incertae sedis; also strange-looking Quality values
Raw data (unparsed): beulah-first-5000-name-strings-unparsed.csv
Modified GNParsed Data Set: beulath-taxonnames-gnparsed-first-5000-rows.txt
- added family column, value = Carabidae
- opened file in Notepad++
- changed CRLF line endings to UNIX (LF) (b/c upload to TW batch requires this)
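For anyone who wants to script the two manual edits above instead of using a text editor, here is a minimal sketch using only the Python standard library. The function name, the `family` header text, the comma delimiter and the sample column names are assumptions for illustration; adjust them to the actual file.

```python
import csv
import io


def add_family_and_normalize(text, family="Carabidae", delimiter=","):
    """Append a family column to delimited text and emit LF-only line endings."""
    # newline="" hands the raw line endings to the csv module, which accepts CRLF or LF.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    out = io.StringIO(newline="")
    # lineterminator="\n" forces UNIX (LF) endings on output.
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for i, row in enumerate(reader):
        if not row:
            continue  # skip blank lines
        # The first row is assumed to be the header.
        writer.writerow(row + (["family"] if i == 0 else [family]))
    return out.getvalue()


# Hypothetical two-column sample with Windows line endings.
sample = "Id,Verbatim\r\n1,Aus bus\r\n"
result = add_family_and_normalize(sample)
```

When reading from and writing to real files, open both with `newline=""` so Python does not translate the line endings a second time.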
Noticed
- the Quality values look strange? Maybe on import into Excel, I need to select a certain data type for this field?
- see also line 11 above, where the value `pseudoflavipes` appears changed to `pseudoflavipe0s` in the `CanonicalFull` column (also lines 116, 117). Don't know where that `0` comes from.
- see also Author Year: leading and trailing `0`. Not sure where they are coming from either.
- More `0` issues (and delimiters issue?), origin uncertain.
- Some names did not parse. (Not sure why). See screenshot next. Maybe because all these names have subgenus = `(Incertae sedis)` and GN doesn't recognize this value at this rank?
- In general, subgenus is missing from all parsed values.
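One way to find every row affected by the stray `0` without scrolling through a spreadsheet is to scan the canonical column for digits. This is only a heuristic sketch: it assumes canonical name forms should not contain digits, that the file is comma-delimited, and that the column is called `CanonicalFull` as in the screenshot. The function name and the sample rows are made up for illustration.

```python
import csv
import io
import re

DIGIT = re.compile(r"\d")


def rows_with_digits(text, column="CanonicalFull", delimiter=","):
    """Return (row_number, value) pairs for rows whose column contains a digit."""
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter)
    # Data rows are numbered from 2 so they line up with spreadsheet rows
    # (the header is row 1).
    return [
        (n, row[column])
        for n, row in enumerate(reader, start=2)
        if DIGIT.search(row.get(column) or "")
    ]


# Hypothetical sample: the first data row carries the stray 0.
sample = (
    "Verbatim,CanonicalFull\n"
    "Aus pseudoflavipes,Aus pseudoflavipe0s\n"
    "Aus bus,Aus bus\n"
)
suspicious = rows_with_digits(sample)
```

Running the same check on the file before and after it has been through the spreadsheet would show at which step the digits appear.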
Maybe in future?
- option to parse (further atomize) down to lowest rank provided
Thanks @debpaul, interesting
- Looks like I am missing the case where subgenus is `Incertae sedis`. I do agree that names like these should be parsed. I will make a separate issue about it.
- The strange results in quality are an artefact of postprocessing; it is impossible to get quality 10. The '0' in the middle of Canonical also seems to be a postprocessing problem. Try to run this name by itself in the parser.
- Subgenus is provided, just not in the CSV format. If you pick JSON format on the web UI, you will see the subgenus results.
@dimus thanks! I did note that on import to Excel, it asks about modifying or removing leading zeroes. Not sure why. I told it not to modify the data. I'll test again as you suggest.
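To look at the values exactly as they are stored in the file, with no spreadsheet import step in between, the file can be read as plain text. A minimal sketch, assuming a comma-delimited file; the function name and the sample columns and values are hypothetical. Python's `csv` module does no type conversion, so leading zeroes and other digits come back unchanged.

```python
import csv
import io


def read_as_text(text, delimiter=","):
    """Parse delimited text, keeping every cell as the exact string in the file."""
    # csv.DictReader returns every cell as a string, so "0012" stays "0012".
    return list(csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter))


# Hypothetical sample with a leading-zero value.
sample = "Id,Year,Quality\n0012,1850,1\n"
rows = read_as_text(sample)
print(rows[0]["Id"], rows[0]["Quality"])
```

If the values look fine here but not in the spreadsheet, the import step is the likely cause.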
This is what I get without preprocessing:
@debpaul can you also try LibreOffice? It consistently gives me better results than Excel.