Inconsistent tokenization

Open MrLogarithm opened this issue 5 years ago • 1 comments

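In the ED IIIb data from Girsu, the tokenization is not consistent: the same sequence of signs is hyphenated into one word in some texts and split into separate words in others. Examples include: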

  • udu nita (P221436) vs. udu-nita (P010556)
  • ugula ki-siki-ka (P221485) vs. ugula ki siki-ka (P221319)
  • ziz2-bala-bi (P020272) vs. ziz2 bala-bi (P355602)
  • lu2 esz2 gid2 (P247610) vs. lu2 esz2-gid2 (P221317) vs. lu2-esz2-gid2 (P217545)
  • bar-bi gal2-me (P221708) vs. bar-bi-gal2-me (P221331)
  • lu2 a kum2 (P221716) vs. lu2-a-kum2 (P221333) vs. lu2 a-kum2 (P221451)
  • lu2 e2-sza3-ga-me (P020184) vs. lu2-e2-sza-ga-me (P227557)
  • ki-siki-ka me (P221316) vs. ki-siki-ka-me (P221317) vs. ki siki-ka-me (P221319)

A shell script could probably enumerate more examples.
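
As a rough illustration of how such an enumeration might work, here is a minimal Python sketch (Python rather than shell, purely for readability). It assumes the transliteration lines have already been extracted from the ATF files, with line numbers and metadata stripped; the function name `find_inconsistent_tokenizations` and the `max_words` parameter are hypothetical and not part of any existing tool. The idea is to group every short run of words by its underlying sign sequence, ignoring whether signs are joined by a space or a hyphen, and to report the sequences that occur with more than one spelling.

```python
from collections import defaultdict


def find_inconsistent_tokenizations(lines, max_words=3):
    """Return sign sequences that appear with more than one word division.

    `lines` is an iterable of plain transliteration strings (assumed to be
    pre-cleaned). `max_words` limits how many space-separated words are
    combined into one candidate span.
    """
    variants = defaultdict(set)
    for line in lines:
        words = line.split()
        for start in range(len(words)):
            last = min(start + max_words, len(words))
            for end in range(start + 1, last + 1):
                surface = " ".join(words[start:end])
                # The key ignores the space/hyphen distinction entirely,
                # so "udu nita" and "udu-nita" share the same key.
                key = tuple(surface.replace("-", " ").split())
                variants[key].add(surface)
    # Keep only sign sequences written in at least two different ways.
    return {k: sorted(v) for k, v in variants.items() if len(v) > 1}


if __name__ == "__main__":
    sample = [
        "udu nita",
        "udu-nita",
        "lu2 esz2 gid2",
        "lu2 esz2-gid2",
        "lu2-esz2-gid2",
    ]
    for signs, forms in find_inconsistent_tokenizations(sample).items():
        print("-".join(signs), "=>", " | ".join(forms))
```

A sketch like this only lists candidates. It also flags sub-spans of a longer variant (for example `esz2 gid2` vs. `esz2-gid2`), and it cannot say which spelling is correct.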

Is there a principled way to decide which tokenizations are correct and harmonize all of the spellings?

MrLogarithm avatar Jun 22 '20 15:06 MrLogarithm

Yes, an Assyriologist must look at both, make a decision, and update all the ATF. The bulk upload on the site is broken right now for some obscure reason, so I'll try to fix it soon and then we can proceed with harmonizing those. Thanks!

epageperron avatar Jun 22 '20 15:06 epageperron