readlex
Issue - double punctuation between words (em-dash usage)
The issue arises when an em-dash is used between two words and other punctuation marks are adjacent to the em-dash, i.e. where the part before the em-dash is an exclamation, a question, etc.
Merriam-Webster has examples showing this particular usage of the em-dash: https://www.merriam-webster.com/grammar/em-dash-en-dash-how-to-use
Here is an "OK" sample, where the em-dash is the only character (no other adjacent punctuation):
Within its first year, Mabel and Harry had sampled all of the bakery’s offerings—all 62 items—and had also decided that the exercise was worth repeating.
𐑢𐑦𐑞𐑦𐑯 𐑦𐑑𐑕 𐑓𐑻𐑕𐑑 𐑘𐑽, ·𐑥𐑱𐑚𐑩𐑤 𐑯 ·𐑣𐑨𐑮𐑦 𐑣𐑨𐑛 𐑕𐑭𐑥𐑐𐑩𐑤𐑛 𐑷𐑤 𐑝 𐑞 𐑚𐑱𐑒𐑼𐑦𐑟 𐑪𐑓𐑼𐑦𐑙𐑟—𐑷𐑤 62 𐑲𐑑𐑩𐑥𐑟—𐑯 𐑣𐑨𐑛 𐑷𐑤𐑕𐑴 𐑛𐑦𐑕𐑲𐑛𐑩𐑛 𐑞𐑨𐑑 𐑞 𐑧𐑒𐑕𐑼𐑕𐑲𐑟 𐑢𐑪𐑟 𐑢𐑻𐑔 𐑮𐑦𐑐𐑰𐑑𐑦𐑙.
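Here is a particularly bad sample from Alice, where the word before each "?—", "!—" or ")—" is left untransliterated (ladder, lad, bear, crash, that, chimney, it, then):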
She waited for some time without hearing anything more: at last came a rumbling of little cartwheels, and the sound of a good many voices all talking together: she made out the words: “Where’s the other ladder?—Why, I hadn’t to bring but one; Bill’s got the other—Bill! fetch it here, lad!—Here, put ’em up at this corner—No, tie ’em together first—they don’t reach half high enough yet—Oh! they’ll do well enough; don’t be particular—Here, Bill! catch hold of this rope—Will the roof bear?—Mind that loose slate—Oh, it’s coming down! Heads below!” (a loud crash)—“Now, who did that?—It was Bill, I fancy—Who’s to go down the chimney?—Nay, I shan’t! You do it!—That I won’t, then!—Bill’s to go down—Here, Bill! the master says you’re to go down the chimney!”
𐑖𐑰 𐑢𐑱𐑑𐑩𐑛 𐑓 𐑕𐑳𐑥 𐑑𐑲𐑥 𐑢𐑦𐑞𐑬𐑑 𐑣𐑽𐑦𐑙 𐑧𐑯𐑦𐑔𐑦𐑙 𐑥𐑹: 𐑨𐑑 𐑤𐑭𐑕𐑑 𐑒𐑱𐑥 𐑩 𐑮𐑳𐑥𐑚𐑤𐑦𐑙 𐑝 𐑤𐑦𐑑𐑩𐑤 𐑒𐑸𐑑𐑢𐑰𐑤𐑟, 𐑯 𐑞 𐑕𐑬𐑯𐑛 𐑝 𐑩 𐑜𐑫𐑛 𐑥𐑧𐑯𐑦 𐑝𐑶𐑕𐑩𐑟 𐑷𐑤 𐑑𐑷𐑒𐑦𐑙 𐑑𐑩𐑜𐑧𐑞𐑼: 𐑖𐑰 𐑥𐑱𐑛 𐑬𐑑 𐑞 𐑢𐑻𐑛𐑟: «𐑢𐑺𐑟 𐑞 𐑳𐑞𐑼 ladder?—𐑢𐑲, 𐑲 𐑣𐑨𐑛𐑩𐑯𐑑 𐑑 𐑚𐑮𐑦𐑙 𐑚𐑳𐑑 𐑢𐑳𐑯; ·𐑚𐑦𐑤𐑟 𐑜𐑪𐑑 𐑞 𐑳𐑞𐑼—·𐑚𐑦𐑤! 𐑓𐑧𐑗 𐑦𐑑 𐑣𐑽, lad!—𐑣𐑽, 𐑐𐑫𐑑 𐑩𐑥 𐑳𐑐 𐑨𐑑 𐑞𐑦𐑕 𐑒𐑹𐑯𐑼—𐑯𐑴, 𐑑𐑲 𐑩𐑥 𐑑𐑩𐑜𐑧𐑞𐑼 𐑓𐑻𐑕𐑑—𐑞𐑱 𐑛𐑴𐑯𐑑 𐑮𐑰𐑗 𐑣𐑭𐑓 𐑣𐑲 𐑦𐑯𐑳𐑓 𐑘𐑧𐑑—𐑴! 𐑞𐑱𐑤 𐑛𐑵 𐑢𐑧𐑤 𐑦𐑯𐑳𐑓; 𐑛𐑴𐑯𐑑 𐑚𐑰 𐑐𐑼𐑑𐑦𐑒𐑘𐑩𐑤𐑼—𐑣𐑽, ·𐑚𐑦𐑤! 𐑒𐑨𐑗 𐑣𐑴𐑤𐑛 𐑝 𐑞𐑦𐑕 𐑮𐑴𐑐—𐑢𐑦𐑤 𐑞 𐑮𐑵𐑓 bear?—𐑥𐑲𐑯𐑛 𐑞𐑨𐑑 𐑤𐑵𐑕 𐑕𐑤𐑱𐑑—𐑴, 𐑦𐑑𐑕 𐑒𐑳𐑥𐑦𐑙 𐑛𐑬𐑯! 𐑣𐑧𐑛𐑟 𐑚𐑦𐑤𐑴!» (𐑩 𐑤𐑬𐑛 crash)—»𐑯𐑬, 𐑣𐑵 𐑛𐑦𐑛 that?—𐑦𐑑 𐑢𐑪𐑟 ·𐑚𐑦𐑤, 𐑲 𐑓𐑨𐑯𐑕𐑦—𐑣𐑵𐑟 𐑑 𐑜𐑴 𐑛𐑬𐑯 𐑞 chimney?—𐑯𐑱, 𐑲 𐑖𐑭𐑯𐑑! 𐑿 𐑛𐑵 it!—𐑞𐑨𐑑 𐑲 𐑢𐑴𐑯𐑑, then!—·𐑚𐑦𐑤𐑟 𐑑 𐑜𐑴 𐑛𐑬𐑯—𐑣𐑽, ·𐑚𐑦𐑤! 𐑞 𐑥𐑭𐑕𐑑𐑼 𐑕𐑧𐑟 𐑿𐑼 𐑑 𐑜𐑴 𐑛𐑬𐑯 𐑞 𐑗𐑦𐑥𐑯𐑦!»
I am aware of the issue, but thank you for raising it. It's a problem with the underlying text tagging library (spaCy). I have tried developing workarounds, so far without success, but I'll look into it further.
Huh... I thought I had posted a suggested "dirty" fix, but I don't see it here in the chat. The workaround is to pad any em-dashes with spaces before conversion, then remove the padding after conversion. That said, this is simple enough to do to my own source text before running it through latin2shaw.
 def latin2shaw(text):
+    text = text.replace('—', ' — ')
     ...
+    text_shaw = text_shaw.replace(' — ', '—')
     return text_shaw
Confirmed that re-running latin2shaw with the above 2 lines added fixes it:
𐑖𐑰 𐑢𐑱𐑑𐑩𐑛 𐑓 𐑕𐑳𐑥 𐑑𐑲𐑥 𐑢𐑦𐑞𐑬𐑑 𐑣𐑽𐑦𐑙 𐑧𐑯𐑦𐑔𐑦𐑙 𐑥𐑹: 𐑨𐑑 𐑤𐑭𐑕𐑑 𐑒𐑱𐑥 𐑩 𐑮𐑳𐑥𐑚𐑤𐑦𐑙 𐑝 𐑤𐑦𐑑𐑩𐑤 𐑒𐑸𐑑𐑢𐑰𐑤𐑟, 𐑯 𐑞 𐑕𐑬𐑯𐑛 𐑝 𐑩 𐑜𐑫𐑛 𐑥𐑧𐑯𐑦 𐑝𐑶𐑕𐑩𐑟 𐑷𐑤 𐑑𐑷𐑒𐑦𐑙 𐑑𐑩𐑜𐑧𐑞𐑼: 𐑖𐑰 𐑥𐑱𐑛 𐑬𐑑 𐑞 𐑢𐑻𐑛𐑟: «𐑢𐑺𐑟 𐑞 𐑳𐑞𐑼 𐑤𐑨𐑛𐑼?—𐑢𐑲, 𐑲 𐑣𐑨𐑛𐑩𐑯𐑑 𐑑 𐑚𐑮𐑦𐑙 𐑚𐑳𐑑 𐑢𐑳𐑯; ·𐑚𐑦𐑤𐑟 𐑜𐑪𐑑 𐑞 𐑳𐑞𐑼—·𐑚𐑦𐑤! 𐑓𐑧𐑗 𐑦𐑑 𐑣𐑽, 𐑤𐑨𐑛!—𐑣𐑽, 𐑐𐑫𐑑 𐑩𐑥 𐑳𐑐 𐑨𐑑 𐑞𐑦𐑕 𐑒𐑹𐑯𐑼—𐑯𐑴, 𐑑𐑲 𐑩𐑥 𐑑𐑩𐑜𐑧𐑞𐑼 𐑓𐑻𐑕𐑑—𐑞𐑱 𐑛𐑴𐑯𐑑 𐑮𐑰𐑗 𐑣𐑭𐑓 𐑣𐑲 𐑦𐑯𐑳𐑓 𐑘𐑧𐑑—𐑴! 𐑞𐑱𐑤 𐑛𐑵 𐑢𐑧𐑤 𐑦𐑯𐑳𐑓; 𐑛𐑴𐑯𐑑 𐑚𐑰 𐑐𐑼𐑑𐑦𐑒𐑘𐑩𐑤𐑼—𐑣𐑽, ·𐑚𐑦𐑤! 𐑒𐑨𐑗 𐑣𐑴𐑤𐑛 𐑝 𐑞𐑦𐑕 𐑮𐑴𐑐—𐑢𐑦𐑤 𐑞 𐑮𐑵𐑓 𐑚𐑺?—𐑥𐑲𐑯𐑛 𐑞𐑨𐑑 𐑤𐑵𐑕 𐑕𐑤𐑱𐑑—𐑴, 𐑦𐑑𐑕 𐑒𐑳𐑥𐑦𐑙 𐑛𐑬𐑯! 𐑣𐑧𐑛𐑟 𐑚𐑦𐑤𐑴!» (𐑩 𐑤𐑬𐑛 𐑒𐑮𐑨𐑖)—«𐑯𐑬, 𐑣𐑵 𐑛𐑦𐑛 𐑞𐑨𐑑?—𐑦𐑑 𐑢𐑪𐑟 ·𐑚𐑦𐑤, 𐑲 𐑓𐑨𐑯𐑕𐑦—𐑣𐑵𐑟 𐑑 𐑜𐑴 𐑛𐑬𐑯 𐑞 𐑗𐑦𐑥𐑯𐑦?—𐑯𐑱, 𐑲 𐑖𐑭𐑯𐑑! 𐑿 𐑛𐑵 𐑦𐑑!—𐑞𐑨𐑑 𐑲 𐑢𐑴𐑯𐑑, 𐑞𐑧𐑯!—·𐑚𐑦𐑤𐑟 𐑑 𐑜𐑴 𐑛𐑬𐑯—𐑣𐑽, ·𐑚𐑦𐑤! 𐑞 𐑥𐑭𐑕𐑑𐑼 𐑕𐑧𐑟 𐑿𐑼 𐑑 𐑜𐑴 𐑛𐑬𐑯 𐑞 𐑗𐑦𐑥𐑯𐑦!»
Yes, this should work, but I think it will mean that if the original text had spaces on either side of the dashes, those spaces will be removed as well. I believe there is a way to tell the spaCy tagger to recognise a dash next to punctuation as punctuation, but so far I have struggled to make it work. That is what I was planning to look into again. If that fails, I'll add your suggested workaround.
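On the concern about losing spaces that were already in the original: one possible variation of the workaround is to remember the whitespace around each em-dash before padding, then put it back in the same order afterwards. This is only a rough sketch and not something from the readlex code. The function name convert_with_padded_dashes and the convert parameter are made up for illustration (convert stands in for latin2shaw), and it assumes the converter keeps every em-dash, in the same order.

```python
import re

# An em-dash together with any spaces or tabs directly around it.
_DASH = re.compile(r"[ \t]*—[ \t]*")


def convert_with_padded_dashes(text, convert):
    """Pad em-dashes for conversion, then restore their original spacing.

    `convert` is a hypothetical stand-in for latin2shaw: any function
    that takes a string and returns the converted string.
    """
    # Record each em-dash exactly as it was written ("—", " — ", ...).
    originals = _DASH.findall(text)

    # Give every em-dash exactly one space on each side.
    padded = _DASH.sub(" — ", text)
    converted = convert(padded)

    # Assumption: the converter neither adds, drops nor reorders em-dashes.
    # If that does not hold, fall back to the plain un-padding workaround.
    if len(_DASH.findall(converted)) != len(originals):
        return converted.replace(" — ", "—")

    # Swap each padded dash back to its original form, in order.
    restored = iter(originals)
    return _DASH.sub(lambda match: next(restored), converted)
```

Called as convert_with_padded_dashes(text, latin2shaw), this should leave "offerings — all" spaced and "ladder?—Why" unspaced, provided the assumption about em-dash order holds.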