Support for automatic elision

Open remusonline opened this issue 1 year ago • 1 comments

For some languages (e.g. French), automatic elision is not optional. For example, when using "je" + "aime", the system should transform the two words into "j'aime". However, French has many rules around elision, and I'm not sure how they could easily be integrated into this program. Another solution is to use grid elements like "j'", but they don't connect correctly to the following word, i.e. the output is "j' aime". Note that this is all about collecting cohesive text; I'm not sure how this would be implemented with pictograms in the message window.

remusonline, Oct 14 '24 15:10

Yes, we discussed this exact example when setting up the word forms options. It is the reason I opened issue #404, which would allow setting "j'" as a virtual prefix or the verb as a virtual suffix. My plan is:

1. 1st click on "je" => the verbs change to 1st person singular.
2. 2nd click on "je" => the button changes to "j'" => clicking on a verb appends it to "j'" without a space in between, for verbs that start with a vowel.

This requires the user to learn when to use "je" and when to use "j'", but some kind of automatic recognition doesn't work with the current system, or at least we haven't been able to think of another solution so far.
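To make the virtual-prefix idea concrete, here is a minimal sketch of how collected elements could be rendered so that "j'" attaches to the next word. The `Token` shape, the `gluesToNext` flag, and `renderTokens` are hypothetical names for illustration only; they are not taken from AsTeRICS Grid or from issue #404.

```typescript
// Hypothetical token shape: `gluesToNext` marks a virtual prefix such as "j'".
interface Token {
  text: string;
  gluesToNext?: boolean;
}

// Join collected tokens, omitting the space after any virtual prefix.
function renderTokens(tokens: Token[]): string {
  let out = "";
  tokens.forEach((token, i) => {
    out += token.text;
    const isLast = i === tokens.length - 1;
    if (!isLast && !token.gluesToNext) {
      out += " ";
    }
  });
  return out;
}

const sentence = renderTokens([
  { text: "j'", gluesToNext: true },
  { text: "aime" },
  { text: "le" },
  { text: "chocolat" },
]);
// sentence === "j'aime le chocolat"
```

With this approach the choice between "je" and "j'" stays with the user; the renderer only decides whether to emit a space.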

ms-mialingvo, Oct 14 '24 16:10

Ooh.

There are quick wins available here, but I think we can solve this in a very neat way. Bring on modern PWAs!

Use GiellaLT / HFST finite-state transducers, compiled to WASM for offline use in the Vue PWA. Yes. This is fun!

  • Each language pack would contain analyser/generator data (.hfstol / .pmhfst) + metadata.
  • A Web Worker runtime exposes analyse(surface), generate(lemma+tags), and join(prev, next), so tokens can be rendered with correct spacing/affixes.
  • Lazy-load packs per language; cache in IndexedDB.
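A rough sketch of what that surface could look like, purely as a stand-in. Everything here is an assumption: `LanguagePack`, `MorphRuntime`, `PackRegistry`, and `createRuntime` are hypothetical names, the tag strings are illustrative, and a plain lookup table replaces the compiled transducer data. No HFST or WASM call is made, and the in-memory `Map` only stands in for an IndexedDB-backed cache.

```typescript
// Hypothetical shape of a language pack. A real pack would hold compiled
// transducer data; here a plain lookup table stands in for it.
interface LanguagePack {
  lang: string;
  // analysis string -> surface form (tag names are illustrative only)
  forms: Record<string, string>;
  // full form -> elided form, e.g. "je" -> "j'"
  elisions: Record<string, string>;
}

// The three operations named in the list above (signatures are assumptions).
interface MorphRuntime {
  analyse(surface: string): string[];
  generate(analysis: string): string | undefined;
  join(prev: string, next: string): string;
}

// Loads a pack on first request and keeps it in memory afterwards.
class PackRegistry {
  private cache = new Map<string, LanguagePack>();
  private loader: (lang: string) => LanguagePack;
  loadCount = 0;

  constructor(loader: (lang: string) => LanguagePack) {
    this.loader = loader;
  }

  get(lang: string): LanguagePack {
    let pack = this.cache.get(lang);
    if (pack === undefined) {
      pack = this.loader(lang);
      this.loadCount += 1;
      this.cache.set(lang, pack);
    }
    return pack;
  }
}

function createRuntime(pack: LanguagePack): MorphRuntime {
  return {
    analyse(surface) {
      // Reverse lookup: every analysis whose surface form matches.
      return Object.keys(pack.forms).filter((a) => pack.forms[a] === surface);
    },
    generate(analysis) {
      return pack.forms[analysis];
    },
    join(prev, next) {
      // Toy rule: elide only if the pack lists `prev` and `next` starts with a vowel.
      const elided = pack.elisions[prev.toLowerCase()];
      const startsWithVowel = /^[aeiouyàâéèêëîïôùûœ]/i.test(next);
      return elided !== undefined && startsWithVowel ? elided + next : `${prev} ${next}`;
    },
  };
}

// Toy loader: returns the same tiny French table for any language code.
const registry = new PackRegistry((lang) => ({
  lang,
  forms: { "aimer+V+Ind+Prs+Sg1": "aime", "manger+V+Ind+Prs+Sg1": "mange" },
  elisions: { je: "j'" },
}));

const fr = createRuntime(registry.get("fr"));
registry.get("fr"); // second call is served from the cache
const generated = fr.generate("aimer+V+Ind+Prs+Sg1"); // "aime"
const joined = fr.join("je", generated ?? ""); // "j'aime"
```

In a real worker the loader would be asynchronous and the calls would go through `postMessage`; the synchronous version is only meant to show the division of labour between pack data and runtime.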

Roadmap for me:

1. Compile a demo analyser (e.g. French) to WASM and test je/j’ + verb generation.
2. Implement a token joiner pipeline that calls HFST to decide spacing/apostrophes.
3. Add packs for English/Spanish/German; support prefix/suffix grid elements.
4. Later: optional CG disambiguation for more accurate POS handling.
5. Provide an authoring/debug tool to inspect analyser outputs and join rules.
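For the token joiner step, something like the following could serve as a starting point. It is a sketch under assumptions: `JoinRule`, `JoinDecision`, `frenchRule`, and `joinTokens` are hypothetical names, and the vowel-based rule is a crude placeholder for the decision the transducer would make. It over-generalises (for example it knows nothing about h muet vs. h aspiré, or about exceptions such as "le onze"), which is the gap a proper analyser would close.

```typescript
// A join decision: the (possibly rewritten) previous word and the separator
// to emit before the next word.
interface JoinDecision {
  prev: string;
  separator: "" | " ";
}

// Pluggable rule, so a transducer-backed rule could replace the toy one.
type JoinRule = (prev: string, next: string) => JoinDecision;

// Toy table of French words that can elide (illustrative, not exhaustive).
const ELIDING: Record<string, string> = {
  je: "j'", le: "l'", la: "l'", de: "d'", ne: "n'",
  que: "qu'", me: "m'", te: "t'", se: "s'",
};

// Placeholder rule: elide before a vowel-initial word, keep a leading capital.
const frenchRule: JoinRule = (prev, next) => {
  const elided = ELIDING[prev.toLowerCase()];
  if (elided !== undefined && /^[aeiouyàâéèêëîïôùûœ]/i.test(next)) {
    const startsUpper = prev.charAt(0) !== prev.charAt(0).toLowerCase();
    const cased = startsUpper
      ? elided.charAt(0).toUpperCase() + elided.slice(1)
      : elided;
    return { prev: cased, separator: "" };
  }
  return { prev, separator: " " };
};

// Walk the word list pairwise and let the rule decide each boundary.
function joinTokens(words: string[], rule: JoinRule): string {
  let out = "";
  words.forEach((word, i) => {
    const next = words[i + 1];
    if (next === undefined) {
      out += word; // last word: nothing to join with
      return;
    }
    const decision = rule(word, next);
    out += decision.prev + decision.separator;
  });
  return out;
}

const rendered = joinTokens(["je", "aime", "le", "chocolat"], frenchRule);
// rendered === "j'aime le chocolat"
```

Passing the rule in as a parameter keeps the pipeline language-agnostic: other packs would supply their own rule, and a language without elision could return a plain space for every pair.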

I would really plan to do this as a standalone repo. It would be neat. There's a paper in this.

willwade, Aug 18 '25 21:08