apertium-apy
Chained translation accumulates unknown word marks
e.g.
meow (en->es) *meow
*meow (es->fr) **meow
so
meow (en->fr) **meow
instead of
meow (en->fr) *meow
@unhammer ideas (aside from manually removing the marks)?
I think we'll just have to regex them away (or, into one), like the "remove error marks" thing already does.
(Ideally, lt-proc could switch on the fly between marking and non-marking using some stream signal. I'd rather not start separate pipelines with and without marks; that sounds like more complex if-then's and more memory usage.)
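For illustration, the "regex them into one" idea could look roughly like the sketch below. This is not the code from APy or from any commit mentioned in this thread; the function name `collapse_unknown_marks` and the pattern are hypothetical, and it assumes the only mark that accumulates is the asterisk placed directly before a word.

```python
import re

# Hypothetical pattern (an assumption, not APy's actual regex):
# two or more asterisks immediately followed by a word character.
REPEATED_UNKNOWN_MARK = re.compile(r'\*{2,}(?=\w)')


def collapse_unknown_marks(text):
    """Collapse runs of unknown-word marks before a word into a single '*'."""
    return REPEATED_UNKNOWN_MARK.sub('*', text)


if __name__ == '__main__':
    # Output of a chained en->es->fr translation of an unknown word.
    print(collapse_unknown_marks('**meow'))
```

The lookahead keeps the sketch from touching asterisks that are not attached to a word, so a literal `**` standing alone in the text is left unchanged. Running this once on the final output of the chain would be enough; it does not need to run after every intermediate step.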
@shardulc, could you take care of this? The regex for error marks is already in the APy code iirc.
44783f93 fixes this, and is only a three-line change after #43 is merged. Not opening a PR right now because all previous commits for chained translation show up too.
@shardulc there are unknown word marks other than *, such as #. There should be a more comprehensive regex floating around somewhere in APy or html-tools.
@sushain97 I took the one in that commit directly from here, which only has the asterisks. Is a different regex used anywhere?
Hm... perhaps not.
https://github.com/goavki/streamparser/blob/master/streamparser.py#L28-L38
In released pairs, we shouldn't have # (if the language data was completely testvoc'd), so I don't think we should worry about those.
Can I work on this issue?
@SAP-20, you don't have to ask for that. If you want to fix it, just fix it and submit a PR.