Greg Tatum

Results 204 issues of Greg Tatum

I'm checking out this rule, and found some entries that were discarded which seemed valid to me. Mostly punctuation seems to be getting in the way. | English | Spanish...

bug

I don't see a way to cancel it. Even stopping / starting opuscleaner leaves it in the dimmed-out state. I'm assuming I'll have to go in and try to...

It would be nice to see how much time is left on these bigger datasets.

In the data filter tab, it would be nice to normalize the language tags to something like the BCP 47 language tag standard.
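As a rough illustration of what that normalization could look like, here is a minimal sketch that only applies the casing conventions recommended by BCP 47 (lowercase language, title-case script, uppercase region) and converts underscores to hyphens. The function name `normalize_language_tag` is hypothetical and not part of opuscleaner; the sketch does not validate subtags against the registry and ignores extension and private-use rules.

```python
def normalize_language_tag(tag: str) -> str:
    """Apply BCP 47 casing conventions to a language tag (hypothetical helper).

    This only fixes separators and casing; it does not check that the
    subtags are registered, and it does not special-case extensions.
    """
    subtags = tag.strip().replace("_", "-").split("-")
    # The primary language subtag is conventionally lowercase.
    normalized = [subtags[0].lower()]
    for subtag in subtags[1:]:
        if len(subtag) == 4 and subtag.isalpha():
            # Four-letter subtags are scripts, e.g. "Hans".
            normalized.append(subtag.title())
        elif len(subtag) == 2 and subtag.isalpha():
            # Two-letter subtags are regions, e.g. "US".
            normalized.append(subtag.upper())
        else:
            normalized.append(subtag.lower())
    return "-".join(normalized)


examples = ["EN-us", "zh_hans_cn", "es"]
normalized_examples = [normalize_language_tag(tag) for tag in examples]
```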

I'm modifying this bug to be a bit more subtle, which is to ensure that every components::Bag returns a result. The appendItems support is one way of doing that, but...

help wanted
T-core
C-datetime
S-medium

The alignments are just passed through. For them to be valid for use with guided alignments, they will also need to use a custom tokenizer. From the class docs: >...

The merge sentences modifier uses whitespace tokenization: https://github.com/hplt-project/OpusTrainer/blob/9ec77d3745823f9e05016700938e6b2ffbb770e0/src/opustrainer/modifiers/merge.py#L12-L17 It then counts the tokens to offset the alignments: https://github.com/hplt-project/OpusTrainer/blob/9ec77d3745823f9e05016700938e6b2ffbb770e0/src/opustrainer/modifiers/merge.py#L28-L31 However, for non-whitespace segmented languages, and for training...
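To make the offsetting concrete, here is a minimal sketch of merging sentence pairs with whitespace tokenization. It is not the code at the links above; `merge_pairs` and the `Alignment` alias are hypothetical names, and alignments are assumed to be lists of `(source_index, target_index)` pairs over whitespace tokens.

```python
from typing import List, Tuple

# Assumed representation: (source token index, target token index) pairs.
Alignment = List[Tuple[int, int]]


def merge_pairs(pairs: List[Tuple[str, str, Alignment]]) -> Tuple[str, str, Alignment]:
    """Concatenate sentence pairs, shifting each pair's alignment indices by
    the number of whitespace tokens that precede it on each side."""
    src_parts: List[str] = []
    trg_parts: List[str] = []
    merged: Alignment = []
    src_offset = 0
    trg_offset = 0
    for src, trg, alignment in pairs:
        merged.extend((s + src_offset, t + trg_offset) for s, t in alignment)
        src_parts.append(src)
        trg_parts.append(trg)
        # Token counts come from str.split(), i.e. whitespace tokenization.
        src_offset += len(src.split())
        trg_offset += len(trg.split())
    return " ".join(src_parts), " ".join(trg_parts), merged


merged_src, merged_trg, merged_alignment = merge_pairs([
    ("hello world", "hola mundo", [(0, 0), (1, 1)]),
    ("good day", "buen día", [(0, 0), (1, 1)]),
])
```

Because the offsets are whitespace token counts, the shifted indices only line up with the tokens the model sees if the model's tokenizer also splits on whitespace. For text without spaces, `str.split()` returns a single token regardless of sentence length.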

In Marian, invalid alignments lead to a crash, as the index bounds for tokens are not checked. This breaks training. Plus, if alignments are generated incorrectly on the OpusTrainer side,...
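One way to catch this before the data reaches training is a bounds check on each alignment pair. The sketch below is an assumption about what such a check could look like, not existing OpusTrainer or Marian code; `invalid_alignment_pairs` is a hypothetical helper that takes already-tokenized source and target sentences.

```python
from typing import List, Sequence, Tuple


def invalid_alignment_pairs(
    src_tokens: Sequence[str],
    trg_tokens: Sequence[str],
    alignment: Sequence[Tuple[int, int]],
) -> List[Tuple[int, int]]:
    """Return the alignment pairs whose indices fall outside either sentence."""
    return [
        (s, t)
        for s, t in alignment
        if not (0 <= s < len(src_tokens) and 0 <= t < len(trg_tokens))
    ]


# The target has one token, so (1, 1) is out of range on the target side,
# and (2, 0) is out of range on the source side.
bad_pairs = invalid_alignment_pairs(["a", "b"], ["x"], [(0, 0), (1, 1), (2, 0)])
```

A caller could drop the line, or raise an error, whenever the returned list is non-empty.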

There are no guarantees that the alignments are correct in the NoiseModifier. It generates random tokens through `get_random_unicode_words`, but these could be tokenized as combined words. For instance, if...

An example is here (id-en), where the task timed out. https://firefox-ci-tc.services.mozilla.com/tasks/NWpySVWkT2SVBY3AOkDaIQ