Greg Tatum issues

Results: 204 issues by Greg Tatum

num_mismatch discards some useful entries

I'm checking out this rule, and found some entries that were discarded which seemed valid to me. Mostly punctuation seems to be getting in the way. | English | Spanish...

bug

Cutting off internet during download leaves the download in a broken state.

I don't see that I can cancel it. Even stopping / starting opuscleaner leaves it in the dimmed out state. I'm assuming I'll have to go in and try to...

Show the download percentage in the UI

It would be nice to see how much time is left on these bigger datasets.
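A minimal sketch of the arithmetic behind such a progress display, assuming the downloader can report bytes received, total bytes, and elapsed time. The function name `download_progress` and its parameters are hypothetical and not part of OpusCleaner.

```python
from typing import Optional, Tuple


def download_progress(
    bytes_done: int, bytes_total: int, elapsed_seconds: float
) -> Tuple[float, Optional[float]]:
    """Return (percent complete, estimated seconds remaining).

    Hypothetical helper: the estimate assumes the average transfer
    rate so far stays constant for the rest of the download.
    """
    if bytes_total <= 0:
        raise ValueError("bytes_total must be positive")
    percent = 100.0 * bytes_done / bytes_total
    if bytes_done <= 0 or elapsed_seconds <= 0:
        # Nothing received yet, so no rate can be estimated.
        return percent, None
    rate = bytes_done / elapsed_seconds  # average bytes per second
    remaining_seconds = (bytes_total - bytes_done) / rate
    return percent, remaining_seconds


percent, eta = download_progress(250, 1000, 10.0)
```

A real UI would probably smooth the rate over a recent window rather than the whole download, since the average since the start reacts slowly to speed changes.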

Normalize language tags

In the data filter tab, it would be nice to normalize the language tags to something like the BCP 47 language tag standard.
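As a rough illustration of what such normalization could look like, the sketch below applies only the casing and separator conventions recommended for BCP 47 tags (lowercase language, title-case script, uppercase region). The function name `normalize_language_tag` is hypothetical, and the sketch does not validate subtags against the IANA registry or map legacy codes.

```python
def normalize_language_tag(tag: str) -> str:
    """Apply BCP 47 casing conventions to a language tag.

    Hypothetical helper: it fixes separators and casing only and
    does not check that the subtags are registered.
    """
    subtags = tag.strip().replace("_", "-").split("-")
    normalized = []
    after_singleton = False
    for index, subtag in enumerate(subtags):
        if index == 0 or after_singleton:
            # Primary language and everything after a singleton
            # (such as "u" or "x") stay lowercase.
            normalized.append(subtag.lower())
        elif len(subtag) == 4 and subtag.isalpha():
            normalized.append(subtag.title())  # script, e.g. "Hans"
        elif len(subtag) == 2 and subtag.isalpha():
            normalized.append(subtag.upper())  # region, e.g. "US"
        else:
            normalized.append(subtag.lower())
        if len(subtag) == 1:
            after_singleton = True
    return "-".join(normalized)


examples = [normalize_language_tag(t) for t in ("en_us", "ZH-hans-cn", "sr_Latn")]
```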

Ensure components::Bag will always generate a result

I'm modifying this bug to be a bit more subtle, which is to ensure that every components::Bag returns a result. The appendItems support is one way of doing that, but...

help wanted

T-core

C-datetime

S-medium

Alignments are not updated for the PrefixModifier

The alignments are just passed through. For them to be valid for using with guided alignments, they will also need to use a custom tokenizer. From the class docs: >...

Merge sentences produces incorrect alignments when used with SentencePiece

The merge sentences modifier uses whitespace tokenization: https://github.com/hplt-project/OpusTrainer/blob/9ec77d3745823f9e05016700938e6b2ffbb770e0/src/opustrainer/modifiers/merge.py#L12-L17. It then counts the tokens to offset the alignments: https://github.com/hplt-project/OpusTrainer/blob/9ec77d3745823f9e05016700938e6b2ffbb770e0/src/opustrainer/modifiers/merge.py#L28-L31. However, for non-whitespace segmented languages, and for training...
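The general idea of offsetting alignments by token counts can be sketched as follows. This is not the OpusTrainer code: `merge_aligned_pairs` and `toy_subwords` are hypothetical names, and the toy subword splitter merely stands in for a tokenizer such as SentencePiece. The point is that the offset depends on which tokenizer counts the tokens, so it has to match the tokenizer the alignments were computed with.

```python
from typing import Callable, List, Tuple

Alignment = List[Tuple[int, int]]
Tokenizer = Callable[[str], List[str]]
AlignedPair = Tuple[str, str, Alignment]


def merge_aligned_pairs(
    first: AlignedPair, second: AlignedPair, tokenize: Tokenizer = str.split
) -> AlignedPair:
    """Concatenate two sentence pairs and shift the second pair's
    alignment indices by the token counts of the first pair."""
    src1, trg1, align1 = first
    src2, trg2, align2 = second
    src_offset = len(tokenize(src1))
    trg_offset = len(tokenize(trg1))
    shifted = [(s + src_offset, t + trg_offset) for s, t in align2]
    return f"{src1} {src2}", f"{trg1} {trg2}", align1 + shifted


def toy_subwords(text: str) -> List[str]:
    """Toy stand-in for a subword tokenizer: chunks of up to 3 characters."""
    return [word[i:i + 3] for word in text.split() for i in range(0, len(word), 3)]


first = ("hello world", "hola mundo", [(0, 0), (1, 1)])
second = ("good night", "buenas noches", [(0, 0), (1, 1)])

by_whitespace = merge_aligned_pairs(first, second)
by_subwords = merge_aligned_pairs(first, second, tokenize=toy_subwords)
```

With whitespace counting the second pair is shifted by two tokens on each side; with the toy subword splitter it is shifted by four. If the alignments index subword tokens but the offset is counted in whitespace tokens, the merged alignments point at the wrong positions.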
