amend drop duplicate behaviour in starter notebook

Open Ari-Ramkilowan opened this issue 5 years ago • 4 comments

Changed the drop-duplicates behaviour to remove rows only when both the source AND target text are duplicates. This allows for cases where a source text has multiple valid translations.
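For illustration, the intended behaviour can be sketched in plain Python as below. This is a minimal sketch, not the notebook's actual code: the function name `drop_exact_duplicate_pairs` and the placeholder strings are assumptions made for the example.

```python
def drop_exact_duplicate_pairs(pairs):
    """Drop a row only when BOTH source and target repeat an earlier row."""
    seen = set()
    kept = []
    for source, target in pairs:
        key = (source, target)
        if key in seen:
            continue  # exact (source, target) duplicate: skip it
        seen.add(key)
        kept.append(key)
    return kept


# Placeholder data (not real translations).
pairs = [
    ("src A", "tgt A1"),
    ("src A", "tgt A1"),  # exact duplicate pair: dropped
    ("src A", "tgt A2"),  # same source, different target: kept
    ("src B", "tgt B1"),
]
deduped = drop_exact_duplicate_pairs(pairs)
```

In a pandas-based notebook, roughly the same effect would come from `df.drop_duplicates(subset=["source_sentence", "target_sentence"])`, where the column names are assumed here rather than taken from the notebook.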

Ari-Ramkilowan, Oct 11 '19 09:10

Check out this pull request on ReviewNB

You'll be able to see Jupyter notebook diff and discuss changes. Powered by ReviewNB.

Let's hold off on merging this one until we've discussed @dwhitena's ideas

I think that taking 100 of the duplicates and getting an isiZulu/isiXhosa speaker to review them would be ideal. We work with an amazing isiZulu linguist if you'd like an expert to check if the duplicates are valid translations. Let me know and I'll do an intro email!
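One way to pull such a review sample could look like the sketch below. It assumes the corpus is available as a list of `(source, target)` tuples; the function name `sample_conflicts_for_review` and the fixed seed are illustrative choices, not part of the starter notebook.

```python
import random
from collections import defaultdict


def sample_conflicts_for_review(pairs, sample_size=100, seed=0):
    """Return up to `sample_size` sources that have more than one distinct target."""
    targets_by_source = defaultdict(set)
    for source, target in pairs:
        targets_by_source[source].add(target)

    # Keep only sources with conflicting (multiple distinct) targets.
    conflicts = sorted(
        (source, sorted(targets))
        for source, targets in targets_by_source.items()
        if len(targets) > 1
    )
    if len(conflicts) <= sample_size:
        return conflicts
    # Seeded sampling so the reviewer's sample is reproducible.
    return random.Random(seed).sample(conflicts, sample_size)


# Placeholder data (not real translations).
pairs = [
    ("src A", "tgt A1"),
    ("src A", "tgt A2"),
    ("src B", "tgt B1"),
    ("src C", "tgt C1"),
    ("src C", "tgt C2"),
    ("src C", "tgt C1"),
]
review_sample = sample_conflicts_for_review(pairs, sample_size=100)
```

Each returned item groups one source sentence with all of its candidate targets, which is probably the easiest shape for a speaker to review.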

jaderabbit, Oct 14 '19 06:10

@jaderabbit and @Ari-Ramkilowan, thanks for the PR and discussion here. My ideas are the following:

If we are able to get human review of the conflicting translations, that would be ideal. @jaderabbit might know how feasible this is, but it seems like it may be possible based on the above comments.

If we can't get human review, we could try something like fast_align to score sentence pairs and weed out bad pairs, in which case the removal of conflicting translations is probably moot. The downside to this sort of approach is that it is rather slow to create and run this alignment model, so we may only want to run it on conflicting translations for language pairs where human review isn't possible.
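A rough sketch of the score-and-filter idea, applied only to conflicting translations, might look like this. It does not call fast_align: `length_ratio_score` is a stand-in scorer invented for the example, and `filter_conflicting_pairs` and the threshold value are likewise assumptions. A real setup would plug an alignment-based score into `score_fn`.

```python
from collections import defaultdict


def length_ratio_score(source, target):
    """Placeholder scorer (NOT fast_align): ratio of token counts, in [0, 1]."""
    s, t = len(source.split()), len(target.split())
    if s == 0 or t == 0:
        return 0.0
    return min(s, t) / max(s, t)


def filter_conflicting_pairs(pairs, score_fn, threshold):
    """Score only sources with several distinct targets; keep targets >= threshold."""
    targets_by_source = defaultdict(list)
    for source, target in pairs:
        if target not in targets_by_source[source]:
            targets_by_source[source].append(target)

    kept = []
    for source, targets in targets_by_source.items():
        if len(targets) == 1:
            # No conflict, so the (slow) scorer is skipped entirely.
            kept.append((source, targets[0]))
            continue
        for target in targets:
            if score_fn(source, target) >= threshold:
                kept.append((source, target))
    return kept


# Placeholder data (not real translations).
pairs = [("a b c", "x y z"), ("a b c", "x"), ("d e", "w")]
filtered = filter_conflicting_pairs(pairs, length_ratio_score, threshold=0.5)
```

Restricting the scorer to conflicting sources keeps the expensive step small, which matches the concern about run time above.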

Any other thoughts?

dwhitena, Oct 14 '19 14:10

Thanks for adding the additional files @Ari-Ramkilowan! Do you have a link to the checkpoint of a trained model?

juliakreutzer, May 10 '20 00:05