masakhane-mt
amend drop duplicate behaviour in starter notebook
Changed drop duplicate behaviour to remove rows only when source AND target text are both duplicated. This allows for cases where a source text has multiple valid translations.
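For anyone skimming the thread, here is a minimal sketch of the behaviour described above, assuming the notebook holds the parallel data in a pandas DataFrame. The column names `source_sentence` and `target_sentence` and the toy rows are assumptions for illustration, not taken from the notebook diff.

```python
import pandas as pd

# Toy data; the column names are assumed, the notebook's may differ.
df = pd.DataFrame(
    {
        "source_sentence": ["Hello", "Hello", "Hello", "Thank you"],
        "target_sentence": ["Sawubona", "Sawubona", "Sanibonani", "Ngiyabonga"],
    }
)

# Drop a row only when BOTH source and target repeat an earlier row,
# so a source with two different translations keeps both of them.
deduped = df.drop_duplicates(subset=["source_sentence", "target_sentence"])

# For contrast: deduplicating on the source alone keeps a single
# translation per source and discards the alternatives.
source_only = df.drop_duplicates(subset=["source_sentence"])

print(len(deduped), len(source_only))
```

With the toy rows, the pair-level version keeps both translations of "Hello", while the source-only version keeps just the first.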
Let's hold off on merging this one until we've discussed @dwhitena's ideas
I think that taking 100 of the duplicates and getting an isiZulu/isiXhosa speaker to review them would be ideal. We work with an amazing isiZulu linguist if you'd like an expert to check whether the duplicates are valid translations. Let me know and I'll do an intro email!
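A rough sketch of how such a review sample could be pulled, assuming a pandas DataFrame with the hypothetical columns `source_sentence` and `target_sentence`. The helper name `sample_conflicts` and the fixed seed are illustrative choices, not part of the PR.

```python
import pandas as pd


def sample_conflicts(df, n=100, seed=0):
    """Return up to n rows whose source text has more than one distinct target."""
    # Exact (source, target) repeats carry no new information for a reviewer.
    pairs = df.drop_duplicates(subset=["source_sentence", "target_sentence"])
    # keep=False flags every row of a repeated source, not only the later ones.
    conflicts = pairs[pairs.duplicated(subset=["source_sentence"], keep=False)]
    sample = conflicts.sample(n=min(n, len(conflicts)), random_state=seed)
    # Sort so that competing translations of one source sit next to each other.
    return sample.sort_values("source_sentence")
```

The result could then be exported with `to_csv` and shared with a reviewer. Note that sampling by row can separate a source from some of its alternative translations when there are more conflicts than `n`.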
@jaderabbit and @Ari-Ramkilowan, thanks for the PR and discussion here. My ideas are the following:
If we are able to get human review of the conflicting translations, that would be ideal. @jaderabbit might know how feasible this is, but it seems like it may be possible based on the above comments.
If we can't get human supervision and we instead try something like fast-align to score sentence pairs and weed out bad pairs, then the removal of conflicting translations is probably moot. The downside to this sort of approach is that such a model is rather slow to create and run, so we may only want to run it on conflicting translations for language pairs where human review isn't possible.
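As a sketch of the "only run it for conflicting translations" idea: the snippet below selects sources that have more than one distinct target and writes them in the `source ||| target` one-pair-per-line layout that fast_align reads. The column names, the helper name `to_fast_align_lines`, and the naive whitespace tokenisation are assumptions; the scoring step itself is left out.

```python
import pandas as pd


def to_fast_align_lines(df):
    """Format only the conflicting pairs as 'source ||| target' lines."""
    pairs = df.drop_duplicates(subset=["source_sentence", "target_sentence"])
    # Count the distinct targets available for each source sentence.
    n_targets = pairs.groupby("source_sentence")["target_sentence"].transform("nunique")
    conflicts = pairs[n_targets > 1]
    lines = []
    for src, tgt in zip(conflicts["source_sentence"], conflicts["target_sentence"]):
        # Naive whitespace normalisation; a real tokeniser may be preferable.
        lines.append(f"{' '.join(src.split())} ||| {' '.join(tgt.split())}")
    return lines


example = pd.DataFrame(
    {
        "source_sentence": ["Hello", "Hello", "Hello", "Thank you"],
        "target_sentence": ["Sawubona", "Sawubona", "Sanibonani", "Ngiyabonga"],
    }
)
lines = to_fast_align_lines(example)
print("\n".join(lines))
```

Restricting the input this way keeps the slow alignment step limited to the pairs that are actually in question.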
Any other thoughts?
Thanks for adding the additional files @Ari-Ramkilowan! Do you have a link to the checkpoint of a trained model?