MT-Preparation
train_dev_test split: use bilingual file to create df
Same issue and a similar solution as in https://github.com/ymoslem/MT-Preparation/pull/2, but this time for both scripts, filter.py and train_dev_test_split.py.
Here is how I implemented the change in the notebook 1-NMT-Data-Processing:
# Filter the dataset
!paste UN.en-fr.fr UN.en-fr.en > UN.en-fr
# Arguments: bilingual file, source language, target language
!python3 MT-Preparation/filtering/filter.py UN.en-fr fr en
# Split the dataset into training set, development set, and test set
!paste UN.en-fr.fr-filtered.fr.subword UN.en-fr.en-filtered.en.subword > UN.en-fr.filtered.subword
# Development and test sets should be between 1000 and 5000 segments (here we chose 2000)
!python3 MT-Preparation/train_dev_split/train_dev_test_split.py 2000 2000 UN.en-fr.filtered.subword fr en
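To illustrate why a single bilingual file helps, here is a minimal sketch of the idea, not the code from this PR. It assumes the file produced by `paste` holds one `source<TAB>target` pair per line. The scripts themselves build a df (presumably a pandas DataFrame); this sketch uses only the standard library, and the names `read_bilingual` and `split_pairs` are hypothetical.

```python
import random


def read_bilingual(path):
    """Read a tab-separated bilingual file into a list of (source, target) pairs."""
    pairs = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Keep only well-formed lines; a dropped line removes the source
            # and the target together, so the two sides cannot drift apart.
            if len(fields) == 2:
                pairs.append((fields[0], fields[1]))
    return pairs


def split_pairs(pairs, dev_size, test_size, seed=0):
    """Shuffle the pairs once, carve out dev and test sets, keep the rest for training."""
    if dev_size + test_size > len(pairs):
        raise ValueError("dev_size + test_size exceeds the number of segment pairs")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return train, dev, test
```

Because every line is handled as one pair, a malformed or skipped line drops both sides at once. Reading the source and target files separately offers no such guarantee if the parser skips a line in only one of them.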