MT-Preparation train_dev_test split: use bilingual file to create df

Open OrianeN opened this issue 2 years ago • 0 comments

This is the same issue, with a similar solution, as in https://github.com/ymoslem/MT-Preparation/pull/2, but this time for both scripts: filter.py and train_dev_test_split.py.

Here is how I implemented the change in the notebook 1-NMT-Data-Processing:

# Filter the dataset
!paste UN.en-fr.fr UN.en-fr.en > UN.en-fr
# Arguments: bilingual file, source language, target language
!python3 MT-Preparation/filtering/filter.py UN.en-fr fr en

# Split the dataset into training set, development set, and test set
!paste UN.en-fr.fr-filtered.fr.subword UN.en-fr.en-filtered.en.subword > UN.en-fr.filtered.subword
# Development and test sets should be between 1000 and 5000 segments (here we chose 2000)
!python MT-Preparation/train_dev_split/train_dev_test_split.py 2000 2000 UN.en-fr.filtered.subword fr en
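
For context, here is a minimal sketch of what "use bilingual file to create df" could look like: the tab-separated output of `paste` is read into a single two-column DataFrame, so each source segment stays on the same row as its target segment. This is not the code from the MT-Preparation scripts. The function names `read_bilingual` and `split_train_dev_test` are hypothetical, and the sketch assumes pandas is installed and that no segment contains a tab character.

```python
import csv

import pandas as pd


def read_bilingual(path, source_lang="fr", target_lang="en"):
    """Read a tab-separated bilingual file into a two-column DataFrame.

    Hypothetical helper: assumes one source/target pair per line,
    separated by a single tab (the default output of `paste`).
    """
    return pd.read_csv(
        path,
        sep="\t",
        header=None,                       # the file has no header row
        names=[source_lang, target_lang],  # name columns after the languages
        quoting=csv.QUOTE_NONE,            # treat quote characters as plain text
        dtype=str,                         # never infer numeric types
        keep_default_na=False,             # keep strings like "NA" as text
    )


def split_train_dev_test(df, dev_size, test_size, seed=0):
    """Shuffle the pairs once, then slice off dev, test, and train sets."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    dev = shuffled.iloc[:dev_size]
    test = shuffled.iloc[dev_size:dev_size + test_size]
    train = shuffled.iloc[dev_size + test_size:]
    return train, dev, test
```

Because both languages live in one DataFrame, any row that is dropped or shuffled affects the source and target sides together, which is the point of passing a single bilingual file instead of two separate ones.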

Sep 08 '22 09:09 OrianeN
