Filtering pipeline produces a config with wrong lang directions

Open molokanov50 opened this issue 1 year ago • 1 comments

I want to fine-tune an NLLB model on my own data, so as I see it the task is relatively simple: convert my dataset to fairseq format. I started using the stopes pipelines for this. However, although the directory structure of my dataset implies the eng_Latn-rus_Cyrl direction, the config.yaml produced by the filtering pipeline lists completely different language pairs.

My dataset consists of 2 files (FTData is the root directory of my dataset):

FTData/eng_Latn-rus_Cyrl/mycorpus.eng_Latn.gz
FTData/eng_Latn-rus_Cyrl/mycorpus.rus_Cyrl.gz

First I run:

python stopes/stopes/pipelines/filtering/scripts/populate_data_conf.py --bt-root bt --mined-data-root mined --primary-train-paths FTData --data-conf-dir ConfOutput train_primary

where bt and mined are empty directories (I initially have only my own texts, without any preprocessing). Then:

python stopes/stopes/pipelines/filtering/scripts/compute_length_factors.py --data-conf-dir ConfOutput --flores-path flores

where flores is also an empty directory. I don't need any external corpora, since my goal is to fine-tune only on my data, but --flores-path is a required parameter of compute_length_factors.py, so I assumed I could point it at an arbitrary directory. Lastly:

python stopes/stopes/pipelines/filtering/filter.py output_dir=FTFiltered data_conf_dir=ConfOutput

My FTFiltered/config.yaml file looks as follows:

data_conf_dir: /home/molokanov/myapp3/ConfOutput
directions:
- eng_Latn-lij_Latn
- eng_Latn-scn_Latn
executor:
  cluster: local
  log_folder: executor_logs
  slurm_partition: null
output_dir: /home/molokanov/myapp3/FTFiltered
train_bt: null
train_mined: null
train_primary:
  dedup_filter:
    _target_: stopes.pipelines.filtering.filters.DedupFilter
    dedup_pairs: true
    max_source_dedup: null
    max_target_dedup: null
  excluded_corpora: null
  included_corpora:
  - nllbseed
  - tatoeba
  laser_filter: null
  length_filter:
    _target_: stopes.pipelines.filtering.filters.LengthFilter
    max_len: 1050
    max_len_ratio: 9.0
    min_len: 5
    min_src_unique_ratio: null
  lid_filter: null
  normalize_punctuation: true
  normalize_unicode: false
  toxicity_filter: null

As you can see, eng_Latn-lij_Latn and eng_Latn-scn_Latn are not in my dataset, yet they appear in the config. At the same time, eng_Latn-rus_Cyrl, the one language pair I actually need, is missing. I also don't understand why nllbseed and tatoeba are listed as included corpora in my config.yaml.
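To make the mismatch concrete, here is a minimal sketch of a sanity check that compares what the dataset layout implies with what the generated config lists. It does not call stopes and does not try to reproduce how stopes discovers directions. It only assumes the `<src>-<tgt>/<corpus>.<lang>.gz` layout described above. The names `layout_directions`, `read_list`, `diff` and `CONFIG_EXCERPT` are hypothetical helpers made up for this illustration, and `read_list` is a simple line scan over the dump shown above, not a real YAML parser.

```python
import tempfile
from pathlib import Path


def layout_directions(root):
    """Map each <src>-<tgt> directory under root to the corpora that have both sides."""
    found = {}
    for pair_dir in sorted(Path(root).iterdir()):
        if not pair_dir.is_dir() or "-" not in pair_dir.name:
            continue
        src, tgt = pair_dir.name.split("-", 1)
        # The corpus name is the file name minus its ".<lang>.gz" suffix.
        src_side = {p.name[: -len(f".{src}.gz")] for p in pair_dir.glob(f"*.{src}.gz")}
        tgt_side = {p.name[: -len(f".{tgt}.gz")] for p in pair_dir.glob(f"*.{tgt}.gz")}
        complete = sorted(src_side & tgt_side)
        if complete:
            found[pair_dir.name] = complete
    return found


def read_list(config_text, key):
    """Collect the "- item" lines that follow the first "<key>:" line (assumed dump format)."""
    items, in_block = [], False
    for line in config_text.splitlines():
        stripped = line.strip()
        if not in_block:
            in_block = stripped == f"{key}:"
        elif stripped.startswith("- "):
            items.append(stripped[2:].strip())
        else:
            break
    return items


def diff(expected, actual):
    """Report entries that are expected but absent, and present but not expected."""
    expected, actual = set(expected), set(actual)
    return {"missing": sorted(expected - actual), "unexpected": sorted(actual - expected)}


# Excerpt of the FTFiltered/config.yaml shown above.
CONFIG_EXCERPT = """\
directions:
- eng_Latn-lij_Latn
- eng_Latn-scn_Latn
executor:
  cluster: local
train_primary:
  included_corpora:
  - nllbseed
  - tatoeba
  laser_filter: null
"""

if __name__ == "__main__":
    # Recreate the two-file dataset layout in a temporary directory (empty placeholder files).
    with tempfile.TemporaryDirectory() as tmp:
        pair_dir = Path(tmp) / "FTData" / "eng_Latn-rus_Cyrl"
        pair_dir.mkdir(parents=True)
        for lang in ("eng_Latn", "rus_Cyrl"):
            (pair_dir / f"mycorpus.{lang}.gz").touch()
        layout = layout_directions(Path(tmp) / "FTData")

    corpora = sorted({name for names in layout.values() for name in names})
    print(diff(layout, read_list(CONFIG_EXCERPT, "directions")))
    print(diff(corpora, read_list(CONFIG_EXCERPT, "included_corpora")))
```

For the layout and config above, the first report lists eng_Latn-rus_Cyrl as missing and the two eng_Latn-lij_Latn / eng_Latn-scn_Latn pairs as unexpected; the second lists mycorpus as missing and nllbseed / tatoeba as unexpected. In other words, neither the directions nor the corpus names in the generated config correspond to anything under FTData.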

molokanov50, Mar 23 '23 07:03