Efficient mining using all the available computational resources
Hello,
I have two monolingual datasets of about 1M sentences each, and I would like to mine bitext from them. For this, I am following the quickstart and modifying demo.yml to my needs. Here is what it looks like at the moment:
```yaml
# @package _global_

# Configure the launcher; this one decides
# how each step of the pipeline gets executed.
launcher:
  # `local` means that you will run all the steps sequentially on
  # your local computer. You can also use `slurm` if you have a slurm
  # cluster set up, in which case parallel jobs will be submitted when possible.
  cluster: local
  # we don't need to set this if we aren't using slurm
  partition: null
  # To improve resilience and make iteration faster, stopes caches the results
  # of each step of the pipeline. Set a fixed directory here if you want to
  # leverage caching.
  cache:
    caching_dir: /path/to/my/global_mining_cache

# you will need to set this on the CLI to point to where
# the demo dir is (after running demo/mining/prepare.sh)
demo_dir: ???
# where the data will go; `.` is the current run directory (auto generated by
# hydra to be unique for each run)
output_dir: /path/to/my/results
# where to find models and vocab; this is what `prepare.sh` downloaded
model_dir: ${demo_dir}/models
vocab_dir: ${demo_dir}/models

mine_threshold: 1.06

embedding_sample:
  sample_shards: false
  # If a file has more total lines than max_shard_size,
  # it will be automatically split into smaller shards
  max_shard_size: 500000

embed_text:
  config:
    encode:
      config:
        requirements:
          nodes: 1
          tasks_per_node: 1
          gpus_per_node: 1
          cpus_per_task: 4
          timeout_min: 2880
        preprocess:
          lowercase: true
          normalize_punctuation: true
          remove_non_printing_chars: true
          deescape_special_chars: true

train_index:
  config:
    use_gpu: true

# Set up some of the steps; using the GPU for populate_index makes
# it a lot faster, but it's OK if you don't have one.
populate_index:
  config:
    use_gpu: true

mine_indexes:
  config:
    knn_dist: 16
    src_k: 16
    tgt_k: 16
    k_extract: 1
    margin_type: ratio
    mine_type: union
    sort_neighbors: false
    margin_norm: mean
    num_probe: 128
    gpu_type: "fp16-shard"
    mine_threshold: ${mine_threshold}

calculate_distances:
  config:
    gpu_memory_gb: 16
    gpu_type: "fp16-shard"
    num_probe: 128
    knn: 16
    save_dists_as_fp16: true
    batch_size: 8192
    normalize_query_embeddings: true

# Provides info about the data. A lot of this is used to generate nice
# output file names.
data:
  data_version: V32m
  iteration: 1
  data_shard_dir: ${demo_dir}
  shard_type: text
  bname: demo_wmt22
  # shard_glob tells us where to find the language files; `{lang}` will be
  # replaced by the language code from src and tgt
  shard_glob: ${.data_shard_dir}/{lang}.gz
  # we need to know the number of lines in each file; this is computed in
  # prepare.sh, and this tells the pipeline where to find the files with
  # this info
  nl_file_template: "{lang}.nl"

# For each language we support, specify where the laser encoder is and where
# the spm model/vocab can be found. In our case, we have custom laser2/3
# encoders for all languages. But for most languages we reuse the same
# spm/vocab, so we use hydra to share this value.
default_spm: ${model_dir}/laser2.spm
default_vocab: ${model_dir}/laser2.cvocab
default_encoder: ${model_dir}/laser2.pt
```
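For reference, here is a rough sketch of how many shards (and hence how many parallel embedding jobs) those settings imply for my data; the exact stopes sharding logic may differ, and `num_shards` is just an illustrative helper name:

```python
import math


def num_shards(total_lines: int, max_shard_size: int) -> int:
    """Ceil-divide a corpus into shards of at most max_shard_size lines."""
    return math.ceil(total_lines / max_shard_size)


# With ~1M lines per language and max_shard_size: 500000,
# each monolingual file should split into about two shards.
print(num_shards(1_000_000, 500_000))  # → 2
```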
To start mining, I run:
```shell
python -m stopes.pipelines.bitext.global_mining_pipeline \
  src_lang=lang1 tgt_lang=lang2 \
  demo_dir=/path/to/my/demo_dir \
  +preset=demo-test embed_text=laser2
```
With this configuration, the embedding step uses only one GPU. I have tried modifying the embed_text config options to make it run on multiple GPUs, but without success. Is there a way to run the embedding step on multiple GPUs?
More generally, given a machine with n_gpus and n_cpus, which config parameters should I modify to make full use of the available computational resources?
Thank you in advance,
Z
Hello,
I've never tried this. In our setup we use slurm as the scheduler, which gives us containers with the right amount of GPUs/CPUs per job.
When you run locally as you are doing, there will be multiple processes, one per shard that you feed in, but there is nothing that spreads each process onto its own local GPU.
You could try setting the CUDA device at the beginning of the embedding module based on the iteration value. I'm not sure whether that would work, but if you have an environment where you can test this, we would love a PR for it.
Thanks for the reply. So, just to be clear: when using stopes with cluster: local, only one GPU is used even if the machine has more?
Indeed. Adding the following lines at the beginning of the run method of PreprocessEncoderModule makes it possible to use all the GPUs for embedding the data:
```python
def run(
    self,
    iteration_value: tp.Optional[tp.Any] = None,
    iteration_index: int = 0,
):
    # Set the CUDA device per shard, so shards are spread
    # round-robin over the available GPUs
    num_gpus = self.config.encode.config.requirements.gpus_per_node
    try:
        rank = int(iteration_value.split(".")[-2]) % num_gpus
        torch.cuda.set_device(rank)
    except ValueError:
        pass
    # [...]
```
I can open a PR if you like.
Best,
Z
Thanks for checking that it works. You probably want to use iteration_index instead of relying on file-name parsing (iteration_value), but this looks great. If you can, send a PR.
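The iteration_index-based mapping could be factored into a small helper; `pick_device` is a hypothetical name, not an existing stopes function, and this is only a sketch of the round-robin idea:

```python
def pick_device(iteration_index: int, num_gpus: int) -> int:
    """Map a shard's iteration_index onto a GPU id, round-robin."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return iteration_index % num_gpus


# Inside run(), this would replace the file-name parsing:
#   torch.cuda.set_device(pick_device(iteration_index, num_gpus))
```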
You can alternatively set up a slurm cluster on your machine - I just turned my 2x RTX 3090 Ubuntu machine into a "slurm cluster".
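With a local slurm running, the only changes needed in the config above should be the launcher block; the partition name here is an assumption, check `sinfo` on your machine for the real one:

```yaml
launcher:
  cluster: slurm
  # replace with a partition name from your slurm setup (see `sinfo`)
  partition: debug
```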