Efficient mining using all the available computational resources
Hello,
I have two monolingual datasets of about 1M sentences each, and I would like to mine bitext from them. For this, I am following the quickstart and modifying demo.yml to my needs. Here is what it looks like at the moment:
```yaml
# @package _global_

# Configure the launcher; this one decides
# how each step of the pipeline gets executed.
launcher:
  # `local` means that you will run all the steps sequentially on
  # your local computer. You can also use `slurm` if you have a slurm
  # cluster set up, in which case parallel jobs will be submitted when possible.
  cluster: local
  # we don't need to set this if we aren't using slurm
  partition: null
  # To improve resilience and make iteration faster, stopes caches the results
  # of each step of the pipeline. Set a fixed directory here if you want to
  # leverage caching.
  cache:
    caching_dir: /path/to/my/global_mining_cache

# you will need to set this on the CLI to point to where
# the demo dir is (after running demo/mining/prepare.sh)
demo_dir: ???
# where the data will go; `.` is the current run directory (auto generated by
# hydra to be unique for each run)
output_dir: /path/to/my/results
# where to find models and vocab; this is what `prepare.sh` downloaded
model_dir: ${demo_dir}/models
vocab_dir: ${demo_dir}/models

mine_threshold: 1.06

embedding_sample:
  sample_shards: false
  # If a file has more total lines than max_shard_size,
  # it will be automatically split into smaller shards
  max_shard_size: 500000

embed_text:
  config:
    encode:
      config:
        requirements:
          nodes: 1
          tasks_per_node: 1
          gpus_per_node: 1
          cpus_per_task: 4
          timeout_min: 2880
        preprocess:
          lowercase: true
          normalize_punctuation: true
          remove_non_printing_chars: true
          deescape_special_chars: true

train_index:
  config:
    use_gpu: true

# Set up some of the steps; using the GPU for populate_index makes
# it a lot faster, but it's OK if you don't have one.
populate_index:
  config:
    use_gpu: true

mine_indexes:
  config:
    knn_dist: 16
    src_k: 16
    tgt_k: 16
    k_extract: 1
    margin_type: ratio
    mine_type: union
    sort_neighbors: false
    margin_norm: mean
    num_probe: 128
    gpu_type: "fp16-shard"
    mine_threshold: ${mine_threshold}

calculate_distances:
  config:
    gpu_memory_gb: 16
    gpu_type: "fp16-shard"
    num_probe: 128
    knn: 16
    save_dists_as_fp16: true
    batch_size: 8192
    normalize_query_embeddings: true

# Provides info about the data. A lot of this is used to generate nice
# output file names.
data:
  data_version: V32m
  iteration: 1
  data_shard_dir: ${demo_dir}
  shard_type: text
  bname: demo_wmt22
  # shard_glob tells us where to find the language files; `{lang}` will be
  # replaced by the language code from src and tgt
  shard_glob: ${.data_shard_dir}/{lang}.gz
  # we need to know the number of lines in each file; this is computed in
  # prepare.sh, and this tells the pipeline where to find the files with
  # this info
  nl_file_template: "{lang}.nl"

# For each language we support, specify where the laser encoder is and where
# the spm model/vocab can be found. In our case, we have custom laser2/3
# encoders for all languages. But for most languages we reuse the same
# spm/vocab, so we use hydra to share this value.
default_spm: ${model_dir}/laser2.spm
default_vocab: ${model_dir}/laser2.cvocab
default_encoder: ${model_dir}/laser2.pt
```
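For reference, here is a rough sketch of how many shards (and hence how many parallel embedding jobs) those settings imply for my data; the exact stopes sharding logic may differ, and `num_shards` is just an illustrative helper name:

```python
import math


def num_shards(total_lines: int, max_shard_size: int) -> int:
    """Ceil-divide a corpus into shards of at most max_shard_size lines."""
    return math.ceil(total_lines / max_shard_size)


# With ~1M lines per language and max_shard_size: 500000,
# each monolingual file should split into about two shards.
print(num_shards(1_000_000, 500_000))  # → 2
```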
To start mining, I run:
```shell
python -m stopes.pipelines.bitext.global_mining_pipeline \
  src_lang=lang1 tgt_lang=lang2 \
  demo_dir=/path/to/my/demo_dir \
  +preset=demo-test embed_text=laser2
```
With this configuration, the embedding step uses only one GPU. I have tried modifying the embed_text config options to make it run on multiple GPUs, but without success. Is there a way to run the embedding step on multiple GPUs?
More generally, given a machine with n_gpus and n_cpus, which config parameters should I modify to make full use of the available computational resources?
Thank you in advance,
Z
Hello,
I've never tried this. In our setup we use slurm as the scheduler, which gives us containers with the right amount of GPUs/CPUs per job.
When you run locally as you are doing, there will be multiple processes, one per shard that you feed in, but there is nothing that spreads each process onto its own local GPU.
You could try setting the CUDA device at the beginning of the embedding module based on the iteration value. I'm not sure whether that would work, but if you have an environment where you can test this, we would love a PR for it.
Thanks for the reply. So, just to be clear: when using stopes with cluster: local, only one GPU is used even if the machine has more?
Indeed. Adding the following lines at the beginning of the run method of PreprocessEncoderModule makes it possible to use all the GPUs for embedding the data:
```python
def run(
    self,
    iteration_value: tp.Optional[tp.Any] = None,
    iteration_index: int = 0,
):
    # Set the CUDA device per shard, so shards are spread
    # round-robin over the available GPUs
    num_gpus = self.config.encode.config.requirements.gpus_per_node
    try:
        rank = int(iteration_value.split(".")[-2]) % num_gpus
        torch.cuda.set_device(rank)
    except ValueError:
        pass
    # [...]
```
I can open a PR if you like.
Best,
Z
Thanks for checking that it works. You probably want to use iteration_index instead of relying on file-name parsing (iteration_value), but this looks great. If you can, send a PR.
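The iteration_index-based mapping could be factored into a small helper; `pick_device` is a hypothetical name, not an existing stopes function, and this is only a sketch of the round-robin idea:

```python
def pick_device(iteration_index: int, num_gpus: int) -> int:
    """Map a shard's iteration_index onto a GPU id, round-robin."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return iteration_index % num_gpus


# Inside run(), this would replace the file-name parsing:
#   torch.cuda.set_device(pick_device(iteration_index, num_gpus))
```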
You can alternatively set up a slurm cluster on your machine - I just turned my 2x RTX 3090 Ubuntu machine into a "slurm cluster".
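With a local slurm running, the only changes needed in the config above should be the launcher block; the partition name here is an assumption, check `sinfo` on your machine for the real one:

```yaml
launcher:
  cluster: slurm
  # replace with a partition name from your slurm setup (see `sinfo`)
  partition: debug
```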