test_megamolbart.py not working
Hi, I got the MegaMolBART docker container up and running with the following command:
$ docker run --name megamolbart --gpus all --rm -v $(pwd)/megamolbart_v0.1/:/models/megamolbart -v $(pwd)/shared/:/shared nvcr.io/nvidia/clara/megamolbart:latest &
I git cloned this repository in shared/ but can't find a way to even test the model. In particular, I get the following error when trying to run test_megamolbart.py:
root@5f1951df00f5:/shared# mv cheminformatics/megamolbart/megamolbart/ . && mv cheminformatics/megamolbart/tests/test_megamolbart.py .
root@13b98eed1cbd:/shared# python test_megamolbart.py
using world size: 1 and model-parallel size: 1
using torch.float32 for parameters ...
-------------------- arguments --------------------
adam_beta1 ...................... 0.9
adam_beta2 ...................... 0.999
adam_eps ........................ 1e-08
adlr_autoresume ................. False
adlr_autoresume_interval ........ 1000
apply_query_key_layer_scaling ... False
apply_residual_connection_post_layernorm False
attention_dropout ............... 0.1
attention_softmax_in_fp32 ....... False
batch_size ...................... None
bert_load ....................... None
bias_dropout_fusion ............. False
bias_gelu_fusion ................ False
block_data_path ................. None
checkpoint_activations .......... False
checkpoint_in_cpu ............... False
checkpoint_num_layers ........... 1
clip_grad ....................... 1.0
contigious_checkpointing ........ False
cpu_optimizer ................... False
cpu_torch_adam .................. False
data_impl ....................... infer
data_path ....................... None
dataset_path .................... None
DDP_impl ........................ local
deepscale ....................... False
deepscale_config ................ None
deepspeed ....................... False
deepspeed_activation_checkpointing False
deepspeed_config ................ None
deepspeed_mpi ................... False
distribute_checkpointed_activations False
distributed_backend ............. nccl
dynamic_loss_scale .............. True
eod_mask_loss ................... False
eval_interval ................... 1000
eval_iters ...................... 100
exit_interval ................... None
faiss_use_gpu ................... False
finetune ........................ False
fp16 ............................ False
fp16_lm_cross_entropy ........... False
fp32_allreduce .................. False
gas ............................. 1
hidden_dropout .................. 0.1
hidden_size ..................... 256
hysteresis ...................... 2
ict_head_size ................... None
ict_load ........................ None
indexer_batch_size .............. 128
indexer_log_interval ............ 1000
init_method_std ................. 0.02
layernorm_epsilon ............... 1e-05
lazy_mpu_init ................... None
load ............................ /models/megamolbart/checkpoints
local_rank ...................... None
log_interval .................... 100
loss_scale ...................... None
loss_scale_window ............... 1000
lr .............................. None
lr_decay_iters .................. None
lr_decay_style .................. linear
make_vocab_size_divisible_by .... 128
mask_prob ....................... 0.15
max_position_embeddings ......... 512
merge_file ...................... None
min_lr .......................... 0.0
min_scale ....................... 1
mmap_warmup ..................... False
model_parallel_size ............. 1
no_load_optim ................... False
no_load_rng ..................... False
no_save_optim ................... False
no_save_rng ..................... False
num_attention_heads ............. 8
num_layers ...................... 4
num_unique_layers ............... None
num_workers ..................... 2
onnx_safe ....................... None
openai_gelu ..................... False
override_lr_scheduler ........... False
param_sharing_style ............. grouped
params_dtype .................... torch.float32
partition_activations ........... False
pipe_parallel_size .............. 0
profile_backward ................ False
query_in_block_prob ............. 0.1
rank ............................ 0
report_topk_accuracies .......... []
reset_attention_mask ............ False
reset_position_ids .............. False
save ............................ None
save_interval ................... None
scaled_masked_softmax_fusion .... False
scaled_upper_triang_masked_softmax_fusion False
seed ............................ 1234
seq_length ...................... None
short_seq_prob .................. 0.1
split ........................... 969, 30, 1
synchronize_each_layer .......... False
tensorboard_dir ................. None
titles_data_path ................ None
tokenizer_type .................. GPT2BPETokenizer
train_iters ..................... None
use_checkpoint_lr_scheduler ..... False
use_cpu_initialization .......... False
use_one_sent_docs ............... False
vocab_file ...................... /models/megamolbart/bart_vocab.txt
warmup .......................... 0.01
weight_decay .................... 0.01
world_size ...................... 1
zero_allgather_bucket_size ...... 0.0
zero_contigious_gradients ....... False
zero_reduce_bucket_size ......... 0.0
zero_reduce_scatter ............. False
zero_stage ...................... 1.0
---------------- end of arguments ----------------
> initializing torch distributed ...
Traceback (most recent call last):
  File "test_megamolbart.py", line 16, in <module>
    wf = MegaMolBART()
  File "/shared/megamolbart/inference.py", line 71, in __init__
    initialize_megatron(args_defaults=args, ignore_unknown_args=True)
  File "/opt/conda/lib/python3.6/site-packages/megatron/initialize.py", line 77, in initialize_megatron
    finish_mpu_init()
  File "/opt/conda/lib/python3.6/site-packages/megatron/initialize.py", line 59, in finish_mpu_init
    _initialize_distributed()
  File "/opt/conda/lib/python3.6/site-packages/megatron/initialize.py", line 156, in _initialize_distributed
    init_method=init_method)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 448, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 133, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
I am interested in getting the embeddings for a bunch of molecules. Any suggestions?
I realized that I was using an older version of this repository, so I have just cloned it again and updated the initial comment accordingly. I have also tried following the steps suggested in #147, but I get the same errors as the OP.
If you are working from source, please follow these steps.
Terminal 1
- ./launch.sh start (this will download prerequisites and start the application)
Terminal 2
The default IP address of the MegaMolBART container is '192.168.100.2'. To confirm the actual IP address, please execute the following command:
docker inspect cheminformatics_megamolbart_1 | grep IPv4Address
This IP address is used as the host in the last step.
- ./launch.sh dev 1 (this will place you in the container, generally in a mode useful for advanced usage and development)
- conda activate rapids (you might need to init conda with conda init bash before running this command)
- ipython3
import grpc
import generativesampler_pb2
import generativesampler_pb2_grpc

host = '192.168.100.2'
with grpc.insecure_channel(f'{host}:50051') as channel:
    stub = generativesampler_pb2_grpc.GenerativeSamplerStub(channel)
    spec = generativesampler_pb2.GenerativeSpec(
        model=generativesampler_pb2.GenerativeModel.MegaMolBART,
        smiles='CN1C=NC2=C1C(=O)N(C(=O)N2C)C',
        radius=0.0001,
        numRequested=10)

    response = stub.FindSimilars(spec)
    print(response.generatedSmiles)
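If you are mainly after embeddings rather than similar molecules, the same stub can be reused. Below is a minimal sketch, assuming the service exposes a SmilesToEmbedding RPC that accepts the same GenerativeSpec message; check generativesampler_pb2_grpc in your checkout (or dir(stub)) for the exact RPC name and the shape of the returned message.

import grpc
import generativesampler_pb2
import generativesampler_pb2_grpc

host = '192.168.100.2'  # replace with the IP reported by docker inspect
smiles_list = ['CN1C=NC2=C1C(=O)N(C(=O)N2C)C', 'CC(=O)OC1=CC=CC=C1C(=O)O']

with grpc.insecure_channel(f'{host}:50051') as channel:
    stub = generativesampler_pb2_grpc.GenerativeSamplerStub(channel)
    for smiles in smiles_list:
        # One request per molecule; SmilesToEmbedding is an assumption,
        # adjust the call if your version names the embedding RPC differently.
        spec = generativesampler_pb2.GenerativeSpec(
            model=generativesampler_pb2.GenerativeModel.MegaMolBART,
            smiles=smiles)
        embedding = stub.SmilesToEmbedding(spec)
        print(smiles, embedding)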