test_megamolbart.py not working
Hi, I got the MegaMolBART docker container up and running with the following command:
$ docker run --name megamolbart --gpus all --rm -v $(pwd)/megamolbart_v0.1/:/models/megamolbart -v $(pwd)/shared/:/shared nvcr.io/nvidia/clara/megamolbart:latest &
I git cloned this repository in shared/ but can't find a way to even test the model. In particular, I get the following error when trying to run test_megamolbart.py:
root@5f1951df00f5:/shared# mv cheminformatics/megamolbart/megamolbart/ . && mv cheminformatics/megamolbart/tests/test_megamolbart.py .
root@13b98eed1cbd:/shared# python test_megamolbart.py
using world size: 1 and model-parallel size: 1
using torch.float32 for parameters ...
-------------------- arguments --------------------
adam_beta1 ...................... 0.9
adam_beta2 ...................... 0.999
adam_eps ........................ 1e-08
adlr_autoresume ................. False
adlr_autoresume_interval ........ 1000
apply_query_key_layer_scaling ... False
apply_residual_connection_post_layernorm False
attention_dropout ............... 0.1
attention_softmax_in_fp32 ....... False
batch_size ...................... None
bert_load ....................... None
bias_dropout_fusion ............. False
bias_gelu_fusion ................ False
block_data_path ................. None
checkpoint_activations .......... False
checkpoint_in_cpu ............... False
checkpoint_num_layers ........... 1
clip_grad ....................... 1.0
contigious_checkpointing ........ False
cpu_optimizer ................... False
cpu_torch_adam .................. False
data_impl ....................... infer
data_path ....................... None
dataset_path .................... None
DDP_impl ........................ local
deepscale ....................... False
deepscale_config ................ None
deepspeed ....................... False
deepspeed_activation_checkpointing False
deepspeed_config ................ None
deepspeed_mpi ................... False
distribute_checkpointed_activations False
distributed_backend ............. nccl
dynamic_loss_scale .............. True
eod_mask_loss ................... False
eval_interval ................... 1000
eval_iters ...................... 100
exit_interval ................... None
faiss_use_gpu ................... False
finetune ........................ False
fp16 ............................ False
fp16_lm_cross_entropy ........... False
fp32_allreduce .................. False
gas ............................. 1
hidden_dropout .................. 0.1
hidden_size ..................... 256
hysteresis ...................... 2
ict_head_size ................... None
ict_load ........................ None
indexer_batch_size .............. 128
indexer_log_interval ............ 1000
init_method_std ................. 0.02
layernorm_epsilon ............... 1e-05
lazy_mpu_init ................... None
load ............................ /models/megamolbart/checkpoints
local_rank ...................... None
log_interval .................... 100
loss_scale ...................... None
loss_scale_window ............... 1000
lr .............................. None
lr_decay_iters .................. None
lr_decay_style .................. linear
make_vocab_size_divisible_by .... 128
mask_prob ....................... 0.15
max_position_embeddings ......... 512
merge_file ...................... None
min_lr .......................... 0.0
min_scale ....................... 1
mmap_warmup ..................... False
model_parallel_size ............. 1
no_load_optim ................... False
no_load_rng ..................... False
no_save_optim ................... False
no_save_rng ..................... False
num_attention_heads ............. 8
num_layers ...................... 4
num_unique_layers ............... None
num_workers ..................... 2
onnx_safe ....................... None
openai_gelu ..................... False
override_lr_scheduler ........... False
param_sharing_style ............. grouped
params_dtype .................... torch.float32
partition_activations ........... False
pipe_parallel_size .............. 0
profile_backward ................ False
query_in_block_prob ............. 0.1
rank ............................ 0
report_topk_accuracies .......... []
reset_attention_mask ............ False
reset_position_ids .............. False
save ............................ None
save_interval ................... None
scaled_masked_softmax_fusion .... False
scaled_upper_triang_masked_softmax_fusion False
seed ............................ 1234
seq_length ...................... None
short_seq_prob .................. 0.1
split ........................... 969, 30, 1
synchronize_each_layer .......... False
tensorboard_dir ................. None
titles_data_path ................ None
tokenizer_type .................. GPT2BPETokenizer
train_iters ..................... None
use_checkpoint_lr_scheduler ..... False
use_cpu_initialization .......... False
use_one_sent_docs ............... False
vocab_file ...................... /models/megamolbart/bart_vocab.txt
warmup .......................... 0.01
weight_decay .................... 0.01
world_size ...................... 1
zero_allgather_bucket_size ...... 0.0
zero_contigious_gradients ....... False
zero_reduce_bucket_size ......... 0.0
zero_reduce_scatter ............. False
zero_stage ...................... 1.0
---------------- end of arguments ----------------
> initializing torch distributed ...
Traceback (most recent call last):
  File "test_megamolbart.py", line 16, in <module>
    wf = MegaMolBART()
  File "/shared/megamolbart/inference.py", line 71, in __init__
    initialize_megatron(args_defaults=args, ignore_unknown_args=True)
  File "/opt/conda/lib/python3.6/site-packages/megatron/initialize.py", line 77, in initialize_megatron
    finish_mpu_init()
  File "/opt/conda/lib/python3.6/site-packages/megatron/initialize.py", line 59, in finish_mpu_init
    _initialize_distributed()
  File "/opt/conda/lib/python3.6/site-packages/megatron/initialize.py", line 156, in _initialize_distributed
    init_method=init_method)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 448, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 133, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
I am interested in getting the embeddings for a bunch of molecules. Any suggestions?
I realized that I was using an older version of this repository, so I have just cloned it again and updated the initial comment accordingly. I have also tried following the steps suggested in #147, but I get the same errors as the OP.
If you are working from source, please follow these steps.
Terminal 1
- ./launch.sh start (this will download prerequisites and start the application)
Terminal 2
The default IP address of the MegaMolBART container is '192.168.100.2'. To confirm the actual IP address, please execute the following command:
docker inspect cheminformatics_megamolbart_1 | grep IPv4Address
This IP address is used as the host in the last step.
- ./launch.sh dev 1 (this will place you in the container, generally in a mode useful for advanced usage and development)
- conda activate rapids (you might need to init conda with conda init bash before running this command)
- ipython3
import grpc
import generativesampler_pb2
import generativesampler_pb2_grpc

host = '192.168.100.2'
with grpc.insecure_channel(f'{host}:50051') as channel:
    stub = generativesampler_pb2_grpc.GenerativeSamplerStub(channel)
    spec = generativesampler_pb2.GenerativeSpec(
        model=generativesampler_pb2.GenerativeModel.MegaMolBART,
        smiles='CN1C=NC2=C1C(=O)N(C(=O)N2C)C',
        radius=0.0001,
        numRequested=10)

    response = stub.FindSimilars(spec)
    print(response.generatedSmiles)
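If you are mainly after embeddings rather than similar molecules, the same stub can be reused. Below is a minimal sketch, assuming the service exposes a SmilesToEmbedding RPC that accepts the same GenerativeSpec message; check generativesampler_pb2_grpc in your checkout (or dir(stub)) for the exact RPC name and the shape of the returned message.

import grpc
import generativesampler_pb2
import generativesampler_pb2_grpc

host = '192.168.100.2'  # replace with the IP reported by docker inspect
smiles_list = ['CN1C=NC2=C1C(=O)N(C(=O)N2C)C', 'CC(=O)OC1=CC=CC=C1C(=O)O']

with grpc.insecure_channel(f'{host}:50051') as channel:
    stub = generativesampler_pb2_grpc.GenerativeSamplerStub(channel)
    for smiles in smiles_list:
        # One request per molecule; SmilesToEmbedding is an assumption,
        # adjust the call if your version names the embedding RPC differently.
        spec = generativesampler_pb2.GenerativeSpec(
            model=generativesampler_pb2.GenerativeModel.MegaMolBART,
            smiles=smiles)
        embedding = stub.SmilesToEmbedding(spec)
        print(smiles, embedding)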