grok-1
segfault after having found the 8 GPUs
Dear all, I managed to set up a DGX-1 (8x A100) for grok-1, after a lot of difficulty with the environment.
Everything seems fine now, but at some point it just segfaults. Any ideas?
Singularity> python run.py
INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA
INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
INFO:rank:Initializing mesh for self.local_mesh_config=(1, 8) self.between_hosts_config=(1, 1)...
INFO:rank:Detected 8 devices in mesh
INFO:rank:partition rules: <bound method LanguageModelConfig.partition_rules of LanguageModelConfig(model=TransformerConfig(emb_size=6144, key_size=128, num_q_heads=48, num_kv_heads=8, num_layers=64, vocab_size=131072, widening_factor=8, attn_output_multiplier=0.08838834764831845, name=None, num_experts=8, capacity_factor=1.0, num_selected_experts=2, init_scale=1.0, shard_activations=True, data_axis='data', model_axis='model'), vocab_size=131072, pad_token=0, eos_token=2, sequence_len=8192, model_size=6144, embedding_init_scale=1.0, embedding_multiplier_scale=78.38367176906169, output_multiplier_scale=0.5773502691896257, name=None, fprop_dtype=<class 'jax.numpy.bfloat16'>, model_type=None, init_scale_override=None, shard_embeddings=True)>
Segmentation fault (core dumped)
I've seen that error before, and it was cured by trying other versions of numpy, believe it or not.
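If you want to compare numpy versions, it may help to record what is actually installed in the container and to get a Python-level traceback when the crash happens. Below is a minimal sketch using only the standard library; the helper name `installed_versions` and the package list are my own choices for illustration, not something from the grok-1 repo.

```python
import faulthandler
import sys
from importlib import metadata


def installed_versions(packages=("numpy", "jax", "jaxlib")):
    """Return {package: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # On a fatal signal such as SIGSEGV, dump the Python tracebacks
    # of all threads to stderr before the process dies.
    faulthandler.enable(file=sys.stderr, all_threads=True)

    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

The same traceback dump can be had without touching any code by running `python -X faulthandler run.py`. It only shows which Python call was active when the segfault hit, so it narrows things down rather than fixing anything.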
No fix. Problem within numpy. Escalate with numpy
It's trying to load AMD GPU drivers on your Nvidia GPU platform, which obviously doesn't work (the question is why the f it's trying to load AMD drivers).
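Note that the 'rocm' and 'tpu' lines in the log are at INFO level and the log goes on to detect all 8 devices, so they may just be JAX probing for backends rather than the cause of the crash. To rule them out, one option is to pin JAX to CUDA before it is imported. This is a sketch that assumes the JAX version in the container honors the `JAX_PLATFORMS` environment variable; the helper `pin_jax_platform` is a hypothetical name, not part of JAX or grok-1.

```python
import os


def pin_jax_platform(platform="cuda", env=None):
    """Set JAX_PLATFORMS so JAX only tries the given backend.

    Must be called before `import jax`; JAX is assumed to read the
    variable when it initializes its backends.
    """
    env = os.environ if env is None else env
    env["JAX_PLATFORMS"] = platform
    return env["JAX_PLATFORMS"]


if __name__ == "__main__":
    print("JAX_PLATFORMS =", pin_jax_platform())
    # import jax  # import only after the variable is set
```

Equivalently, `JAX_PLATFORMS=cuda python run.py` from the shell. If the segfault persists with the probing messages gone, the AMD lookup was not the problem.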