Sanders
The CPU model is `Intel(R) Xeon(R) Gold 6150 CPU @2.70GHz` and the architecture is **x86_64**. I checked the symbols of **netlib-native_system-linux-x86_64.so**:
```
jar -xvf netlib-native_system-linux-x86_64-1.1-natives.jar
nm netlib-native_system-linux-x86_64.so
```
There are no symbols:
> nm: netlib-native_system-linux-x86_64.so:...
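
One possible follow-up check, as a hedged sketch: if the library was stripped (an assumption; the truncated `nm` output above does not confirm it), plain `nm` finds nothing because the regular symbol table is gone, while `nm -D` reads the dynamic symbol table that a shared library keeps. The helper names `nm_dynamic_command` and `list_dynamic_symbols` are hypothetical, and the sketch assumes GNU binutils `nm` is on the `PATH`.

```python
import shutil
import subprocess
from typing import List


def nm_dynamic_command(library_path: str) -> List[str]:
    """Build the argv that lists the dynamic symbols of a shared library."""
    # -D reads the dynamic symbol table (.dynsym), which survives `strip`.
    # --defined-only hides symbols the library merely imports.
    return ["nm", "-D", "--defined-only", library_path]


def list_dynamic_symbols(library_path: str) -> List[str]:
    """Run nm and return the exported symbol names (hypothetical helper)."""
    if shutil.which("nm") is None:
        raise RuntimeError("nm not found on PATH; install binutils first")
    result = subprocess.run(
        nm_dynamic_command(library_path),
        capture_output=True,
        text=True,
        check=True,
    )
    # Each output line has the form "<address> <type> <name>"; keep the name.
    return [line.split()[-1] for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    # Path taken from the comment above; adjust to where the jar was extracted.
    for name in list_dynamic_symbols("netlib-native_system-linux-x86_64.so"):
        print(name)
```

If this also prints nothing, the library really exports no symbols and the problem lies elsewhere.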
Yes, I specified the parameter. Here is the whole command: ``` ./angel-submit \ --angel.deploy.mode YARN \ --angel.am.max-attempts 3 \ --angel.worker.max-attempts 3 \ --angel.ps.max-attempts 3 \ --angel.app.submit.class com.tencent.angel.ml.core.graphsubmit.GraphRunner \ --angel.train.data.path "hdfs://xxx/"...
Whether I specify the `--principal` and `--keytab` options or the `spark.yarn.keytab` and `spark.yarn.principal` configuration, I get a Connection Refused exception. Note that the `kinit` command works fine. ``` Exception while invoking getNewApplication...
I can submit the Spark example with or without Kerberos authentication. ``` spark-submit --class org.apache.spark.examples.SparkPi \ --master yarn-client \ --keytab [my_keytab] \ --principal [my_name] \ --num-executors 4 \ --driver-memory 512m \...
Did you specify the keytab file as a client-local file, or did you submit it with the `--file` option?
Yes, it is a local file.
The **FM** model has this problem. After predicting with the **WAD** model, the predictions seem much more reasonable. The Pos/Neg ratio is 1:3 for both models. Here is my **FM**...
> Please try the following command:
>
> ```
> ./scripts/run_chatbot.sh \
>     pinkmanlove/llama-7b-hf/ \
>     ${project_dir}/output_models/llama-7b-v2/
> ```

Sure, I tried that before. I got the error as...
Any tips for training with zero2? The K8s pod gets killed due to RAM/GPU memory overhead.
The process is killed soon after loading checkpoints if I specify the **zero2** config, even after I reduced `--block_size` and `--dataloader_num_workers`. ``` [2023-04-27 11:10:19,080] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 [2023-04-27 11:10:23,115] [INFO]...