sagemaker-python-sdk
Huggingface estimator tries to import tensorflow when pytorch is defined
Hi,
The entry point script of my training job fails when using the HuggingFace estimator with PyTorch, because it tries to import TensorFlow, which is not installed in the image. The code attached below worked without issue until 2022/07/13.
sagemaker version: 2.97.0
train_step.py:
from sagemaker.huggingface import HuggingFace
[...]
hf_estimator = HuggingFace(
    entry_point='entrypoint.py',
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyperparameters,
    environment=sm_env_vars,
)
hf_estimator.fit()
[...]
entrypoint.py:
from transformers import Trainer
[...]
trainer = Trainer(
    model=model.to(device),
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=_compute_metrics,
)
trainer.train()
[...]
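Note: a possible mitigation while this is open (a sketch, not verified against this image) is to disable the SageMaker debugger hook, since the failing import chain in the trace below starts from smdebug. debugger_hook_config is a documented Estimator argument; the USE_SMDEBUG environment variable is a switch commonly suggested for the DLC's patched DataLoader and should be treated as an assumption here.

hf_estimator = HuggingFace(
    entry_point='entrypoint.py',
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyperparameters,
    # Disable the debugger hook that triggers the smdebug import at data-loading time.
    debugger_hook_config=False,
    # Assumed switch for the DLC's smdebug integration; unverified.
    environment={**sm_env_vars, 'USE_SMDEBUG': '0'},
)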
Stack trace:
[...]
2022-08-19 08:16:37,196 - INFO Defaulting to the only supported framework/algorithm version: latest.
2022-08-19 08:16:37,302 - INFO Ignoring unnecessary instance type: None.
2022-08-19 08:16:37,330 - INFO Creating training-job with name: huggingface-pytorch-training-2022-08-19-08-16-36-060
2022-08-19 08:16:37 Starting - Starting the training job...
2022-08-19 08:17:01 Starting - Preparing the instances for training
ProfilerReport-1660896997: InProgress
......
2022-08-19 08:18:01 Downloading - Downloading input data...
2022-08-19 08:18:41 Training - Downloading the training image........................
2022-08-19 08:22:42 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-08-19 08:22:41,913 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training
2022-08-19 08:22:41,936 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2022-08-19 08:22:41,942 sagemaker_pytorch_container.training INFO Invoking user training script.
2022-08-19 08:22:42,507 sagemaker-training-toolkit INFO Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {},
    "current_host": "algo-1",
    "current_instance_group": "homogeneousCluster",
    "current_instance_group_hosts": [
        "algo-1"
    ],
    "current_instance_type": "ml.g4dn.xlarge",
    "distribution_hosts": [],
    "distribution_instance_groups": [],
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "context": "business",
        "hf_test_path": "hf_test.csv",
        "hf_test_results_path": "hf_model_results.csv",
        "lf_test_path": "lf_test.csv",
        "lf_test_results_path": "lf_model_results.csv",
        "model_params": {
            "pretrained_model_id": "bert-base-multilingual-cased",
            "num_train_epoch": 3,
            "per_device_train_batch_size": 16,
            "per_device_eval_batch_size": 16,
            "warmup_steps": 500,
            "weight_decay": 0.01,
            "logging_steps": 10
        },
        "model_path": "models/finetuned",
        "params_path": "params_business_mbert_cased.yaml",
        "paths_config_path": "paths.yaml",
        "run_id": "2022/08/19/094046794149",
        "train_path": "prep_train.csv"
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {},
    "input_dir": "/opt/ml/input",
    "instance_groups": [
        "homogeneousCluster"
    ],
    "instance_groups_dict": {
        "homogeneousCluster": {
            "instance_group_name": "homogeneousCluster",
            "instance_type": "ml.g4dn.xlarge",
            "hosts": [
                "algo-1"
            ]
        }
    },
    "is_hetero": false,
    "is_master": true,
    "is_modelparallel_enabled": null,
    "job_name": "huggingface-pytorch-training-2022-08-19-08-16-36-060",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-eu-west-1-XXXXXXXXXXX/huggingface-pytorch-training-2022-08-19-08-16-36-060/source/sourcedir.tar.gz",
    "module_name": "entrypoint",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 1,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.g4dn.xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.g4dn.xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "entrypoint.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"context":"business","hf_test_path":"hf_test.csv","hf_test_results_path":"hf_model_results.csv","lf_test_path":"lf_test.csv","lf_test_results_path":"lf_model_results.csv","model_params":{"logging_steps":10,"num_train_epoch":3,"per_device_eval_batch_size":16,"per_device_train_batch_size":16,"pretrained_model_id":"bert-base-multilingual-cased","warmup_steps":500,"weight_decay":0.01},"model_path":"models/finetuned","params_path":"params_business_mbert_cased.yaml","paths_config_path":"paths.yaml","run_id":"2022/08/19/094046794149","train_path":"prep_train.csv"}
SM_USER_ENTRY_POINT=entrypoint.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=[]
SM_CURRENT_HOST=algo-1
SM_CURRENT_INSTANCE_TYPE=ml.g4dn.xlarge
SM_CURRENT_INSTANCE_GROUP=homogeneousCluster
SM_CURRENT_INSTANCE_GROUP_HOSTS=["algo-1"]
SM_INSTANCE_GROUPS=["homogeneousCluster"]
SM_INSTANCE_GROUPS_DICT={"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}}
SM_DISTRIBUTION_INSTANCE_GROUPS=[]
SM_IS_HETERO=false
SM_MODULE_NAME=entrypoint
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-eu-west-1-XXXXXXXXXXXX/huggingface-pytorch-training-2022-08-19-08-16-36-060/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1","current_instance_group":"homogeneousCluster","current_instance_group_hosts":["algo-1"],"current_instance_type":"ml.g4dn.xlarge","distribution_hosts":[],"distribution_instance_groups":[],"framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"context":"business","hf_test_path":"hf_test.csv","hf_test_results_path":"hf_model_results.csv","lf_test_path":"lf_test.csv","lf_test_results_path":"lf_model_results.csv","model_params":{"logging_steps":10,"num_train_epoch":3,"per_device_eval_batch_size":16,"per_device_train_batch_size":16,"pretrained_model_id":"bert-base-multilingual-cased","warmup_steps":500,"weight_decay":0.01},"model_path":"models/finetuned","params_path":"params_business_mbert_cased.yaml","paths_config_path":"paths.yaml","run_id":"2022/08/19/094046794149","train_path":"prep_train.csv"},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","instance_groups":["homogeneousCluster"],"instance_groups_dict":{"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}},"is_hetero":false,"is_master":true,"is_modelparallel_enabled":null,"job_name":"huggingface-pytorch-training-2022-08-19-08-16-36-060","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-eu-west-1-XXXXXXXXXXXXXXX/huggingface-pytorch-training-2022-08-19-08-16-36-060/source/sourcedir.tar.gz","module_name":"entrypoint","network_interface_name":"eth0","num_cpus":4,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"entrypoint.py"}
SM_USER_ARGS=["--context","business","--hf_test_path","hf_test.csv","--hf_test_results_path","hf_model_results.csv","--lf_test_path","lf_test.csv","--lf_test_results_path","lf_model_results.csv","--model_params","logging_steps=10,num_train_epoch=3,per_device_eval_batch_size=16,per_device_train_batch_size=16,pretrained_model_id=bert-base-multilingual-cased,warmup_steps=500,weight_decay=0.01","--model_path","s3://niva-nlu-dev/runs/2022/08/19/094046794149/data/models/finetuned","--params_path","data/run/params_business_mbert_cased.yaml","--paths_config_path","niva_nlu/configs/paths.yaml","--run_id","2022/08/19/094046794149","--train_path","prep_train.csv"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_HP_CONTEXT=business
SM_HP_HF_TEST_PATH=hf_test.csv
SM_HP_HF_TEST_RESULTS_PATH=hf_model_results.csv
SM_HP_LF_TEST_PATH=lf_test.csv
SM_HP_LF_TEST_RESULTS_PATH=lf_model_results.csv
SM_HP_MODEL_PARAMS={"logging_steps":10,"num_train_epoch":3,"per_device_eval_batch_size":16,"per_device_train_batch_size":16,"pretrained_model_id":"bert-base-multilingual-cased","warmup_steps":500,"weight_decay":0.01}
SM_HP_MODEL_PATH=models/finetuned
SM_HP_PARAMS_PATH=params_business_mbert_cased.yaml
SM_HP_PATHS_CONFIG_PATH=paths.yaml
SM_HP_RUN_ID=2022/08/19/094046794149
SM_HP_TRAIN_PATH=prep_train.csv
PYTHONPATH=/opt/ml/code:/opt/conda/lib/python3.8/site-packages/smdistributed/dataparallel/lib:/opt/conda/bin:/opt/conda/lib/python38.zip:/opt/conda/lib/python3.8:/opt/conda/lib/python3.8/lib-dynload:/opt/conda/lib/python3.8/site-packages:/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument-3.4.2-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument_cext-0.2.4-py3.8-linux-x86_64.egg
Invoking script with the following command:
/opt/conda/bin/python3.8 entrypoint.py --context business --hf_test_path hf_test.csv --hf_test_results_path hf_model_results.csv --lf_test_path lf_test.csv --lf_test_results_path lf_model_results.csv --model_params logging_steps=10,num_train_epoch=3,per_device_eval_batch_size=16,per_device_train_batch_size=16,pretrained_model_id=bert-base-multilingual-cased,warmup_steps=500,weight_decay=0.01 --model_path models/finetuned --params_path params_business_mbert_cased.yaml --paths_config_path paths.yaml --run_id 2022/08/19/094046794149 --train_path prep_train.csv
Downloading: 5.27kB [00:00, 3.76MB/s]
Downloading: 100%|██████████| 625/625 [00:00<00:00, 534kB/s]
Downloading: 100%|██████████| 29.0/29.0 [00:00<00:00, 27.4kB/s]
Downloading: 100%|██████████| 972k/972k [00:00<00:00, 2.52MB/s]
Downloading: 100%|██████████| 1.87M/1.87M [00:00<00:00, 3.71MB/s]
Downloading: 100%|██████████| 681M/681M [00:10<00:00, 67.4MB/s]
Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
wandb: Currently logged in as: emilio_marinone (pn-aa). Use `wandb login --relogin` to force relogin
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/opt/conda/lib/python3.8/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
setattr(self, word, getattr(machar, word).flat[0])
/opt/conda/lib/python3.8/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
return self._float_to_str(self.smallest_subnormal)
/opt/conda/lib/python3.8/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
setattr(self, word, getattr(machar, word).flat[0])
/opt/conda/lib/python3.8/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
return self._float_to_str(self.smallest_subnormal)
wandb: wandb version 0.13.1 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.12.19
wandb: Run data is saved locally in /opt/ml/code/wandb/run-20220819_082323-2022-08-19-094046794149
wandb: Run `wandb offline` to turn off syncing.
wandb: Resuming run 2022/08/19/094046794149
wandb: ⭐️ View project at https://wandb.ai/pn-aa/niva
wandb: 🚀 View run at https://wandb.ai/pn-aa/niva/runs/2022-08-19-094046794149
/opt/conda/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 99941
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 18741
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
0% 0/18741 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg/smdebug/core/utils.py", line 48, in <module>
    import smdistributed.modelparallel.tensorflow as smp
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/tensorflow/__init__.py", line 3, in <module>
    from smdistributed.modelparallel.tensorflow.comm import *
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/tensorflow/comm.py", line 4, in <module>
    from smdistributed.modelparallel.tensorflow.state_mod import state
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/tensorflow/state_mod.py", line 6, in <module>
    import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3037, in _dep_map
    return self.__dep_map
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2834, in __getattr__
    raise AttributeError(attr)
AttributeError: _DistInfoDistribution__dep_map

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3028, in _parsed_pkg_info
    return self._pkg_info
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2834, in __getattr__
    raise AttributeError(attr)
AttributeError: _pkg_info

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "entrypoint.py", line 483, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "entrypoint.py", line 347, in main
    trainer.train()
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1374, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 361, in __iter__
    return self._get_iterator()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 304, in _get_iterator
    return _SingleProcessDataLoaderIter(self)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 583, in __init__
    super(_SingleProcessDataLoaderIter, self).__init__(loader)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 507, in __init__
    self._smdebug_hook = get_smdebug_hook()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py", line 50, in get_smdebug_hook
    import smdebug.pytorch as smd
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
  File "<frozen zipimport>", line 259, in load_module
  File "/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg/smdebug/__init__.py", line 2, in <module>
    from smdebug.core.collection import CollectionKeys
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
  File "<frozen zipimport>", line 259, in load_module
  File "/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg/smdebug/core/collection.py", line 7, in <module>
    from .reduction_config import ReductionConfig
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
  File "<frozen zipimport>", line 259, in load_module
  File "/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg/smdebug/core/reduction_config.py", line 7, in <module>
    from smdebug.core.utils import split
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
  File "<frozen zipimport>", line 259, in load_module
  File "/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg/smdebug/core/utils.py", line 53, in <module>
    import smdistributed.modelparallel.torch as smp
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/__init__.py", line 11, in <module>
    from smdistributed.modelparallel.torch import amp, nn, smplib
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/amp/__init__.py", line 2, in <module>
    from smdistributed.modelparallel.torch.amp.scaler import GradScaler
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/amp/scaler.py", line 12, in <module>
    from smdistributed.modelparallel.torch.comm import CommGroup, allgather
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/comm.py", line 5, in <module>
    from smdistributed.modelparallel.torch.state_mod import state
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/state_mod.py", line 13, in <module>
    from smdistributed.modelparallel.backend.state_mod import ModelParallelState
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/backend/state_mod.py", line 11, in <module>
    from smdistributed.modelparallel.backend.utils import bijection_2d
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/backend/utils.py", line 11, in <module>
    from smexperiments.metrics import SageMakerFileMetricsWriter
  File "/opt/conda/lib/python3.8/site-packages/smexperiments/__init__.py", line 15, in <module>
    __version__ = pkg_resources.require("sagemaker-experiments")[0].version
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 909, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 803, in resolve
    new_requirements = dist.requires(req.extras)[::-1]
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2755, in requires
    dm = self._dep_map
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3039, in _dep_map
    self.__dep_map = self._compute_dependencies()
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3048, in _compute_dependencies
    for req in self._parsed_pkg_info.get_all('Requires-Dist') or []:
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3030, in _parsed_pkg_info
    metadata = self.get_metadata(self.PKG_INFO)
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1431, in get_metadata
    value = self._get(path)
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1635, in _get
    with open(path, 'rb') as stream:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.8/site-packages/urllib3-1.26.10.dist-info/METADATA'
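To read the trace: the ModuleNotFoundError for tensorflow is raised and swallowed inside smdebug, which tries its TensorFlow integration first (utils.py line 48) and falls back to the torch one (line 53); the job actually dies later, when the torch fallback imports smexperiments and pkg_resources cannot find urllib3's dist-info METADATA. A paraphrase of the structure implied by the trace (reconstructed, not the library's actual source):

# Paraphrase of smdebug/core/utils.py lines 48 and 53, reconstructed from the
# trace above; not the actual library source.
try:
    # Fails in the PyTorch DLC: tensorflow is not installed.
    import smdistributed.modelparallel.tensorflow as smp
except ImportError:
    # The fallback import chain reaches smexperiments, whose pkg_resources
    # lookup raises FileNotFoundError on urllib3-1.26.10.dist-info/METADATA.
    import smdistributed.modelparallel.torch as smp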
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: 0.023 MB of 0.023 MB uploaded (0.000 MB deduped)
wandb:
wandb: Run summary:
wandb: High freq test data link https://niva-nlu-dev...
wandb: Low freq test data link https://niva-nlu-dev...
wandb: Training data link https://niva-nlu-dev...
wandb:
wandb: Synced 2022/08/19/094046794149: https://wandb.ai/pn-aa/niva/runs/2022-08-19-094046794149
wandb: Synced 3 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20220819_082323-2022-08-19-094046794149/logs
2022-08-19 08:23:38,120 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2022-08-19 08:23:38,120 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.
2022-08-19 08:23:38,121 sagemaker-training-toolkit ERROR Reporting training FAILURE
2022-08-19 08:23:38,121 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage "ModuleNotFoundError: No module named 'tensorflow'#015
#015
During handling of the above exception, another exception occurred:#015
Traceback (most recent call last):#015
File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3037, in _dep_map#015
return self.__dep_map#015
File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2834, in __getattr__#015
raise AttributeError(attr)#015
AttributeError: _DistInfoDistribution__dep_map#015
During handling of the above exception, another exception occurred
File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3028, in _parsed_pkg_info#015
return self._pkg_info#015
AttributeError: _pkg_info#015
Traceback (most recent call last)
File "entrypoint.py", line 483, in <module>#015
main() # pylint: disable=no-value-for-parameter
File "entrypoint.py", line 347, in main#015
trainer.train()#015
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1374, in train#015
for step, inputs in enumerate(epoch_iterator)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 361, in __iter__#015
return self._get_iterator()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 304, in _get_iterator#015
return _SingleProcessDataLoaderIter(self)#015
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 583, in __init__#015
super(_SingleProcessDataLoaderIter, self).__init__(loader)#015
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 507, in __init__#015
self._smdebug_hook = get_smdebug_hook()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py", line 50, in get_smdebug_hook#015
import smdebug.pytorch as smd#015
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked#015
File "<frozen importlib._bootstrap>", line 655, in _load_unlocked#015
File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
File "<frozen zipimport>", line 259, in load_module#015
File "/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg/smdebug/__init__.py", line 2, in <module>#015
from smdebug.core.collection import CollectionKeys#015
File "<frozen importlib._bootstrap>", line 991, in _find_and_load#015
File "/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg/smdebug/core/collection.py", line 7, in <module>#015
from .reduction_config import ReductionConfig#015
File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible#015
File "/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg/smdebug/core/reduction_config.py", line 7, in <module>#015
from smdebug.core.utils import split
File "/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg/smdebug/core/utils.py", line 53, in <module>#015
import smdistributed.modelparallel.torch as smp
File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/__init__.py", line 11, in <module>#015
from smdistributed.modelparallel.torch import amp, nn, smplib#015
File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/amp/__init__.py", line 2, in <module>#015
from smdistributed.modelparallel.torch.amp.scaler import GradScaler#015
File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/amp/scaler.py", line 12, in <module>#015
from smdistributed.modelparallel.torch.comm import CommGroup, allgather
File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/comm.py", line 5, in <module>#015
from smdistributed.modelparallel.torch.state_mod import state
File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/state_mod.py", line 13, in <module>#015
from smdistributed.modelparallel.backend.state_mod import ModelParallelState#015
File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/backend/state_mod.py", line 11, in <module>#015
from smdistributed.modelparallel.backend.utils import bijection_2d#015
File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/backend/utils.py", line 11, in <module>#015
from smexperiments.metrics import SageMakerFileMetricsWriter#015
File "/opt/conda/lib/python3.8/site-packages/smexperiments/__init__.py", line 15, in <module>#015
__version__ = pkg_resources.require("sagemaker-experiments")[0].version#015
File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 909, in require#015
needed = self.resolve(parse_requirements(requirements))#015
File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 803, in resolve#015
new_requirements = dist.requires(req.extras)[::-1]
File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2755, in requires#015
dm = self._dep_map#015
File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3039, in _dep_map#015
self.__dep_map = self._compute_dependencies()#015
File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3048, in _compute_dependencies#015
for req in self._parsed_pkg_info.get_all('Requires-Dist') or []:#015
File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3030, in _parsed_pkg_info#015
metadata = self.get_metadata(self.PKG_INFO)
File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1431, in get_metadata#015
value = self._get(path)#015
File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1635, in _get#015
with open(path, 'rb') as stream:#015
FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.8/site-packages/urllib3-1.26.10.dist-info/METADATA'
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: - 0.000 MB of 0.000 MB uploaded (0.000 MB deduped)
wandb: \ 0.000 MB of 0.000 MB uploaded (0.000 MB deduped)
wandb: | 0.000 MB of 0.023 MB uploaded (0.000 MB deduped)
wandb: / 0.000 MB of 0.023 MB uploaded (0.000 MB deduped)
wandb: - 0.023 MB of 0.023 MB uploaded (0.000 MB deduped)
wandb: \ 0.023 MB of 0.023 MB uploaded (0.000 MB deduped)
wandb: | 0.023 MB of 0.023 MB uploaded (0.000 MB deduped)
wandb: / 0.023 MB of 0.023 MB uploaded (0.000 MB deduped)
wandb: \ 0.023 MB of 0.023 MB uploaded (0.000 MB deduped)#015wandb
wandb
wandb: Run summary
wandb: High freq test data link https://niva-nlu-dev...
wandb: Low freq test data link https://niva-nlu-dev...
wandb: Training data link https://niva-nlu-dev...
wandb: Synced 2022/08/19/094046794149: https://wandb.ai/pn-aa/niva/runs/2022-08-19-094046794149
wandb: Synced 3 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20220819_082323-2022-08-19-094046794149/logs"
2022-08-19 08:24:06,586 - ERROR Failed fitting hf estimator: Error for Training job huggingface-pytorch-training-2022-08-19-08-16-36-060: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "ModuleNotFoundError: No module named 'tensorflow'
[...]
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 137
Please note that running the entrypoint inside the DLC on my local machine (no GPU) works fine.
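The final error can also be reproduced without any training code, since pkg_resources.require("sagemaker-experiments") is the exact call that fails in the trace. A minimal check to run inside the training image (a sketch derived from the traceback above):

# Minimal reproduction of the failing metadata lookup from the traceback.
# On an image with a broken urllib3 install, this raises
# FileNotFoundError: .../urllib3-1.26.10.dist-info/METADATA.
import pkg_resources

pkg_resources.require("sagemaker-experiments")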
I found that this was fixed for me when I updated the package versions in my requirements.txt (see the estimator sketch after the list):
transformers==4.25.1
datasets==2.9.0
accelerate==0.16.0
evaluate==0.4.0
deepspeed==0.9.1
ninja
rouge-score
nltk
py7zr
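For completeness: a requirements.txt only takes effect if it ships alongside the entry point, since the training toolkit installs it from the code directory before invoking the script. A sketch of how that is typically wired up (the source_dir value is an assumption; the original snippet does not show it):

hf_estimator = HuggingFace(
    entry_point='entrypoint.py',
    # Assumed layout: entrypoint.py and requirements.txt live in source_dir,
    # so the container installs the pinned versions before training starts.
    source_dir='.',
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
)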
Closing this ticket since it works after updating the package versions. Feel free to reopen if the issue persists.