
HuggingFace estimator tries to import TensorFlow when PyTorch is specified

Open · marinone94 opened this issue 2 years ago

Hi,

The entry point script of my training job fails when using the HuggingFace estimator with PyTorch, because it tries to import TensorFlow, which is not installed in the image. The attached code worked without issue until 2022/07/13.

sagemaker version: 2.97.0

train_step.py:

from sagemaker.huggingface import HuggingFace

[...]

hf_estimator = HuggingFace(
    entry_point='entrypoint.py',
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyperparameters,
    environment=sm_env_vars,
)
hf_estimator.fit()
[...]
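While waiting for a fix, a possible mitigation (a hedged sketch, not something confirmed in this thread) is to disable the SageMaker Debugger hook on the estimator, since the failure originates in the smdebug hook that the container patches into torch's DataLoader (see the stack trace below). debugger_hook_config=False is a standard estimator argument; USE_SMDEBUG is, as far as I can tell, the environment switch checked by the patched torch/utils/smdebug.py visible in the trace:

hf_estimator = HuggingFace(
    entry_point='entrypoint.py',
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyperparameters,
    # Skip the Debugger hook entirely so smdebug is never imported:
    debugger_hook_config=False,
    environment={**sm_env_vars, 'USE_SMDEBUG': '0'},
)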

entrypoint.py:

from transformers import Trainer

[...]

trainer = Trainer(
    model=model.to(device),
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=_compute_metrics,
)
trainer.train()

[...]
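The construction of training_args is elided above; for context, here is a minimal sketch consistent with the hyperparameters visible in the job log below (a hypothetical reconstruction, not copied from the actual script):

from transformers import TrainingArguments

# Hypothetical reconstruction based on the logged hyperparameters;
# the real entrypoint.py builds this from parsed CLI arguments.
training_args = TrainingArguments(
    output_dir='/opt/ml/model',          # SM_MODEL_DIR in the log
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_steps=10,
)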

Stack trace:

[...]
2022-08-19 08:16:37,196 - INFO Defaulting to the only supported framework/algorithm version: latest.
2022-08-19 08:16:37,302 - INFO Ignoring unnecessary instance type: None.
2022-08-19 08:16:37,330 - INFO Creating training-job with name: huggingface-pytorch-training-2022-08-19-08-16-36-060
2022-08-19 08:16:37 Starting - Starting the training job...
2022-08-19 08:17:01 Starting - Preparing the instances for training
ProfilerReport-1660896997: InProgress
......
2022-08-19 08:18:01 Downloading - Downloading input data...
2022-08-19 08:18:41 Training - Downloading the training image........................
2022-08-19 08:22:42 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-08-19 08:22:41,913 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-08-19 08:22:41,936 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-08-19 08:22:41,942 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-08-19 08:22:42,507 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {},
    "current_host": "algo-1",
    "current_instance_group": "homogeneousCluster",
    "current_instance_group_hosts": [
        "algo-1"
    ],
    "current_instance_type": "ml.g4dn.xlarge",
    "distribution_hosts": [],
    "distribution_instance_groups": [],
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "context": "business",
        "hf_test_path": "hf_test.csv",
        "hf_test_results_path": "hf_model_results.csv",
        "lf_test_path": "lf_test.csv",
        "lf_test_results_path": "lf_model_results.csv",
        "model_params": {
            "pretrained_model_id": "bert-base-multilingual-cased",
            "num_train_epoch": 3,
            "per_device_train_batch_size": 16,
            "per_device_eval_batch_size": 16,
            "warmup_steps": 500,
            "weight_decay": 0.01,
            "logging_steps": 10
        },
        "model_path": "models/finetuned",
        "params_path": "params_business_mbert_cased.yaml",
        "paths_config_path": "paths.yaml",
        "run_id": "2022/08/19/094046794149",
        "train_path": "prep_train.csv"
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {},
    "input_dir": "/opt/ml/input",
    "instance_groups": [
        "homogeneousCluster"
    ],
    "instance_groups_dict": {
        "homogeneousCluster": {
            "instance_group_name": "homogeneousCluster",
            "instance_type": "ml.g4dn.xlarge",
            "hosts": [
                "algo-1"
            ]
        }
    },
    "is_hetero": false,
    "is_master": true,
    "is_modelparallel_enabled": null,
    "job_name": "huggingface-pytorch-training-2022-08-19-08-16-36-060",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-eu-west-1-XXXXXXXXXXX/huggingface-pytorch-training-2022-08-19-08-16-36-060/source/sourcedir.tar.gz",
    "module_name": "entrypoint",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 1,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.g4dn.xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.g4dn.xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "entrypoint.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"context":"business","hf_test_path":"hf_test.csv","hf_test_results_path":"hf_model_results.csv","lf_test_path":"lf_test.csv","lf_test_results_path":"lf_model_results.csv","model_params":{"logging_steps":10,"num_train_epoch":3,"per_device_eval_batch_size":16,"per_device_train_batch_size":16,"pretrained_model_id":"bert-base-multilingual-cased","warmup_steps":500,"weight_decay":0.01},"model_path":"models/finetuned","params_path":"params_business_mbert_cased.yaml","paths_config_path":"paths.yaml","run_id":"2022/08/19/094046794149","train_path":"prep_train.csv"}
SM_USER_ENTRY_POINT=entrypoint.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=[]
SM_CURRENT_HOST=algo-1
SM_CURRENT_INSTANCE_TYPE=ml.g4dn.xlarge
SM_CURRENT_INSTANCE_GROUP=homogeneousCluster
SM_CURRENT_INSTANCE_GROUP_HOSTS=["algo-1"]
SM_INSTANCE_GROUPS=["homogeneousCluster"]
SM_INSTANCE_GROUPS_DICT={"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}}
SM_DISTRIBUTION_INSTANCE_GROUPS=[]
SM_IS_HETERO=false
SM_MODULE_NAME=entrypoint
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-eu-west-1-XXXXXXXXXXXX/huggingface-pytorch-training-2022-08-19-08-16-36-060/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1","current_instance_group":"homogeneousCluster","current_instance_group_hosts":["algo-1"],"current_instance_type":"ml.g4dn.xlarge","distribution_hosts":[],"distribution_instance_groups":[],"framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"context":"business","hf_test_path":"hf_test.csv","hf_test_results_path":"hf_model_results.csv","lf_test_path":"lf_test.csv","lf_test_results_path":"lf_model_results.csv","model_params":{"logging_steps":10,"num_train_epoch":3,"per_device_eval_batch_size":16,"per_device_train_batch_size":16,"pretrained_model_id":"bert-base-multilingual-cased","warmup_steps":500,"weight_decay":0.01},"model_path":"models/finetuned","params_path":"params_business_mbert_cased.yaml","paths_config_path":"paths.yaml","run_id":"2022/08/19/094046794149","train_path":"prep_train.csv"},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","instance_groups":["homogeneousCluster"],"instance_groups_dict":{"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}},"is_hetero":false,"is_master":true,"is_modelparallel_enabled":null,"job_name":"huggingface-pytorch-training-2022-08-19-08-16-36-060","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-eu-west-1-XXXXXXXXXXXXXXX/huggingface-pytorch-training-2022-08-19-08-16-36-060/source/sourcedir.tar.gz","module_name":"entrypoint","network_interface_name":"eth0","num_cpus":4,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"entrypoint.py"}
SM_USER_ARGS=["--context","business","--hf_test_path","hf_test.csv","--hf_test_results_path","hf_model_results.csv","--lf_test_path","lf_test.csv","--lf_test_results_path","lf_model_results.csv","--model_params","logging_steps=10,num_train_epoch=3,per_device_eval_batch_size=16,per_device_train_batch_size=16,pretrained_model_id=bert-base-multilingual-cased,warmup_steps=500,weight_decay=0.01","--model_path","s3://niva-nlu-dev/runs/2022/08/19/094046794149/data/models/finetuned","--params_path","data/run/params_business_mbert_cased.yaml","--paths_config_path","niva_nlu/configs/paths.yaml","--run_id","2022/08/19/094046794149","--train_path","prep_train.csv"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_HP_CONTEXT=business
SM_HP_HF_TEST_PATH=hf_test.csv
SM_HP_HF_TEST_RESULTS_PATH=hf_model_results.csv
SM_HP_LF_TEST_PATH=lf_test.csv
SM_HP_LF_TEST_RESULTS_PATH=lf_model_results.csv
SM_HP_MODEL_PARAMS={"logging_steps":10,"num_train_epoch":3,"per_device_eval_batch_size":16,"per_device_train_batch_size":16,"pretrained_model_id":"bert-base-multilingual-cased","warmup_steps":500,"weight_decay":0.01}
SM_HP_MODEL_PATH=models/finetuned
SM_HP_PARAMS_PATH=params_business_mbert_cased.yaml
SM_HP_PATHS_CONFIG_PATH=paths.yaml
SM_HP_RUN_ID=2022/08/19/094046794149
SM_HP_TRAIN_PATH=prep_train.csv
PYTHONPATH=/opt/ml/code:/opt/conda/lib/python3.8/site-packages/smdistributed/dataparallel/lib:/opt/conda/bin:/opt/conda/lib/python38.zip:/opt/conda/lib/python3.8:/opt/conda/lib/python3.8/lib-dynload:/opt/conda/lib/python3.8/site-packages:/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument-3.4.2-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument_cext-0.2.4-py3.8-linux-x86_64.egg
Invoking script with the following command:
/opt/conda/bin/python3.8 entrypoint.py --context business --hf_test_path hf_test.csv --hf_test_results_path hf_model_results.csv --lf_test_path lf_test.csv --lf_test_results_path lf_model_results.csv --model_params logging_steps=10,num_train_epoch=3,per_device_eval_batch_size=16,per_device_train_batch_size=16,pretrained_model_id=bert-base-multilingual-cased,warmup_steps=500,weight_decay=0.01 --model_path models/finetuned --params_path params_business_mbert_cased.yaml --paths_config_path paths.yaml --run_id 2022/08/19/094046794149 --train_path prep_train.csv
Downloading:   0%|          | 0.00/2.06k [00:00<?, ?B/s]
[... tokenizer, config, and model weight download progress bars elided ...]
Downloading: 100%|██████████| 681M/681M [00:10<00:00, 67.4MB/s]
Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
wandb: Currently logged in as: emilio_marinone (pn-aa). Use `wandb login --relogin` to force relogin
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
  - Avoid using `tokenizers` before the fork if possible
  - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/opt/conda/lib/python3.8/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/opt/conda/lib/python3.8/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
/opt/conda/lib/python3.8/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/opt/conda/lib/python3.8/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
wandb: wandb version 0.13.1 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.12.19
wandb: Run data is saved locally in /opt/ml/code/wandb/run-20220819_082323-2022-08-19-094046794149
wandb: Run `wandb offline` to turn off syncing.
wandb: Resuming run 2022/08/19/094046794149
wandb: ⭐️ View project at https://wandb.ai/pn-aa/niva
wandb: 🚀 View run at https://wandb.ai/pn-aa/niva/runs/2022-08-19-094046794149
/opt/conda/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 99941
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 18741
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
0% 0/18741 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg/smdebug/core/utils.py", line 48, in <module>
    import smdistributed.modelparallel.tensorflow as smp
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/tensorflow/__init__.py", line 3, in <module>
    from smdistributed.modelparallel.tensorflow.comm import *
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/tensorflow/comm.py", line 4, in <module>
    from smdistributed.modelparallel.tensorflow.state_mod import state
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/tensorflow/state_mod.py", line 6, in <module>
    import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3037, in _dep_map
    return self.__dep_map
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2834, in __getattr__
    raise AttributeError(attr)
AttributeError: _DistInfoDistribution__dep_map

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3028, in _parsed_pkg_info
    return self._pkg_info
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2834, in __getattr__
    raise AttributeError(attr)
AttributeError: _pkg_info

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "entrypoint.py", line 483, in <module>
    main()  # pylint: disable=no-value-for-parameter
  File "entrypoint.py", line 347, in main
    trainer.train()
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1374, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 361, in __iter__
    return self._get_iterator()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 304, in _get_iterator
    return _SingleProcessDataLoaderIter(self)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 583, in __init__
    super(_SingleProcessDataLoaderIter, self).__init__(loader)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 507, in __init__
    self._smdebug_hook = get_smdebug_hook()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py", line 50, in get_smdebug_hook
    import smdebug.pytorch as smd
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
  File "<frozen zipimport>", line 259, in load_module
  File "/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg/smdebug/__init__.py", line 2, in <module>
    from smdebug.core.collection import CollectionKeys
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
  File "<frozen zipimport>", line 259, in load_module
  File "/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg/smdebug/core/collection.py", line 7, in <module>
    from .reduction_config import ReductionConfig
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
  File "<frozen zipimport>", line 259, in load_module
  File "/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg/smdebug/core/reduction_config.py", line 7, in <module>
    from smdebug.core.utils import split
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
  File "<frozen zipimport>", line 259, in load_module
  File "/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg/smdebug/core/utils.py", line 53, in <module>
    import smdistributed.modelparallel.torch as smp
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/__init__.py", line 11, in <module>
    from smdistributed.modelparallel.torch import amp, nn, smplib
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/amp/__init__.py", line 2, in <module>
    from smdistributed.modelparallel.torch.amp.scaler import GradScaler
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/amp/scaler.py", line 12, in <module>
    from smdistributed.modelparallel.torch.comm import CommGroup, allgather
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/comm.py", line 5, in <module>
    from smdistributed.modelparallel.torch.state_mod import state
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/state_mod.py", line 13, in <module>
    from smdistributed.modelparallel.backend.state_mod import ModelParallelState
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/backend/state_mod.py", line 11, in <module>
    from smdistributed.modelparallel.backend.utils import bijection_2d
  File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/backend/utils.py", line 11, in <module>
    from smexperiments.metrics import SageMakerFileMetricsWriter
  File "/opt/conda/lib/python3.8/site-packages/smexperiments/__init__.py", line 15, in <module>
    __version__ = pkg_resources.require("sagemaker-experiments")[0].version
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 909, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 803, in resolve
    new_requirements = dist.requires(req.extras)[::-1]
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2755, in requires
    dm = self._dep_map
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3039, in _dep_map
    self.__dep_map = self._compute_dependencies()
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3048, in _compute_dependencies
    for req in self._parsed_pkg_info.get_all('Requires-Dist') or []:
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3030, in _parsed_pkg_info
    metadata = self.get_metadata(self.PKG_INFO)
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1431, in get_metadata
    value = self._get(path)
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1635, in _get
    with open(path, 'rb') as stream:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.8/site-packages/urllib3-1.26.10.dist-info/METADATA'
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: 0.023 MB of 0.023 MB uploaded (0.000 MB deduped)
wandb: 
wandb: Run summary:
wandb: High freq test data link https://niva-nlu-dev...
wandb:  Low freq test data link https://niva-nlu-dev...
wandb:       Training data link https://niva-nlu-dev...
wandb:
wandb: Synced 2022/08/19/094046794149: https://wandb.ai/pn-aa/niva/runs/2022-08-19-094046794149
wandb: Synced 3 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20220819_082323-2022-08-19-094046794149/logs
2022-08-19 08:23:38,120 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.
2022-08-19 08:23:38,120 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 1 from exiting process.
2022-08-19 08:23:38,121 sagemaker-training-toolkit ERROR    Reporting training FAILURE
2022-08-19 08:23:38,121 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
ExitCode 1
ErrorMessage "ModuleNotFoundError: No module named 'tensorflow'#015
 #015
 During handling of the above exception, another exception occurred:#015
 Traceback (most recent call last):#015
 File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3037, in _dep_map#015
 return self.__dep_map#015
 File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2834, in __getattr__#015
 raise AttributeError(attr)#015
 AttributeError: _DistInfoDistribution__dep_map#015
 During handling of the above exception, another exception occurred
 File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3028, in _parsed_pkg_info#015
 return self._pkg_info#015
 AttributeError: _pkg_info#015
 Traceback (most recent call last)
 File "entrypoint.py", line 483, in <module>#015
 main()  # pylint: disable=no-value-for-parameter
 File "entrypoint.py", line 347, in main#015
 trainer.train()#015
 File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1374, in train#015
 for step, inputs in enumerate(epoch_iterator)
 File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 361, in __iter__#015
 return self._get_iterator()
 File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 304, in _get_iterator#015
 return _SingleProcessDataLoaderIter(self)#015
 File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 583, in __init__#015
 super(_SingleProcessDataLoaderIter, self).__init__(loader)#015
 File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 507, in __init__#015
 self._smdebug_hook = get_smdebug_hook()
 File "/opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py", line 50, in get_smdebug_hook#015
 import smdebug.pytorch as smd#015
 File "<frozen importlib._bootstrap>", line 991, in _find_and_load
 File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked#015
 File "<frozen importlib._bootstrap>", line 655, in _load_unlocked#015
 File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
 File "<frozen zipimport>", line 259, in load_module#015
 File "/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg/smdebug/__init__.py", line 2, in <module>#015
 from smdebug.core.collection import CollectionKeys#015
 File "<frozen importlib._bootstrap>", line 991, in _find_and_load#015
 File "/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg/smdebug/core/collection.py", line 7, in <module>#015
 from .reduction_config import ReductionConfig#015
 File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
 File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible#015
 File "/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg/smdebug/core/reduction_config.py", line 7, in <module>#015
 from smdebug.core.utils import split
 File "/opt/conda/lib/python3.8/site-packages/smdebug-1.0.13b20220724-py3.8.egg/smdebug/core/utils.py", line 53, in <module>#015
 import smdistributed.modelparallel.torch as smp
 File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/__init__.py", line 11, in <module>#015
 from smdistributed.modelparallel.torch import amp, nn, smplib#015
 File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/amp/__init__.py", line 2, in <module>#015
 from smdistributed.modelparallel.torch.amp.scaler import GradScaler#015
 File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/amp/scaler.py", line 12, in <module>#015
 from smdistributed.modelparallel.torch.comm import CommGroup, allgather
 File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/comm.py", line 5, in <module>#015
 from smdistributed.modelparallel.torch.state_mod import state
 File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/torch/state_mod.py", line 13, in <module>#015
 from smdistributed.modelparallel.backend.state_mod import ModelParallelState#015
 File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/backend/state_mod.py", line 11, in <module>#015
 from smdistributed.modelparallel.backend.utils import bijection_2d#015
 File "/opt/conda/lib/python3.8/site-packages/smdistributed/modelparallel/backend/utils.py", line 11, in <module>#015
 from smexperiments.metrics import SageMakerFileMetricsWriter#015
 File "/opt/conda/lib/python3.8/site-packages/smexperiments/__init__.py", line 15, in <module>#015
 __version__ = pkg_resources.require("sagemaker-experiments")[0].version#015
 File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 909, in require#015
 needed = self.resolve(parse_requirements(requirements))#015
 File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 803, in resolve#015
 new_requirements = dist.requires(req.extras)[::-1]
 File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2755, in requires#015
 dm = self._dep_map#015
 File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3039, in _dep_map#015
 self.__dep_map = self._compute_dependencies()#015
 File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3048, in _compute_dependencies#015
 for req in self._parsed_pkg_info.get_all('Requires-Dist') or []:#015
 File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3030, in _parsed_pkg_info#015
 metadata = self.get_metadata(self.PKG_INFO)
 File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1431, in get_metadata#015
 value = self._get(path)#015
 File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1635, in _get#015
 with open(path, 'rb') as stream:#015
 FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.8/site-packages/urllib3-1.26.10.dist-info/METADATA'
 wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
 wandb: - 0.000 MB of 0.000 MB uploaded (0.000 MB deduped)
 wandb: \ 0.000 MB of 0.000 MB uploaded (0.000 MB deduped)
 wandb: | 0.000 MB of 0.023 MB uploaded (0.000 MB deduped)
 wandb: / 0.000 MB of 0.023 MB uploaded (0.000 MB deduped)
 wandb: - 0.023 MB of 0.023 MB uploaded (0.000 MB deduped)
 wandb: \ 0.023 MB of 0.023 MB uploaded (0.000 MB deduped)
 wandb: | 0.023 MB of 0.023 MB uploaded (0.000 MB deduped)
 wandb: / 0.023 MB of 0.023 MB uploaded (0.000 MB deduped)
 wandb: \ 0.023 MB of 0.023 MB uploaded (0.000 MB deduped)#015wandb
 wandb
 wandb: Run summary
 wandb: High freq test data link https://niva-nlu-dev...
 wandb:  Low freq test data link https://niva-nlu-dev...
 wandb:       Training data link https://niva-nlu-dev...
 wandb: Synced 2022/08/19/094046794149: https://wandb.ai/pn-aa/niva/runs/2022-08-19-094046794149
 wandb: Synced 3 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
 wandb: Find logs at: ./wandb/run-20220819_082323-2022-08-19-094046794149/logs"2022-08-19 08:24:06,586 - ERROR Failed fitting hf estimator: Error for Training job huggingface-pytorch-training-2022-08-19-08-16-36-060: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "ModuleNotFoundError: No module named 'tensorflow'
 
 During handling of the above exception, another exception occurred:
 Traceback (most recent call last):
 File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3037, in _dep_map
 return self.__dep_map
 File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2834, in __getattr__
 raise AttributeError(attr)
 AttributeError: _DistInfoDistribution__dep_map
 During handling of the above exception, another exception occurred
 File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3028, in _parsed_pkg_info
 return self._pkg_info
 AttributeError: _pkg_info
 Traceback (most recent call last)
 File "entrypoint.py", line 483, in <module>
 main()  # pylint: disable=no-value-for-parameter
 File "entrypoint.py", line 347, in main
 trainer.train()
 File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 137

marinone94 · Aug 19 '22 13:08

Please note that running the entrypoint inside the DLC on my local machine (no GPU) works fine.
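A quick check to reproduce this (a hypothetical snippet; run in a Python shell inside the same DLC image): the import below is exactly what the patched dataloader triggers, per the trace above.

# Succeeds locally per this comment; raises the chain above on the training instance.
import smdebug.pytorch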

marinone94 · Aug 24 '22 16:08

I found that this was fixed for me when I changed the package versions pinned in my requirements.txt:

transformers==4.25.1
datasets==2.9.0
accelerate==0.16.0
evaluate==0.4.0
deepspeed==0.9.1
ninja
rouge-score 
nltk 
py7zr
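For reference, a hedged sketch of how these pins reach the job (assuming the standard SageMaker behavior that a requirements.txt placed in source_dir is pip-installed by the training toolkit before the entry point runs, so the versions above shadow the ones baked into the image):

hf_estimator = HuggingFace(
    entry_point='entrypoint.py',
    source_dir='.',  # contains entrypoint.py plus the requirements.txt above
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
)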

mritterfigma · Mar 06 '24 14:03

Closing this ticket since training works after updating the package versions. Feel free to reopen if the issue persists.

zhaoqizqwang · May 06 '24 17:05