feat: add a Colab notebook as TPU playground
Known Issues
~~I haven't really started debugging issues below.~~
CPU/GPU
The error below is caused by a breaking change in torch 1.9.0, so downgrading to torch 1.8.1 (or perhaps 1.8.2) is necessary.
training: 0% 0/50 [00:00<?, ?it/s]wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Traceback (most recent call last):
File "bsmetadata/train.py", line 195, in main
loss = loss_fn(batch, outputs, metadata_mask)
File "bsmetadata/train.py", line 83, in loss_fn
b = outputs.logits.size(0)
AttributeError: 'NoneType' object has no attribute 'size'
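To make the CPU/GPU failure above easier to spot before training starts, a small version guard could be run first. This is only a sketch: the helper names `parse_major_minor` and `is_known_good_torch` are hypothetical and not part of `bsmetadata`, and treating every release from 1.9 onward as affected is an assumption, since only 1.9.0 was observed to break.

```python
import re


def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '1.8.1+cu102'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_known_good_torch(version: str) -> bool:
    """Return True for torch releases older than 1.9.

    Assumption: everything from 1.9 onward is treated as affected,
    although only 1.9.0 was actually seen to fail.
    """
    return parse_major_minor(version) < (1, 9)
```

It could be called as `is_known_good_torch(torch.__version__)` at the top of a notebook cell, printing a warning when it returns `False`.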
TPU
I haven't figured this out yet, but it does not seem to break the results logged to wandb.
eval: 100%|███████████████████████████████████████| 7/7 [00:05<00:00, 1.30it/s]
tcmalloc: large alloc 1099718656 bytes == 0x55f2c54e8000 @ 0x7fb67d86c615 0x55f17a91102c 0x55f17a9f117a 0x55f17aa29953 0x55f17a914f2e 0x55f17a914e59 0x55f17a8b635e 0x55f17a9c5a0f 0x55f17a9c5ad8 0x55f17a9c744b 0x55f17a9c6537 0x55f17a8b3f75 0x55f17a9c75dc 0x55f17a9c620a 0x55f17a9c5c58 0x55f17a9c744b 0x55f17a9c620a 0x55f17aa5b484 0x55f17a9c769c 0x55f17a9c620a 0x55f17a9c691e 0x55f17a9c516c 0x55f17aa5ba2b 0x55f17a913e4d 0x55f17aa05c0d 0x55f17a9880d8 0x55f17a983235 0x55f17a91573a 0x55f17a983b0e 0x55f17a982c35 0x55f17a91573a
tcmalloc: large alloc 1374650368 bytes == 0x55f307626000 @ 0x7fb67d86c615 0x55f17a91102c 0x55f17a9f117a 0x55f17aa29953 0x55f17a914f2e 0x55f17a914e59 0x55f17aa5aa71 0x55f17aa5b5a2 0x55f17a9cd423 0x55f17a9c5a0f 0x55f17a9c5ad8 0x55f17a9c744b 0x55f17a9c6537 0x55f17a8b3f75 0x55f17a9c75dc 0x55f17a9c620a 0x55f17a9c5c58 0x55f17a9c744b 0x55f17a9c620a 0x55f17a9c691e 0x55f17a9c68d1 0x55f17a9c68d1 0x55f17a9c6968 0x55f17a9c516c 0x55f17aa5ba2b 0x55f17a913e4d 0x55f17aa05c0d 0x55f17a9880d8 0x55f17a983235 0x55f17a91573a 0x55f17a983b0e
tcmalloc: large alloc 1718312960 bytes == 0x55f35951e000 @ 0x7fb67d86c615 0x55f17a91102c 0x55f17a9f117a 0x55f17aa29953 0x55f17a914f2e 0x55f17a914e59 0x55f17a8b635e 0x55f17a9c5a0f 0x55f17a9c5ad8 0x55f17a9c744b 0x55f17a9c6537 0x55f17a8b3f75 0x55f17a9c75dc 0x55f17a9c620a 0x55f17a9c5c58 0x55f17a9c744b 0x55f17a9c620a 0x55f17a9c6968 0x55f17a9c68d1 0x55f17a9c68d1 0x55f17a9c6968 0x55f17a9c516c 0x55f17aa5ba2b 0x55f17a913e4d 0x55f17aa05c0d 0x55f17a9880d8 0x55f17a983235 0x55f17a91573a 0x55f17a983b0e 0x55f17a982c35 0x55f17a91573a
tcmalloc: large alloc 2147893248 bytes == 0x55f3bfbd4000 @ 0x7fb67d86c615 0x55f17a91102c 0x55f17a9f117a 0x55f17aa29953 0x55f17a914f2e 0x55f17a914e59 0x55f17a8b635e 0x55f17a9c5a0f 0x55f17a9c5ad8 0x55f17a9c744b 0x55f17a9c6537 0x55f17a8b3f75 0x55f17a9c75dc 0x55f17a9c620a 0x55f17a9c5c58 0x55f17a9c744b 0x55f17a9c620a 0x55f17a9c6968 0x55f17a9c68d1 0x55f17a9c68d1 0x55f17a9c6968 0x55f17a9c516c 0x55f17aa5ba2b 0x55f17a913e4d 0x55f17aa05c0d 0x55f17a9880d8 0x55f17a983235 0x55f17a91573a 0x55f17a983b0e 0x55f17a982c35 0x55f17a91573a
tcmalloc: large alloc 2684870656 bytes == 0x55f43fc38000 @ 0x7fb67d86c615 0x55f17a91102c 0x55f17a9f117a 0x55f17aa29953 0x55f17a914f2e 0x55f17a914e59 0x55f17a8b635e 0x55f17a9c5a0f 0x55f17a9c5ad8 0x55f17a9c744b 0x55f17a9c6537 0x55f17a8b3f75 0x55f17a9c75dc 0x55f17a9c620a 0x55f17a9c5c58 0x55f17a9c744b 0x55f17a9c620a 0x55f17a9c6968 0x55f17a9c69b1 0x55f17a9c68d1 0x55f17a9c6968 0x55f17a9c516c 0x55f17aa5ba2b 0x55f17a913e4d 0x55f17aa05c0d 0x55f17a9880d8 0x55f17a983235 0x55f17a91573a 0x55f17a983b0e 0x55f17a982c35 0x55f17a91573a
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/accelerate_cli.py", line 41, in main
args.func(args)
File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 384, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.7/dist-packages/accelerate/commands/launch.py", line 142, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'bsmetadata/train.py', 'max_train_steps=50', 'num_eval=1', 'data_config.experiment=without_metadata', 'data_config.per_device_eval_batch_size=4', 'data_config.train_file=/content/drive/MyDrive/colab_data/bigscience/cc_news.jsonl', 'data_config.validation_split_percentage=1']' died with <Signals.SIGKILL: 9>.
/usr/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 6 leaked semaphores to clean up at shutdown
len(cache))
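The `tcmalloc` lines in the TPU log above report allocations that grow from about 1.1 GB to about 2.7 GB, each roughly 1.25 times the previous one. A short script along these lines could pull those sizes out of a saved log to make the pattern easier to see. It is a sketch with hypothetical helper names (`large_alloc_sizes`, `growth_ratios`), and it only summarizes the log; it does not establish what triggered the `SIGKILL`.

```python
import re

# Matches the byte count in lines like
# "tcmalloc: large alloc 1099718656 bytes == 0x... @ ..."
TCMALLOC_RE = re.compile(r"tcmalloc: large alloc (\d+) bytes")


def large_alloc_sizes(log_text: str) -> list:
    """Return the byte count of every 'tcmalloc: large alloc' line, in order."""
    return [int(n) for n in TCMALLOC_RE.findall(log_text)]


def growth_ratios(sizes: list) -> list:
    """Return the ratio of each allocation size to the one before it."""
    return [later / earlier for earlier, later in zip(sizes, sizes[1:])]
```

Feeding the captured output of the run into `large_alloc_sizes` and then `growth_ratios` gives a compact view of how the allocations grow during evaluation.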