ray_lightning
GPU MisconfigurationException
I am confused about how to set
resources_per_trial={ 'cpu': 4, 'gpu': 2 }. I have 2 RTX 3090s and 64 CPUs.
I am passing
pl.Trainer(...gpus=2). However, I am getting the following error from pytorch_lightning:
self.accelerator_connector = AcceleratorConnector(
File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
gpus = _sanitize_gpu_ids(gpus)
File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
But your machine only has: [0]
Oddly enough, this only happens when I use the ray_lightning plugin. I'm not sure what's going on.
GPUs [0, 1] are both visible.
Originally posted by @griff4692
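For readers hitting the same message, here is a minimal sketch of one mechanism that can produce it. It assumes (this is not confirmed anywhere in the thread) that the trial process runs with `CUDA_VISIBLE_DEVICES` restricted to the GPUs Ray allocated to it, and that CUDA renumbers the visible devices starting from 0. The helper names `visible_gpu_ids` and `check_requested` are hypothetical and only mimic the shape of the check; they are not PyTorch Lightning or Ray APIs.

```python
from typing import List, Mapping, Optional


def visible_gpu_ids(env: Mapping[str, str], physical_gpus: int) -> List[int]:
    """Return the logical GPU indices a process would see (hypothetical helper).

    If CUDA_VISIBLE_DEVICES is unset, all physical GPUs are visible.
    Otherwise only the listed devices are visible, renumbered from 0.
    """
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return list(range(physical_gpus))
    devices = [d for d in raw.split(",") if d.strip()]
    return list(range(len(devices)))


def check_requested(requested: List[int], available: List[int]) -> Optional[str]:
    """Return an error message if any requested GPU index is not visible."""
    missing = [g for g in requested if g not in available]
    if missing:
        return (
            f"You requested GPUs: {requested}\n"
            f"But your machine only has: {available}"
        )
    return None


if __name__ == "__main__":
    # Assumed scenario: the machine has 2 GPUs, but the trial process was
    # started with only one of them exposed.
    trial_env = {"CUDA_VISIBLE_DEVICES": "1"}
    print(check_requested([0, 1], visible_gpu_ids(trial_env, physical_gpus=2)))
```

Under that assumption, a process that was handed a single GPU sees only index 0, so asking for two GPUs inside that process fails even though both cards are visible from the shell.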
@griff4692 can you post the full stack trace for this please along with your full code if possible?
Loaded CUI vocab of size=74814
Num GPUs --> 2
Num workers --> 64
2021-04-06 13:13:49,051 INFO services.py:1172 -- View the Ray dashboard at http://127.0.0.1:8273
2021-04-06 13:13:50,617 WARNING function_runner.py:540 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be `func(config, checkpoint_dir=None)`.
== Status ==
Memory usage on this node: 48.8/125.7 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 8/64 CPUs, 0.25/2 GPUs, 0.0/48.14 GiB heap, 0.0/16.6 GiB objects (0/1.0 accelerator_type:RTX)
Result logdir: /home/griffin/ray_results/tune_sent_ent_aligner
Number of trials: 1/10 (1 RUNNING)
+--------------------+----------+-------+--------------+-------------+------------+-------------+--------------+
| Trial name | status | loc | output_dim | embed_dim | cui_lr | sent_lr | batch_size |
|--------------------+----------+-------+--------------+-------------+------------+-------------+--------------|
| _inner_74d84_00000 | RUNNING | | 200 | 100 | 0.00391355 | 7.33826e-05 | 32 |
+--------------------+----------+-------+--------------+-------------+------------+-------------+--------------+
(pid=1521963) Changing embed_dim from 100 to 300
(pid=1521963) Changing output_dim from 200 to 300
(pid=1521963) Changing sent_lr from 3e-05 to 1.652883812828233e-05
(pid=1521963) Changing cui_lr from 0.001 to 0.0026378939196798307
(pid=1521961) Changing sent_lr from 3e-05 to 7.338258644223258e-05
(pid=1521961) Changing cui_lr from 0.001 to 0.003913548144199003
(pid=1521961) Changing batch_size from 128 to 32
(pid=1521988) Changing embed_dim from 100 to 200
(pid=1521988) Changing output_dim from 200 to 100
(pid=1521988) Changing sent_lr from 3e-05 to 5.059410831951854e-05
(pid=1521988) Changing cui_lr from 0.001 to 0.00469312032251949
(pid=1521988) Changing batch_size from 128 to 32
(pid=1521999) Changing output_dim from 200 to 100
(pid=1521999) Changing sent_lr from 3e-05 to 1.5661503073099256e-05
(pid=1521999) Changing cui_lr from 0.001 to 0.0024815254177106465
(pid=1521966) Changing embed_dim from 100 to 300
(pid=1521966) Changing sent_lr from 3e-05 to 8.501012554229748e-05
(pid=1521966) Changing cui_lr from 0.001 to 0.004414751889797149
(pid=1521966) Changing batch_size from 128 to 64
(pid=1521965) Changing output_dim from 200 to 100
(pid=1521965) Changing sent_lr from 3e-05 to 2.833667298228741e-05
(pid=1521965) Changing cui_lr from 0.001 to 0.0010980645992781167
(pid=1521965) Changing batch_size from 128 to 32
(pid=1522026) Changing sent_lr from 3e-05 to 3.425053257256413e-05
(pid=1522026) Changing cui_lr from 0.001 to 0.0015645638046684058
(pid=1522026) Changing batch_size from 128 to 64
(pid=1521992) Changing embed_dim from 100 to 300
(pid=1521992) Changing sent_lr from 3e-05 to 1.973228894974385e-05
(pid=1521992) Changing cui_lr from 0.001 to 0.0005048808143862787
(pid=1521961) Loading train dataset...
(pid=1521961) Loading validation dataset...
(pid=1521963) Loading train dataset...
(pid=1521963) Loading validation dataset...
(pid=1521988) Loading train dataset...
(pid=1521988) Loading validation dataset...
(pid=1521999) Loading train dataset...
(pid=1521999) Loading validation dataset...
(pid=1521966) Loading train dataset...
(pid=1521966) Loading validation dataset...
(pid=1521965) Loading train dataset...
(pid=1521965) Loading validation dataset...
(pid=1521961) 2021-04-06 13:13:53,452 ERROR function_runner.py:254 -- Runner Thread raised error.
(pid=1521961) Traceback (most recent call last):
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=1521961) self._entrypoint()
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=1521961) return self._trainable_func(self.config, self._status_reporter,
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=1521961) output = fn()
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 651, in _inner
(pid=1521961) inner(config, checkpoint_dir=None)
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in inner
(pid=1521961) fn(config, **fn_kwargs)
(pid=1521961) File "train.py", line 122, in train
(pid=1521961) trainer = pl.Trainer(
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
(pid=1521961) return fn(self, **kwargs)
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 319, in __init__
(pid=1521961) self.accelerator_connector = AcceleratorConnector(
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
(pid=1521961) self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
(pid=1521961) gpus = _sanitize_gpu_ids(gpus)
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=1521961) raise MisconfigurationException(
(pid=1521961) pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
(pid=1521961) But your machine only has: [0]
(pid=1521961) Exception in thread Thread-2:
(pid=1521961) Traceback (most recent call last):
(pid=1521961) File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
(pid=1521961) self.run()
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 267, in run
(pid=1521961) raise e
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=1521961) self._entrypoint()
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=1521961) return self._trainable_func(self.config, self._status_reporter,
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=1521961) output = fn()
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 651, in _inner
(pid=1521961) inner(config, checkpoint_dir=None)
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in inner
(pid=1521961) fn(config, **fn_kwargs)
(pid=1521961) File "train.py", line 122, in train
(pid=1521961) trainer = pl.Trainer(
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
(pid=1521961) return fn(self, **kwargs)
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 319, in __init__
(pid=1521961) self.accelerator_connector = AcceleratorConnector(
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
(pid=1521961) self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
(pid=1521961) gpus = _sanitize_gpu_ids(gpus)
(pid=1521961) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=1521961) raise MisconfigurationException(
(pid=1521961) pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
(pid=1521961) But your machine only has: [0]
(pid=1521999) 2021-04-06 13:13:53,477 ERROR function_runner.py:254 -- Runner Thread raised error.
(pid=1521999) Traceback (most recent call last):
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=1521999) self._entrypoint()
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=1521999) return self._trainable_func(self.config, self._status_reporter,
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=1521999) output = fn()
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 651, in _inner
(pid=1521999) inner(config, checkpoint_dir=None)
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in inner
(pid=1521999) fn(config, **fn_kwargs)
(pid=1521999) File "train.py", line 122, in train
(pid=1521999) trainer = pl.Trainer(
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
(pid=1521999) return fn(self, **kwargs)
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 319, in __init__
(pid=1521999) self.accelerator_connector = AcceleratorConnector(
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
(pid=1521999) self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
(pid=1521999) gpus = _sanitize_gpu_ids(gpus)
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=1521999) raise MisconfigurationException(
(pid=1521999) pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
(pid=1521999) But your machine only has: [0]
(pid=1521999) Exception in thread Thread-2:
(pid=1521999) Traceback (most recent call last):
(pid=1521999) File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
(pid=1521999) self.run()
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 267, in run
(pid=1521999) raise e
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=1521999) self._entrypoint()
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=1521999) return self._trainable_func(self.config, self._status_reporter,
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=1521999) output = fn()
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 651, in _inner
(pid=1521999) inner(config, checkpoint_dir=None)
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in inner
(pid=1521999) fn(config, **fn_kwargs)
(pid=1521999) File "train.py", line 122, in train
(pid=1521999) trainer = pl.Trainer(
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
(pid=1521999) return fn(self, **kwargs)
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 319, in __init__
(pid=1521999) self.accelerator_connector = AcceleratorConnector(
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
(pid=1521999) self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
(pid=1521999) gpus = _sanitize_gpu_ids(gpus)
(pid=1521999) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=1521999) raise MisconfigurationException(
(pid=1521999) pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
(pid=1521999) But your machine only has: [0]
(pid=1522026) Loading train dataset...
(pid=1522026) Loading validation dataset...
(pid=1521965) 2021-04-06 13:13:53,489 ERROR function_runner.py:254 -- Runner Thread raised error.
(pid=1521965) Traceback (most recent call last):
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=1521965) self._entrypoint()
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=1521965) return self._trainable_func(self.config, self._status_reporter,
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=1521965) output = fn()
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 651, in _inner
(pid=1521965) inner(config, checkpoint_dir=None)
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in inner
(pid=1521965) fn(config, **fn_kwargs)
(pid=1521965) File "train.py", line 122, in train
(pid=1521965) trainer = pl.Trainer(
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
(pid=1521965) return fn(self, **kwargs)
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 319, in __init__
(pid=1521965) self.accelerator_connector = AcceleratorConnector(
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
(pid=1521965) self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
(pid=1521965) gpus = _sanitize_gpu_ids(gpus)
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=1521965) raise MisconfigurationException(
(pid=1521965) pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
(pid=1521965) But your machine only has: [0]
(pid=1521965) Exception in thread Thread-2:
(pid=1521965) Traceback (most recent call last):
(pid=1521965) File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
(pid=1521965) self.run()
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 267, in run
(pid=1521965) raise e
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=1521965) self._entrypoint()
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=1521965) return self._trainable_func(self.config, self._status_reporter,
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=1521965) output = fn()
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 651, in _inner
(pid=1521965) inner(config, checkpoint_dir=None)
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in inner
(pid=1521965) fn(config, **fn_kwargs)
(pid=1521965) File "train.py", line 122, in train
(pid=1521965) trainer = pl.Trainer(
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
(pid=1521965) return fn(self, **kwargs)
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 319, in __init__
(pid=1521965) self.accelerator_connector = AcceleratorConnector(
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
(pid=1521965) self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
(pid=1521965) gpus = _sanitize_gpu_ids(gpus)
(pid=1521965) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=1521965) raise MisconfigurationException(
(pid=1521965) pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
(pid=1521965) But your machine only has: [0]
(pid=1521988) 2021-04-06 13:13:53,491 ERROR function_runner.py:254 -- Runner Thread raised error.
(pid=1521988) Traceback (most recent call last):
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=1521988) self._entrypoint()
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=1521988) return self._trainable_func(self.config, self._status_reporter,
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=1521988) output = fn()
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 651, in _inner
(pid=1521988) inner(config, checkpoint_dir=None)
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in inner
(pid=1521988) fn(config, **fn_kwargs)
(pid=1521988) File "train.py", line 122, in train
(pid=1521988) trainer = pl.Trainer(
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
(pid=1521988) return fn(self, **kwargs)
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 319, in __init__
(pid=1521988) self.accelerator_connector = AcceleratorConnector(
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
(pid=1521988) self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
(pid=1521988) gpus = _sanitize_gpu_ids(gpus)
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=1521988) raise MisconfigurationException(
(pid=1521988) pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
(pid=1521988) But your machine only has: [0]
(pid=1521988) Exception in thread Thread-2:
(pid=1521988) Traceback (most recent call last):
(pid=1521988) File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
(pid=1521988) self.run()
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 267, in run
(pid=1521988) raise e
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=1521988) self._entrypoint()
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=1521988) return self._trainable_func(self.config, self._status_reporter,
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=1521988) output = fn()
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 651, in _inner
(pid=1521988) inner(config, checkpoint_dir=None)
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in inner
(pid=1521988) fn(config, **fn_kwargs)
(pid=1521988) File "train.py", line 122, in train
(pid=1521988) trainer = pl.Trainer(
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
(pid=1521988) return fn(self, **kwargs)
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 319, in __init__
(pid=1521988) self.accelerator_connector = AcceleratorConnector(
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
(pid=1521988) self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
(pid=1521988) gpus = _sanitize_gpu_ids(gpus)
(pid=1521988) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=1521988) raise MisconfigurationException(
(pid=1521988) pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
(pid=1521988) But your machine only has: [0]
(pid=1521966) 2021-04-06 13:13:53,512 ERROR function_runner.py:254 -- Runner Thread raised error.
(pid=1521966) Traceback (most recent call last):
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=1521966) self._entrypoint()
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=1521966) return self._trainable_func(self.config, self._status_reporter,
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=1521966) output = fn()
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 651, in _inner
(pid=1521966) inner(config, checkpoint_dir=None)
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in inner
(pid=1521966) fn(config, **fn_kwargs)
(pid=1521966) File "train.py", line 122, in train
(pid=1521966) trainer = pl.Trainer(
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
(pid=1521966) return fn(self, **kwargs)
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 319, in __init__
(pid=1521966) self.accelerator_connector = AcceleratorConnector(
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
(pid=1521966) self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
(pid=1521966) gpus = _sanitize_gpu_ids(gpus)
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=1521966) raise MisconfigurationException(
(pid=1521966) pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
(pid=1521966) But your machine only has: [0]
(pid=1521966) Exception in thread Thread-2:
(pid=1521966) Traceback (most recent call last):
(pid=1521966) File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
(pid=1521966) self.run()
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 267, in run
(pid=1521966) raise e
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=1521966) self._entrypoint()
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=1521966) return self._trainable_func(self.config, self._status_reporter,
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=1521966) output = fn()
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 651, in _inner
(pid=1521966) inner(config, checkpoint_dir=None)
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in inner
(pid=1521966) fn(config, **fn_kwargs)
(pid=1521966) File "train.py", line 122, in train
(pid=1521966) trainer = pl.Trainer(
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
(pid=1521966) return fn(self, **kwargs)
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 319, in __init__
(pid=1521966) self.accelerator_connector = AcceleratorConnector(
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
(pid=1521966) self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
(pid=1521966) gpus = _sanitize_gpu_ids(gpus)
(pid=1521966) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=1521966) raise MisconfigurationException(
(pid=1521966) pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
(pid=1521966) But your machine only has: [0]
(pid=1522026) 2021-04-06 13:13:53,515 ERROR function_runner.py:254 -- Runner Thread raised error.
(pid=1522026) Traceback (most recent call last):
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=1522026) self._entrypoint()
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=1522026) return self._trainable_func(self.config, self._status_reporter,
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=1522026) output = fn()
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 651, in _inner
(pid=1522026) inner(config, checkpoint_dir=None)
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in inner
(pid=1522026) fn(config, **fn_kwargs)
(pid=1522026) File "train.py", line 122, in train
(pid=1522026) trainer = pl.Trainer(
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
(pid=1522026) return fn(self, **kwargs)
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 319, in __init__
(pid=1522026) self.accelerator_connector = AcceleratorConnector(
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
(pid=1522026) self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
(pid=1522026) gpus = _sanitize_gpu_ids(gpus)
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=1522026) raise MisconfigurationException(
(pid=1522026) pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
(pid=1522026) But your machine only has: [0]
(pid=1522026) Exception in thread Thread-2:
(pid=1522026) Traceback (most recent call last):
(pid=1522026) File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
(pid=1522026) self.run()
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 267, in run
(pid=1522026) raise e
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=1522026) self._entrypoint()
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=1522026) return self._trainable_func(self.config, self._status_reporter,
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=1522026) output = fn()
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 651, in _inner
(pid=1522026) inner(config, checkpoint_dir=None)
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in inner
(pid=1522026) fn(config, **fn_kwargs)
(pid=1522026) File "train.py", line 122, in train
(pid=1522026) trainer = pl.Trainer(
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
(pid=1522026) return fn(self, **kwargs)
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 319, in __init__
(pid=1522026) self.accelerator_connector = AcceleratorConnector(
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
(pid=1522026) self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
(pid=1522026) gpus = _sanitize_gpu_ids(gpus)
(pid=1522026) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=1522026) raise MisconfigurationException(
(pid=1522026) pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
(pid=1522026) But your machine only has: [0]
(pid=1521963) 2021-04-06 13:13:53,498 ERROR function_runner.py:254 -- Runner Thread raised error.
(pid=1521963) Traceback (most recent call last):
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=1521963) self._entrypoint()
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=1521963) return self._trainable_func(self.config, self._status_reporter,
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=1521963) output = fn()
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 651, in _inner
(pid=1521963) inner(config, checkpoint_dir=None)
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in inner
(pid=1521963) fn(config, **fn_kwargs)
(pid=1521963) File "train.py", line 122, in train
(pid=1521963) trainer = pl.Trainer(
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
(pid=1521963) return fn(self, **kwargs)
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 319, in __init__
(pid=1521963) self.accelerator_connector = AcceleratorConnector(
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
(pid=1521963) self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
(pid=1521963) gpus = _sanitize_gpu_ids(gpus)
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=1521963) raise MisconfigurationException(
(pid=1521963) pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
(pid=1521963) But your machine only has: [0]
(pid=1521963) Exception in thread Thread-2:
(pid=1521963) Traceback (most recent call last):
(pid=1521963) File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
(pid=1521963) self.run()
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 267, in run
(pid=1521963) raise e
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=1521963) self._entrypoint()
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=1521963) return self._trainable_func(self.config, self._status_reporter,
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=1521963) output = fn()
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 651, in _inner
(pid=1521963) inner(config, checkpoint_dir=None)
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in inner
(pid=1521963) fn(config, **fn_kwargs)
(pid=1521963) File "train.py", line 122, in train
(pid=1521963) trainer = pl.Trainer(
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
(pid=1521963) return fn(self, **kwargs)
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 319, in __init__
(pid=1521963) self.accelerator_connector = AcceleratorConnector(
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
(pid=1521963) self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
(pid=1521963) gpus = _sanitize_gpu_ids(gpus)
(pid=1521963) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=1521963) raise MisconfigurationException(
(pid=1521963) pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
(pid=1521963) But your machine only has: [0]
2021-04-06 13:13:53,593 ERROR trial_runner.py:616 -- Trial _inner_74d84_00001: Error processing event.
Traceback (most recent call last):
File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
results = self.trial_executor.fetch_result(trial)
File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "/home/griffin/cl/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
return func(*args, **kwargs)
File "/home/griffin/cl/lib/python3.8/site-packages/ray/worker.py", line 1456, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=1521963, ip=192.168.1.171)
File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/trainable.py", line 167, in train_buffered
result = self.train()
File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/trainable.py", line 226, in train
result = self.step()
File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 366, in step
self._report_thread_runner_error(block=True)
File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 512, in _report_thread_runner_error
raise TuneError(
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=1521963, ip=192.168.1.171)
File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
self._entrypoint()
File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
return self._trainable_func(self.config, self._status_reporter,
File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
output = fn()
File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 651, in _inner
inner(config, checkpoint_dir=None)
File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in inner
fn(config, **fn_kwargs)
File "train.py", line 122, in train
trainer = pl.Trainer(
File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
return fn(self, **kwargs)
File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 319, in __init__
self.accelerator_connector = AcceleratorConnector(
File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
gpus = _sanitize_gpu_ids(gpus)
File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
But your machine only has: [0]
Result for _inner_74d84_00001:
{}
== Status ==
Memory usage on this node: 55.2/125.7 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 0/64 CPUs, 0.0/2 GPUs, 0.0/48.14 GiB heap, 0.0/16.6 GiB objects (0/1.0 accelerator_type:RTX)
Result logdir: /home/griffin/ray_results/tune_sent_ent_aligner
Number of trials: 9/10 (1 ERROR, 8 TERMINATED)
+--------------------+------------+-------+--------------+-------------+-------------+-------------+--------------+
| Trial name         | status     | loc   |   output_dim |   embed_dim |      cui_lr |     sent_lr |   batch_size |
|--------------------+------------+-------+--------------+-------------+-------------+-------------+--------------|
| _inner_74d84_00000 | TERMINATED |       |          200 |         100 |  0.00391355 | 7.33826e-05 |           32 |
| _inner_74d84_00002 | TERMINATED |       |          200 |         300 |  0.00441475 | 8.50101e-05 |           64 |
| _inner_74d84_00003 | TERMINATED |       |          100 |         100 |  0.00109806 | 2.83367e-05 |           32 |
| _inner_74d84_00004 | TERMINATED |       |          200 |         100 |  0.00156456 | 3.42505e-05 |           64 |
| _inner_74d84_00005 | TERMINATED |       |          200 |         300 | 0.000504881 | 1.97323e-05 |          128 |
| _inner_74d84_00006 | TERMINATED |       |          100 |         200 |  0.00469312 | 5.05941e-05 |           32 |
| _inner_74d84_00007 | TERMINATED |       |          100 |         100 |  0.00248153 | 1.56615e-05 |          128 |
| _inner_74d84_00008 | TERMINATED |       |          100 |         300 |   0.0022021 |  4.0612e-05 |           64 |
| _inner_74d84_00001 | ERROR      |       |          300 |         300 |  0.00263789 | 1.65288e-05 |          128 |
+--------------------+------------+-------+--------------+-------------+-------------+-------------+--------------+
Number of errored trials: 1
+--------------------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name         |   # failures | error file                                                                                                                                                                         |
|--------------------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| _inner_74d84_00001 |            1 | /home/griffin/ray_results/tune_sent_ent_aligner/_inner_74d84_00001_1_batch_size=128,cui_lr=0.0026379,embed_dim=300,output_dim=300,sent_lr=1.6529e-05_2021-04-06_13-13-50/error.txt |
+--------------------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
(pid=1521992) Loading train dataset...
(pid=1521992) Loading validation dataset...
(pid=1521992) 2021-04-06 13:13:53,622 ERROR function_runner.py:254 -- Runner Thread raised error.
(pid=1521992) Traceback (most recent call last):
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=1521992) self._entrypoint()
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=1521992) return self._trainable_func(self.config, self._status_reporter,
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=1521992) output = fn()
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 651, in _inner
(pid=1521992) inner(config, checkpoint_dir=None)
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in inner
(pid=1521992) fn(config, **fn_kwargs)
(pid=1521992) File "train.py", line 122, in train
(pid=1521992) trainer = pl.Trainer(
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
(pid=1521992) return fn(self, **kwargs)
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 319, in __init__
(pid=1521992) self.accelerator_connector = AcceleratorConnector(
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
(pid=1521992) self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
(pid=1521992) gpus = _sanitize_gpu_ids(gpus)
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=1521992) raise MisconfigurationException(
(pid=1521992) pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
(pid=1521992) But your machine only has: [0]
(pid=1521992) Exception in thread Thread-2:
(pid=1521992) Traceback (most recent call last):
(pid=1521992) File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
(pid=1521992) self.run()
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 267, in run
(pid=1521992) raise e
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=1521992) self._entrypoint()
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=1521992) return self._trainable_func(self.config, self._status_reporter,
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=1521992) output = fn()
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 651, in _inner
(pid=1521992) inner(config, checkpoint_dir=None)
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in inner
(pid=1521992) fn(config, **fn_kwargs)
(pid=1521992) File "train.py", line 122, in train
(pid=1521992) trainer = pl.Trainer(
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
(pid=1521992) return fn(self, **kwargs)
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 319, in __init__
(pid=1521992) self.accelerator_connector = AcceleratorConnector(
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
(pid=1521992) self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
(pid=1521992) gpus = _sanitize_gpu_ids(gpus)
(pid=1521992) File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=1521992) raise MisconfigurationException(
(pid=1521992) pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
(pid=1521992) But your machine only has: [0]
Traceback (most recent call last):
File "train.py", line 191, in
Could you also share your code, please? Thanks!
```python
import os
import pickle

import argparse
import pytorch_lightning as pl
from pytorch_lightning import loggers as pl_loggers
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler, PopulationBasedTraining
from ray.tune.integration.pytorch_lightning import TuneReportCallback, TuneReportCheckpointCallback
from ray_lightning import RayPlugin
import torch
from transformers import AutoTokenizer

from preprocess.shared.shared_constants import BASE_DIR
from ents.cui_vocab import Vocab
from ents.data_utils import EntDataModule
from ents.models.ent_sent_aligner import EntSentAligner
from ents.models.huggingface_constants import BERT, RoBERTa


def tune_hparams(args, cui_vocab, tokenizer, data_dir, num_workers, num_gpus, num_epochs=4):
    assert args.sent_encoder == 'bert'
    assert args.cui_encoder == 'lstm'
    config = {
        'embed_dim': tune.choice([100, 200, 300]),
        'output_dim': tune.choice([100, 200, 300]),
        'sent_lr': tune.loguniform(1e-5, 1e-4),
        'cui_lr': tune.loguniform(5e-4, 5e-3),
        'batch_size': tune.choice([32, 64, 128]),
    }

    scheduler = ASHAScheduler(
        max_t=10,
        grace_period=1,
        reduction_factor=2
    )

    reporter = CLIReporter(
        parameter_columns=['output_dim', 'embed_dim', 'cui_lr', 'sent_lr', 'batch_size'],
        metric_columns=['val_loss', 'val_mrr', 'training_iteration'])

    analysis = tune.run(
        tune.with_parameters(
            train,
            args=args,
            cui_vocab=cui_vocab,
            tokenizer=tokenizer,
            data_dir=data_dir,
            num_workers=num_workers,
            num_gpus=num_gpus,
            num_epochs=num_epochs,
            is_tune=True,
        ),
        resources_per_trial={
            'cpu': 8,
            'gpu': 0.25
        },
        metric='val_mrr',
        mode='max',
        config=config,
        num_samples=10,
        scheduler=scheduler,
        progress_reporter=reporter,
        name='tune_sent_ent_aligner',
        # https://wood-b.github.io/post/a-novices-guide-to-hyperparameter-optimization-at-scale/#how-to-select-an-hpo-strategy
        checkpoint_at_end=False,
        sync_on_checkpoint=False,
        fail_fast=True,
        # reuse_actors=True  # Requires a reset_config function
    )

    print('Best hyperparameters found were: ', analysis.best_config)


def train_w_args(args, cui_vocab, tokenizer, data_dir, num_workers, num_gpus):
    num_epochs = 100 if args.debug else 10
    save_dir = os.path.expanduser('~/weights/ent_pretrain')
    logger = pl_loggers.WandbLogger(
        name=args.experiment,
        save_dir=save_dir,
        offline=args.debug or args.mini,
        project='entity-pretrain',
        entity='clinsum'
    )
    logger.log_hyperparams(args)
    train({}, args, cui_vocab, tokenizer, data_dir, num_workers, num_gpus, num_epochs, logger=logger, is_tune=False)


def train(config, args, cui_vocab, tokenizer, data_dir, num_workers, num_gpus, num_epochs, logger=None, is_tune=False):
    for k, v in config.items():
        prev_v = getattr(args, k)
        if v != prev_v:
            setattr(args, k, v)
            print(f'Changing {k} from {prev_v} to {v}')

    pin_memory = num_gpus is not None and not args.debug
    precision = 16 if num_gpus is not None else 32
    model = EntSentAligner(args, cui_vocab, tokenizer)
    dataset = EntDataModule(
        args, data_dir, cui_vocab, tokenizer, model.prepare_batch, num_workers=num_workers, pin_memory=pin_memory)

    checkpoint_callback = ModelCheckpoint(
        monitor='val_loss',
        save_top_k=1,
        save_last=True,
        mode='min'
    )
    early_stopping = EarlyStopping('val_loss')
    callbacks = [early_stopping, checkpoint_callback]
    plugins = None
    if is_tune:
        tune_callback = TuneReportCallback({'val_loss': 'val_loss', 'val_mrr': 'val_mrr'}, on='validation_end')
        callbacks.append(tune_callback)
        plugins = [RayPlugin(num_workers=num_workers, use_gpu=True)]
    experiment_dir = os.path.expanduser(os.path.join('~/weights/ents', args.experiment))
    trainer = pl.Trainer(
        logger=logger,
        callbacks=None if args.debug or args.mini else callbacks,
        precision=precision,
        gpus=num_gpus,
        accelerator=None if num_gpus is None or num_gpus == 1 else 'ddp',
        terminate_on_nan=True,
        val_check_interval=1.0 if args.debug else 0.25,
        default_root_dir=experiment_dir,
        plugins=plugins,
        max_epochs=num_epochs,
        overfit_batches=1 if args.debug else 0.0,
        # gradient_clip_val=0,
        # https://pytorch-lightning.readthedocs.io/en/latest/advanced/training_tricks.html
        # stochastic_weight_avg=True
    )

    print('Starting training...')
    trainer.fit(model, dataset)


if __name__ == '__main__':
    parser = argparse.ArgumentParser('Train Sent-Aligner')
    parser.add_argument('--batch_size', default=128, type=int)
    parser.add_argument('--cui_lr', default=1e-3, type=float)
    parser.add_argument('--sent_lr', default=3e-5, type=float,
                        help='Should be lower than CUI LR when using a pre-trained transformer')
    parser.add_argument('--output_dim', default=200, type=int)
    parser.add_argument('--embed_dim', default=100, type=int)
    parser.add_argument('-tune', default=False, action='store_true')
    parser.add_argument('--experiment', default='default')
    parser.add_argument('-debug', action='store_true', default=False)
    parser.add_argument('-mini', action='store_true', default=False)
    parser.add_argument('-dsum_only', action='store_true', default=True)
    parser.add_argument('--sent_encoder', default='bert', choices=['lstm', 'roberta', 'bert'])
    parser.add_argument('--cui_encoder', default='lstm', choices=['lstm', 'transformer'])
    parser.add_argument('--max_cui_seq_len', default=512, choices=[128, 256, 512])
    parser.add_argument('--objective', default='s|e', choices=['s|e', 'e|s'])
    parser.add_argument('-cpu', default=False, action='store_true')
    parser.add_argument('--cui_transformer_layers', default=4, type=int, help='Used if --cui_encoder=transformer')
    parser.add_argument('--cui_att_heads', default=4, type=int, help='Used if --cui_encoder=transformer')

    args = parser.parse_args()

    if args.mini:
        args.experiment = 'default'
    if args.debug:
        args.mini = True

    if args.sent_encoder == 'roberta':
        raise Exception('RoBERTa currently not supported.')

    cui_vocab_fn = os.path.join(BASE_DIR, 'ent_pretrain', 'cui_vocab.pk')
    model_str = BERT if args.sent_encoder == 'bert' else RoBERTa
    tokenizer = AutoTokenizer.from_pretrained(model_str)
    with open(cui_vocab_fn, 'rb') as fd:
        cui_vocab = pickle.load(fd)
    cui_v = len(cui_vocab)
    print('Loaded CUI vocab of size={}'.format(cui_v))

    data_dir = os.path.join(BASE_DIR, 'ent_pretrain')
    num_gpus = torch.cuda.device_count() if torch.cuda.is_available() and not args.cpu else None
    if num_gpus is not None and args.debug:
        num_gpus = 1
    num_workers = round(os.cpu_count() * 1.0)
    print('Num GPUs --> {}'.format(num_gpus))
    print('Num workers --> {}'.format(num_workers))
    if args.tune:
        tune_hparams(args, cui_vocab, tokenizer, data_dir, num_workers, num_gpus)
        exit(0)
    train_w_args(args, cui_vocab, tokenizer, data_dir, num_workers, num_gpus)
```
The output of `list(range(torch.cuda.device_count()))` is `[0, 1]`, so I'm confused as to how it comes up as `[0]` when I use the ray-lightning code.
Ah, so the problem is that the `gpus` passed into `pl.Trainer` is not the same as the `gpus` set in `resources_per_trial`. Can you change the `gpus` passed into `pl.Trainer` to be 1? That will fix this issue, but your code still won't work fully. To fix the new error that comes up, you have to modify your `resources_per_trial` to be like what's in the examples:
resources_per_trial={
"cpu": 1,
"gpu": int(use_gpu),
"extra_cpu": num_workers,
"extra_gpu": num_workers * int(use_gpu)
}
Note that this will result in 1 GPU per trial not being utilized. I am currently working on a fix for this, but if you want a hacky way to avoid this, you can follow the thread here https://github.com/ray-project/ray_lightning/issues/23.
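For context on why the trial only sees `[0]`: Ray restricts each trial process to the GPUs it was allocated by setting `CUDA_VISIBLE_DEVICES`, so the device count inside a trial can be smaller than on the driver. Below is a minimal sketch of that mismatch. The helpers `visible_gpu_ids` and `check_requested_gpus` are hypothetical (they are not part of Ray or PyTorch Lightning), and the environment is simulated with a plain dict rather than queried from a real GPU:

```python
import os


def visible_gpu_ids(env=None):
    """Hypothetical helper: logical GPU ids a process would see.

    Only parses CUDA_VISIBLE_DEVICES; it does not query the driver.
    Returns None when the variable is unset (no restriction applied).
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    physical = [tok for tok in raw.split(",") if tok.strip()]
    # CUDA renumbers the visible devices starting from 0.
    return list(range(len(physical)))


def check_requested_gpus(requested, env=None):
    """Illustrative check in the spirit of the Lightning error above."""
    visible = visible_gpu_ids(env)
    wanted = list(range(requested))
    if visible is not None and len(wanted) > len(visible):
        raise ValueError(
            f"You requested GPUs: {wanted} but only {visible} are visible"
        )
    return wanted


# Simulated trial environment: the trial was allocated one physical GPU (id 1).
trial_env = {"CUDA_VISIBLE_DEVICES": "1"}
print(visible_gpu_ids(trial_env))          # [0]
print(check_requested_gpus(1, trial_env))  # [0]
```

Calling `check_requested_gpus(2, trial_env)` raises, which mirrors passing `gpus=2` to `pl.Trainer` inside a trial that was only allocated one GPU.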
Thanks - I changed everything to 1
resources_per_trial={
'cpu': 1,
'gpu': 1,
'extra_cpu': 1,
'extra_gpu': 1
},
and set the num_gpus passed to pl.Trainer to 1 as well, and am now getting:
(pid=1545888) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0] 2021-04-06 14:37:01,746 WARNING worker.py:1107 -- The actor or task with ID ffffffffffffffffaca3ed3a841b06585769c4b801000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {62.000000/64.000000 CPU, 31.933594 GiB/31.933594 GiB memory, 0.000000/2.000000 GPU, 1.000000/1.000000 node:192.168.1.171, 10.986328 GiB/10.986328 GiB object_store_memory, 1.000000/1.000000 accelerator_type:RTX} . In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
How many workers are you using (`num_workers` passed into `RayPlugin`)? On a 2-GPU machine, with 1 GPU not being utilized (see https://github.com/ray-project/ray_lightning/issues/32#issuecomment-814313347), you can run a maximum of 1 worker.
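The arithmetic behind that limit can be sketched as follows, assuming the `resources_per_trial` layout from the examples (1 GPU for the trial itself plus 1 GPU per worker). The function names are hypothetical and only do bookkeeping; they do not call Ray:

```python
def gpus_per_trial(num_workers, use_gpu=True):
    """GPUs reserved by one trial: 'gpu' plus 'extra_gpu' from resources_per_trial."""
    gpu = int(use_gpu)
    extra_gpu = num_workers * int(use_gpu)
    return gpu + extra_gpu


def max_workers(total_gpus):
    """Largest num_workers for which a single GPU trial still fits on the node.

    One GPU is held by the trial itself, so only the remainder is left for workers.
    """
    return max(total_gpus - 1, 0)


print(gpus_per_trial(num_workers=1))   # 2
print(max_workers(total_gpus=2))       # 1
print(gpus_per_trial(num_workers=64))  # 65
```

With `num_workers` left at the CPU count (64), a single trial would ask for far more GPUs than the node has, so the worker actors cannot be scheduled.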
I am using 1 worker as well
Looking at your code, it seems like num_workers is being set to the CPU count, is that correct? `num_workers = round(os.cpu_count() * 1.0)`. Wouldn't that set `num_workers` to 64?
Sorry, I removed that code, so now num_workers is set to 1. It is actually working now, but I believe it is only running a single task at a time and using a single GPU. What is the advantage, then, of using this plugin over not using it?
Great, glad it's working now!
So this plugin is useful for general-purpose distributed training with PyTorch Lightning on either a single node or a large multi-node cluster. When used with Tune, it also allows each Tune trial to be run in a distributed fashion.
I am currently working on a fix for the unutilized GPU issue, but if you would like, you can follow the thread in #23 for a hacky fix. With that fix, you would be able to run Tune with each trial using 2 workers, which is not possible without this plugin. Even without that fix, you could use more workers if you had more GPUs. But you're right: as it stands, when used with Tune with just 1 worker, there's no significant benefit to using this plugin over standard Tune + PyTorch Lightning.