
GPU MisconfigurationException

Open amogkam opened this issue 3 years ago • 12 comments

I am now confused about how to set

    resources_per_trial={'cpu': 4, 'gpu': 2}

I have 2 RTX 3090s and 64 CPUs.

I am passing

    pl.Trainer(..., gpus=2)

However, I am getting the following error from pytorch_lightning:

    self.accelerator_connector = AcceleratorConnector(
  File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
    self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
  File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
    gpus = _sanitize_gpu_ids(gpus)
  File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
    raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
 But your machine only has: [0]

Oddly enough, this only happens when I use the ray_lightning plugin. Not sure what's going on.

[0, 1] are both visible

Originally posted by @griff4692

amogkam avatar Apr 06 '21 17:04 amogkam

@griff4692 can you post the full stack trace for this please along with your full code if possible?

amogkam avatar Apr 06 '21 17:04 amogkam

```
Loaded CUI vocab of size=74814
Num GPUs --> 2
Num workers --> 64
2021-04-06 13:13:49,051 INFO services.py:1172 -- View the Ray dashboard at http://127.0.0.1:8273
2021-04-06 13:13:50,617 WARNING function_runner.py:540 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be func(config, checkpoint_dir=None).
== Status ==
Memory usage on this node: 48.8/125.7 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 8/64 CPUs, 0.25/2 GPUs, 0.0/48.14 GiB heap, 0.0/16.6 GiB objects (0/1.0 accelerator_type:RTX)
Result logdir: /home/griffin/ray_results/tune_sent_ent_aligner
Number of trials: 1/10 (1 RUNNING)
+--------------------+----------+-------+--------------+-------------+------------+-------------+--------------+
| Trial name         | status   | loc   |   output_dim |   embed_dim |     cui_lr |     sent_lr |   batch_size |
|--------------------+----------+-------+--------------+-------------+------------+-------------+--------------|
| _inner_74d84_00000 | RUNNING  |       |          200 |         100 | 0.00391355 | 7.33826e-05 |           32 |
+--------------------+----------+-------+--------------+-------------+------------+-------------+--------------+

(pid=1521963) Changing embed_dim from 100 to 300
(pid=1521963) Changing output_dim from 200 to 300
(pid=1521963) Changing sent_lr from 3e-05 to 1.652883812828233e-05
(pid=1521963) Changing cui_lr from 0.001 to 0.0026378939196798307
(pid=1521961) Changing sent_lr from 3e-05 to 7.338258644223258e-05
(pid=1521961) Changing cui_lr from 0.001 to 0.003913548144199003
(pid=1521961) Changing batch_size from 128 to 32
(pid=1521988) Changing embed_dim from 100 to 200
(pid=1521988) Changing output_dim from 200 to 100
(pid=1521988) Changing sent_lr from 3e-05 to 5.059410831951854e-05
(pid=1521988) Changing cui_lr from 0.001 to 0.00469312032251949
(pid=1521988) Changing batch_size from 128 to 32
(pid=1521999) Changing output_dim from 200 to 100
(pid=1521999) Changing sent_lr from 3e-05 to 1.5661503073099256e-05
(pid=1521999) Changing cui_lr from 0.001 to 0.0024815254177106465
(pid=1521966) Changing embed_dim from 100 to 300
(pid=1521966) Changing sent_lr from 3e-05 to 8.501012554229748e-05
(pid=1521966) Changing cui_lr from 0.001 to 0.004414751889797149
(pid=1521966) Changing batch_size from 128 to 64
(pid=1521965) Changing output_dim from 200 to 100
(pid=1521965) Changing sent_lr from 3e-05 to 2.833667298228741e-05
(pid=1521965) Changing cui_lr from 0.001 to 0.0010980645992781167
(pid=1521965) Changing batch_size from 128 to 32
(pid=1522026) Changing sent_lr from 3e-05 to 3.425053257256413e-05
(pid=1522026) Changing cui_lr from 0.001 to 0.0015645638046684058
(pid=1522026) Changing batch_size from 128 to 64
(pid=1521992) Changing embed_dim from 100 to 300
(pid=1521992) Changing sent_lr from 3e-05 to 1.973228894974385e-05
(pid=1521992) Changing cui_lr from 0.001 to 0.0005048808143862787
(pid=1521961) Loading train dataset...
(pid=1521961) Loading validation dataset...
(pid=1521963) Loading train dataset...
(pid=1521963) Loading validation dataset...
(pid=1521988) Loading train dataset...
(pid=1521988) Loading validation dataset...
(pid=1521999) Loading train dataset...
(pid=1521999) Loading validation dataset...
(pid=1521966) Loading train dataset...
(pid=1521966) Loading validation dataset...
(pid=1521965) Loading train dataset...
(pid=1521965) Loading validation dataset...
(pid=1521961) 2021-04-06 13:13:53,452 ERROR function_runner.py:254 -- Runner Thread raised error.
(pid=1521961) Traceback (most recent call last):
(pid=1521961)   File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=1521961)     self._entrypoint()
(pid=1521961)   File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=1521961)     return self._trainable_func(self.config, self._status_reporter,
(pid=1521961)   File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=1521961)     output = fn()
(pid=1521961)   File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 651, in _inner
(pid=1521961)     inner(config, checkpoint_dir=None)
(pid=1521961)   File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in inner
(pid=1521961)     fn(config, **fn_kwargs)
(pid=1521961)   File "train.py", line 122, in train
(pid=1521961)     trainer = pl.Trainer(
(pid=1521961)   File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
(pid=1521961)     return fn(self, **kwargs)
(pid=1521961)   File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 319, in __init__
(pid=1521961)     self.accelerator_connector = AcceleratorConnector(
(pid=1521961)   File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 119, in __init__
(pid=1521961)     self.parallel_device_ids = device_parser.parse_gpu_ids(self.gpus)
(pid=1521961)   File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 81, in parse_gpu_ids
(pid=1521961)     gpus = _sanitize_gpu_ids(gpus)
(pid=1521961)   File "/home/griffin/cl/lib/python3.8/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=1521961)     raise MisconfigurationException(
(pid=1521961) pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0, 1]
(pid=1521961)  But your machine only has: [0]

[... the same traceback is re-raised in "Exception in thread Thread-2", and the identical
 MisconfigurationException is raised by the other trial processes:
 pids 1521999, 1521965, 1521988, 1521966, 1522026 and 1521963 ...]

2021-04-06 13:13:53,593 ERROR trial_runner.py:616 -- Trial _inner_74d84_00001: Error processing event.
Traceback (most recent call last):
  File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/griffin/cl/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/griffin/cl/lib/python3.8/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=1521963, ip=192.168.1.171)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/trainable.py", line 167, in train_buffered
    result = self.train()
  File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/trainable.py", line 226, in train
    result = self.step()
  File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 366, in step
    self._report_thread_runner_error(block=True)
  File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/function_runner.py", line 512, in _report_thread_runner_error
    raise TuneError(
ray.tune.error.TuneError: Trial raised an exception. Traceback: ray::ImplicitFunc.train_buffered() (pid=1521963, ip=192.168.1.171)
  [... same MisconfigurationException traceback as above ...]
Result for _inner_74d84_00001:
  {}

== Status ==
Memory usage on this node: 55.2/125.7 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 0/64 CPUs, 0.0/2 GPUs, 0.0/48.14 GiB heap, 0.0/16.6 GiB objects (0/1.0 accelerator_type:RTX)
Result logdir: /home/griffin/ray_results/tune_sent_ent_aligner
Number of trials: 9/10 (1 ERROR, 8 TERMINATED)
+--------------------+------------+-------+--------------+-------------+-------------+-------------+--------------+
| Trial name         | status     | loc   |   output_dim |   embed_dim |      cui_lr |     sent_lr |   batch_size |
|--------------------+------------+-------+--------------+-------------+-------------+-------------+--------------|
| _inner_74d84_00000 | TERMINATED |       |          200 |         100 |  0.00391355 | 7.33826e-05 |           32 |
| _inner_74d84_00002 | TERMINATED |       |          200 |         300 |  0.00441475 | 8.50101e-05 |           64 |
| _inner_74d84_00003 | TERMINATED |       |          100 |         100 |  0.00109806 | 2.83367e-05 |           32 |
| _inner_74d84_00004 | TERMINATED |       |          200 |         100 |  0.00156456 | 3.42505e-05 |           64 |
| _inner_74d84_00005 | TERMINATED |       |          200 |         300 | 0.000504881 | 1.97323e-05 |          128 |
| _inner_74d84_00006 | TERMINATED |       |          100 |         200 |  0.00469312 | 5.05941e-05 |           32 |
| _inner_74d84_00007 | TERMINATED |       |          100 |         100 |  0.00248153 | 1.56615e-05 |          128 |
| _inner_74d84_00008 | TERMINATED |       |          100 |         300 |   0.0022021 |  4.0612e-05 |           64 |
| _inner_74d84_00001 | ERROR      |       |          300 |         300 |  0.00263789 | 1.65288e-05 |          128 |
+--------------------+------------+-------+--------------+-------------+-------------+-------------+--------------+
Number of errored trials: 1
+--------------------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name         |   # failures | error file                                                                                                                                                                         |
|--------------------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| _inner_74d84_00001 |            1 | /home/griffin/ray_results/tune_sent_ent_aligner/_inner_74d84_00001_1_batch_size=128,cui_lr=0.0026379,embed_dim=300,output_dim=300,sent_lr=1.6529e-05_2021-04-06_13-13-50/error.txt |
+--------------------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

(pid=1521992) Loading train dataset...
(pid=1521992) Loading validation dataset...
(pid=1521992) 2021-04-06 13:13:53,622 ERROR function_runner.py:254 -- Runner Thread raised error.
  [... same MisconfigurationException traceback as above for pid 1521992 ...]
Traceback (most recent call last):
  File "train.py", line 191, in <module>
    tune_hparams(args, cui_vocab, tokenizer, data_dir, num_workers, num_gpus)
  File "train.py", line 45, in tune_hparams
    analysis = tune.run(
  File "/home/griffin/cl/lib/python3.8/site-packages/ray/tune/tune.py", line 444, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [_inner_74d84_00001])
[6]+  Killed                  python train.py --experiment debug
```

griff4692 avatar Apr 06 '21 17:04 griff4692

Could you also share your code, please? Thanks!

amogkam avatar Apr 06 '21 17:04 amogkam

```python
import os
import pickle

import argparse
import pytorch_lightning as pl
from pytorch_lightning import loggers as pl_loggers
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler, PopulationBasedTraining
from ray.tune.integration.pytorch_lightning import TuneReportCallback, TuneReportCheckpointCallback
from ray_lightning import RayPlugin
import torch
from transformers import AutoTokenizer

from preprocess.shared.shared_constants import BASE_DIR
from ents.cui_vocab import Vocab
from ents.data_utils import EntDataModule
from ents.models.ent_sent_aligner import EntSentAligner
from ents.models.huggingface_constants import BERT, RoBERTa


def tune_hparams(args, cui_vocab, tokenizer, data_dir, num_workers, num_gpus, num_epochs=4):
    assert args.sent_encoder == 'bert'
    assert args.cui_encoder == 'lstm'
    config = {
        'embed_dim': tune.choice([100, 200, 300]),
        'output_dim': tune.choice([100, 200, 300]),
        'sent_lr': tune.loguniform(1e-5, 1e-4),
        'cui_lr': tune.loguniform(5e-4, 5e-3),
        'batch_size': tune.choice([32, 64, 128]),
    }

    scheduler = ASHAScheduler(
        max_t=10,
        grace_period=1,
        reduction_factor=2
    )

    reporter = CLIReporter(
        parameter_columns=['output_dim', 'embed_dim', 'cui_lr', 'sent_lr', 'batch_size'],
        metric_columns=['val_loss', 'val_mrr', 'training_iteration'])

    analysis = tune.run(
        tune.with_parameters(
            train,
            args=args,
            cui_vocab=cui_vocab,
            tokenizer=tokenizer,
            data_dir=data_dir,
            num_workers=num_workers,
            num_gpus=num_gpus,
            num_epochs=num_epochs,
            is_tune=True,
        ),
        resources_per_trial={
            'cpu': 8,
            'gpu': 0.25
        },
        metric='val_mrr',
        mode='max',
        config=config,
        num_samples=10,
        scheduler=scheduler,
        progress_reporter=reporter,
        name='tune_sent_ent_aligner',
        # https://wood-b.github.io/post/a-novices-guide-to-hyperparameter-optimization-at-scale/#how-to-select-an-hpo-strategy
        checkpoint_at_end=False,
        sync_on_checkpoint=False,
        fail_fast=True,
        # reuse_actors=True  # Requires a reset_config function
    )

    print('Best hyperparameters found were: ', analysis.best_config)


def train_w_args(args, cui_vocab, tokenizer, data_dir, num_workers, num_gpus):
    num_epochs = 100 if args.debug else 10
    save_dir = os.path.expanduser('~/weights/ent_pretrain')
    logger = pl_loggers.WandbLogger(
        name=args.experiment,
        save_dir=save_dir,
        offline=args.debug or args.mini,
        project='entity-pretrain',
        entity='clinsum'
    )

    logger.log_hyperparams(args)
    train({}, args, cui_vocab, tokenizer, data_dir, num_workers, num_gpus, num_epochs, logger=logger, is_tune=False)


def train(config, args, cui_vocab, tokenizer, data_dir, num_workers, num_gpus, num_epochs, logger=None, is_tune=False):
    for k, v in config.items():
        prev_v = getattr(args, k)
        if v != prev_v:
            setattr(args, k, v)
            print(f'Changing {k} from {prev_v} to {v}')

    pin_memory = num_gpus is not None and not args.debug
    precision = 16 if num_gpus is not None else 32
    model = EntSentAligner(args, cui_vocab, tokenizer)
    dataset = EntDataModule(
        args, data_dir, cui_vocab, tokenizer, model.prepare_batch, num_workers=num_workers, pin_memory=pin_memory)

    checkpoint_callback = ModelCheckpoint(
        monitor='val_loss',
        save_top_k=1,
        save_last=True,
        mode='min'
    )

    early_stopping = EarlyStopping('val_loss')
    callbacks = [early_stopping, checkpoint_callback]
    plugins = None
    if is_tune:
        tune_callback = TuneReportCallback({'val_loss': 'val_loss', 'val_mrr': 'val_mrr'}, on='validation_end')
        callbacks.append(tune_callback)
        plugins = [RayPlugin(num_workers=num_workers, use_gpu=True)]

    experiment_dir = os.path.expanduser(os.path.join('~/weights/ents', args.experiment))
    trainer = pl.Trainer(
        logger=logger,
        callbacks=None if args.debug or args.mini else callbacks,
        precision=precision,
        gpus=num_gpus,
        accelerator=None if num_gpus is None or num_gpus == 1 else 'ddp',
        terminate_on_nan=True,
        val_check_interval=1.0 if args.debug else 0.25,
        default_root_dir=experiment_dir,
        plugins=plugins,
        max_epochs=num_epochs,
        overfit_batches=1 if args.debug else 0.0,
        # gradient_clip_val=0,
        # https://pytorch-lightning.readthedocs.io/en/latest/advanced/training_tricks.html
        # stochastic_weight_avg=True
    )
    print('Starting training...')
    trainer.fit(model, dataset)


if __name__ == '__main__':
    parser = argparse.ArgumentParser('Train Sent-Aligner')
    parser.add_argument('--batch_size', default=128, type=int)
    parser.add_argument('--cui_lr', default=1e-3, type=float)
    parser.add_argument('--sent_lr', default=3e-5, type=float,
                        help='Should be lower than CUI LR when using a pre-trained transformer')
    parser.add_argument('--output_dim', default=200, type=int)
    parser.add_argument('--embed_dim', default=100, type=int)
    parser.add_argument('-tune', default=False, action='store_true')
    parser.add_argument('--experiment', default='default')
    parser.add_argument('-debug', action='store_true', default=False)
    parser.add_argument('-mini', action='store_true', default=False)
    parser.add_argument('-dsum_only', action='store_true', default=True)
    parser.add_argument('--sent_encoder', default='bert', choices=['lstm', 'roberta', 'bert'])
    parser.add_argument('--cui_encoder', default='lstm', choices=['lstm', 'transformer'])
    parser.add_argument('--max_cui_seq_len', default=512, choices=[128, 256, 512])
    parser.add_argument('--objective', default='s|e', choices=['s|e', 'e|s'])
    parser.add_argument('-cpu', default=False, action='store_true')
    parser.add_argument('--cui_transformer_layers', default=4, type=int, help='Used if --cui_encoder=transformer')
    parser.add_argument('--cui_att_heads', default=4, type=int, help='Used if --cui_encoder=transformer')

    args = parser.parse_args()
    if args.mini:
        args.experiment = 'default'
    if args.debug:
        args.mini = True

    if args.sent_encoder == 'roberta':
        raise Exception('RoBERTa currently not supported.')

    cui_vocab_fn = os.path.join(BASE_DIR, 'ent_pretrain', 'cui_vocab.pk')
    model_str = BERT if args.sent_encoder == 'bert' else RoBERTa
    tokenizer = AutoTokenizer.from_pretrained(model_str)

    with open(cui_vocab_fn, 'rb') as fd:
        cui_vocab = pickle.load(fd)

    cui_v = len(cui_vocab)
    print('Loaded CUI vocab of size={}'.format(cui_v))

    data_dir = os.path.join(BASE_DIR, 'ent_pretrain')
    num_gpus = torch.cuda.device_count() if torch.cuda.is_available() and not args.cpu else None
    if num_gpus is not None and args.debug:
        num_gpus = 1
    num_workers = round(os.cpu_count() * 1.0)

    print('Num GPUs --> {}'.format(num_gpus))
    print('Num workers --> {}'.format(num_workers))
    if args.tune:
        tune_hparams(args, cui_vocab, tokenizer, data_dir, num_workers, num_gpus)
        exit(0)

    train_w_args(args, cui_vocab, tokenizer, data_dir, num_workers, num_gpus)
```

griff4692 avatar Apr 06 '21 17:04 griff4692

The output of

    list(range(torch.cuda.device_count()))

is [0, 1], so I'm confused as to how it comes up with only [0] when I use the ray_lightning code.

griff4692 avatar Apr 06 '21 17:04 griff4692

Ah, so the problem is that the gpus value passed into pl.Trainer does not match the gpu count set in resources_per_trial. Can you change the gpus passed into pl.Trainer to 1? That will fix this issue, but your code still won't work fully. To fix the next error that comes up, you have to modify your resources_per_trial to match what's in the examples:

resources_per_trial={
            "cpu": 1,
            "gpu": int(use_gpu),
            "extra_cpu": num_workers,
            "extra_gpu": num_workers * int(use_gpu)
        }

Note that this will result in 1 GPU per trial not being utilized. I am currently working on a fix for this, but if you want a hacky way to avoid this, you can follow the thread here https://github.com/ray-project/ray_lightning/issues/23.
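Putting the two changes together, here is a rough, self-contained sketch (not the exact code from this thread; the dummy `train` function and the specific numbers are only illustrative) of how `gpus=1` in the Trainer and the `extra_cpu`/`extra_gpu` entries in `resources_per_trial` are meant to line up:

```python
# Illustrative sketch only -- the dummy trainable stands in for the real
# train() function that builds the LightningModule and Trainer.
from ray import tune

use_gpu = True
num_workers = 1  # ray_lightning workers launched per Tune trial


def train(config, checkpoint_dir=None):
    # Inside the trial, the Trainer would be built with gpus=1 (not 2), e.g.
    #   pl.Trainer(gpus=1,
    #              plugins=[RayPlugin(num_workers=num_workers, use_gpu=use_gpu)], ...)
    tune.report(val_mrr=0.0)  # placeholder metric so the sketch runs on its own


analysis = tune.run(
    train,
    resources_per_trial={
        "cpu": 1,
        "gpu": int(use_gpu),                      # reserved for the trial itself (currently unused)
        "extra_cpu": num_workers,                 # CPUs for the RayPlugin workers
        "extra_gpu": num_workers * int(use_gpu),  # GPUs for the RayPlugin workers
    },
    num_samples=1,
)
```

On a 2-GPU machine this reserves both GPUs for a single trial, which is why only one trial can run at a time until the unused-GPU issue is fixed.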

amogkam avatar Apr 06 '21 17:04 amogkam

Thanks - I changed everything to 1

    resources_per_trial={
        'cpu': 1,
        'gpu': 1,
        'extra_cpu': 1,
        'extra_gpu': 1
    },

and set num_gpus passed to pl.Trainer to 1 as well, and I am getting...

```
(pid=1545888) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
2021-04-06 14:37:01,746 WARNING worker.py:1107 -- The actor or task with ID ffffffffffffffffaca3ed3a841b06585769c4b801000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {62.000000/64.000000 CPU, 31.933594 GiB/31.933594 GiB memory, 0.000000/2.000000 GPU, 1.000000/1.000000 node:192.168.1.171, 10.986328 GiB/10.986328 GiB object_store_memory, 1.000000/1.000000 accelerator_type:RTX}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
```

griff4692 avatar Apr 06 '21 18:04 griff4692

How many workers are you using (the num_workers passed into RayPlugin)? On a 2-GPU machine, with 1 GPU not being utilized (see https://github.com/ray-project/ray_lightning/issues/32#issuecomment-814313347), you can run a maximum of 1 worker.
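As a rough accounting (assuming the resources_per_trial pattern above), each trial currently reserves one GPU for the trial actor plus one GPU per RayPlugin worker:

```python
# Illustrative arithmetic for a 2-GPU machine with the current plugin behavior.
num_workers = 1
gpu_for_trial_actor = 1             # the "gpu" entry; reserved but unused today
gpus_for_workers = num_workers * 1  # the "extra_gpu" entry
total_gpus_per_trial = gpu_for_trial_actor + gpus_for_workers  # = 2
# With only 2 GPUs available, at most one such trial (and so one worker) can be
# scheduled at a time; asking for 2 workers would need 3 GPUs per trial.
```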

amogkam avatar Apr 06 '21 18:04 amogkam

I am using 1 worker as well

griff4692 avatar Apr 06 '21 18:04 griff4692

Looking at your code, it seems like num_workers is being set to the CPU count (num_workers = round(os.cpu_count() * 1.0)), is that correct? Wouldn't that set num_workers to 64?

amogkam avatar Apr 06 '21 18:04 amogkam

Sorry, I removed that code, so now num_workers is set to 1. It is actually working now, but I believe it is only running a single task at a time and using a single GPU. What is the advantage, then, over not using this plugin?

griff4692 avatar Apr 06 '21 18:04 griff4692

Great, glad it's working now!

So this plugin is useful for general-purpose distributed training with PyTorch Lightning on either a single node or a large multi-node cluster. When used with Tune, it also allows each Tune trial to be run in a distributed fashion.

I am currently working on a fix for the unutilized GPU issue, but if you would like, you can follow the thread in #23 for a hacky fix. With that fix, you would be able to run Tune with each trial using 2 workers, which is not possible without this plugin. Even without that fix, you could use more workers per trial if you had more GPUs. But you're right: as it stands, when used with Tune with just 1 worker, there is no significant benefit to using this plugin over standard Tune + PyTorch Lightning.
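For context, outside of Tune the plugin is just a PyTorch Lightning plugin. A minimal, self-contained sketch of that standalone use (not code from this thread; `BoringModel` is a toy stand-in, and it assumes a machine or cluster with at least 2 GPUs) might look like:

```python
# Minimal sketch, not code from this thread: RayPlugin used for plain distributed
# training without Tune. BoringModel is a toy stand-in for a real LightningModule.
import ray
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from ray_lightning import RayPlugin


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-2)


if __name__ == '__main__':
    ray.init()
    data = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    loader = DataLoader(data, batch_size=16)
    # Two Ray workers, one GPU each: training is sharded across them by the plugin.
    trainer = pl.Trainer(max_epochs=1, plugins=[RayPlugin(num_workers=2, use_gpu=True)])
    trainer.fit(BoringModel(), loader)
```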

amogkam avatar Apr 06 '21 19:04 amogkam