
Does `Trainer(devices=1)` use all CPUs?

Open · maximilienleclei opened this issue 1 year ago • 6 comments

Bug description

https://github.com/Lightning-AI/pytorch-lightning/blob/3740546899aedad77c80db6b57f194e68c455e28/src/lightning/fabric/accelerators/cpu.py#L75

When `cpu_cores` is a list of integers, this line always raises an exception, which it should not according to the Trainer documentation and this function's signature.
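For context, the check around the linked line is roughly the following (a paraphrase for illustration, not the exact source):

from typing import List, Union

def _parse_cpu_cores(cpu_cores: Union[int, str, List[int]]) -> int:
    # A digit string like "8" is converted to an int ...
    if isinstance(cpu_cores, str) and cpu_cores.strip().isdigit():
        cpu_cores = int(cpu_cores)
    # ... but anything that is not a positive int, including the List[int]
    # permitted by the type hint above, raises here.
    if not isinstance(cpu_cores, int) or cpu_cores <= 0:
        raise TypeError("`devices` selected with `CPUAccelerator` should be an int > 0.")
    return cpu_cores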

What version are you seeing the problem on?

master

How to reproduce the bug

No response

Error messages and logs

No response

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @borda

maximilienleclei · Mar 07 '24

@MaximilienLC Do you mean this?

trainer = Trainer(
    accelerator="cpu",
    devices=[1, 3],
)

TypeError: `devices` selected with `CPUAccelerator` should be an int > 0.

This is correct. It does not make sense to select device indices on a CPU. Device indices are meant for accelerators like CUDA GPUs or TPUs.

If you select `devices=1` (an int) on CPU, PyTorch will already use all cores and threads when appropriate. Setting `devices > 1` just simulates DDP on CPU; it does not give you any speed benefit and is only meant for debugging DDP.
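As an illustration (a minimal sketch, not part of the original reply):

from lightning.pytorch import Trainer

# devices=1 on CPU: a single process; PyTorch's intra-op thread pool already
# spreads tensor ops across the available cores where the op allows it.
trainer = Trainer(accelerator="cpu", devices=1)

# devices=2 on CPU: launches two processes with the DDP strategy, useful for
# debugging distributed code but not for speed.
trainer_ddp_debug = Trainer(accelerator="cpu", devices=2)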

awaelchli · Mar 08 '24

If there is documentation that contradicts this, please point me to it so we can update it. Thanks!

awaelchli · Mar 08 '24

https://github.com/Lightning-AI/pytorch-lightning/blob/3740546899aedad77c80db6b57f194e68c455e28/src/lightning/pytorch/trainer/trainer.py#L146

I guess this section could add that info for CPU.

So you mean devices=1 on CPU equates to using all CPUs?

maximilienleclei · Mar 08 '24

> So you mean devices=1 on CPU equates to using all CPUs?

Yes, PyTorch will use all CPU cores if it can parallelize the operation accordingly.
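A small aside (not from the original reply): the intra-op thread count can be inspected and tuned directly in PyTorch:

import torch

print(torch.get_num_threads())  # intra-op threads, typically the physical core count
torch.set_num_threads(4)        # can be lowered or raised explicitly if needed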

awaelchli · Mar 08 '24

I agree that this should not throw an exception, assuming the current documentation is correct. For example, the accelerator could come from an environment variable:

import os
from lightning.pytorch import Trainer

trainer = Trainer(
    accelerator=os.environ.get("DEVICE", "cpu"),
    devices=-1,  # use all devices
)

So when deploying the code to different machines, it would break for CPU. Maybe a warning would be more appropriate to notify the user that the configuration is not optimal.
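Until then, a possible stopgap (an illustrative sketch, not an official recommendation) is to branch on the accelerator before constructing the Trainer:

import os
from lightning.pytorch import Trainer

accelerator = os.environ.get("DEVICE", "cpu")
# -1 ("all devices") works for GPU/TPU but currently raises on CPU,
# where a single process already uses all cores anyway.
devices = 1 if accelerator == "cpu" else -1

trainer = Trainer(accelerator=accelerator, devices=devices)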

Would devices="auto" also use all devices?

jneuendorf · Mar 26 '24

Hi! Would like to work on this one.

@awaelchli, I wanted to clarify which approach is more logical here: changing the exception in the `accelerator="cpu"` case so that it is not thrown for `devices=-1` (as pointed out above), or keeping the current behavior and only adding more info to the `devices` section of the `trainer.py` docstring?

fisaenko · Aug 22 '24

Humbly, I think the solution should be to remove `List[int]` as an input type for the CPU accelerator. @awaelchli, please review this PR!
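A sketch of what that narrowing could look like (hypothetical; the actual PR may differ):

from typing import Union

def _parse_cpu_cores(cpu_cores: Union[int, str]) -> int:
    # List[int] removed from the signature: device indices have no meaning on CPU,
    # so only a positive int (or its string form) is accepted.
    if isinstance(cpu_cores, str) and cpu_cores.strip().isdigit():
        cpu_cores = int(cpu_cores)
    if not isinstance(cpu_cores, int) or cpu_cores <= 0:
        raise TypeError("`devices` selected with `CPUAccelerator` should be an int > 0.")
    return cpu_cores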

chualanagit · Nov 06 '24