Otter
Is it correct to set up fsdp for a machine (V100) that does not support bf16?
compute_environment: LOCAL_MACHINE
distributed_type: no
downcast_bf16: false
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: 20687
Yes, it seems correct!
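For readers checking their own hardware: the config above sets mixed_precision to fp16 because the V100 (CUDA compute capability 7.0) has no native bf16 support, which arrived with Ampere (compute capability 8.0). Below is a minimal sketch of that check. The helper names `pick_mixed_precision` and `detect_mixed_precision` are hypothetical and are not part of Accelerate; the optional hardware query assumes PyTorch is installed.

```python
def pick_mixed_precision(capability):
    """Return a mixed_precision value for a GPU.

    capability is a (major, minor) CUDA compute capability tuple,
    e.g. (7, 0) for a V100 or (8, 0) for an A100.
    """
    major, _minor = capability
    # Assumption: native bf16 needs compute capability 8.0 (Ampere) or newer.
    return "bf16" if major >= 8 else "fp16"


def detect_mixed_precision():
    """Query GPU 0 through PyTorch if possible; return None otherwise."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return pick_mixed_precision(torch.cuda.get_device_capability(0))


print(pick_mixed_precision((7, 0)))  # V100 -> fp16
print(pick_mixed_precision((8, 0)))  # A100 -> bf16
```

The value returned is what would go into the mixed_precision field of the config; on a V100 that is fp16, matching the file posted above.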
OK, thank you. I also want to ask about another problem: the main thread's memory usage is higher than that of the other threads, and it overflows. How should I solve this? Do you have any suggestions?
(Email reply to Li's message of 09/14/2023 12:10 on Luodian/Otter Issue #274.)
I think you can refer to this link to see if it helps:
https://github.com/huggingface/accelerate/blob/6b3e559926afc4b9a127eb7762fc523ea0ea656a/src/accelerate/big_modeling.py#L514
I know that you may be able to set device_map=balanced_low_0 to decrease GPU memory usage on rank 0 (rank 0 performs the gather operations, and sometimes other parameters are shifted to rank 0, which can lead to OOM).
I have seen some code do this before, but I have not used it myself, so you may want to read up on the device_map mechanism and how to set it. You are also welcome to share your experience with us, to help more users tackle this problem on V100 GPUs~
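As a rough illustration of the suggestion above, here is a sketch of the keyword arguments one might pass to a Hugging Face style `from_pretrained` call. The helper `build_load_kwargs` is hypothetical, and the memory caps are placeholder numbers rather than measured values for Otter; `device_map` and `max_memory` are the argument names used by Accelerate's big-model loading.

```python
# The four strategy strings discussed in this thread.
VALID_DEVICE_MAPS = {"auto", "balanced", "balanced_low_0", "sequential"}


def build_load_kwargs(device_map="balanced_low_0", num_gpus=3,
                      gpu0_gib=8, other_gib=14, cpu_gib=32):
    """Build loading kwargs that keep GPU 0 lighter than the others.

    All memory figures are illustrative assumptions; tune them to the
    actual free memory on each card.
    """
    if device_map not in VALID_DEVICE_MAPS:
        raise ValueError(f"unknown device_map strategy: {device_map!r}")
    # Optional per-device caps: a lower cap on GPU 0 leaves it headroom.
    max_memory = {
        i: f"{gpu0_gib if i == 0 else other_gib}GiB" for i in range(num_gpus)
    }
    max_memory["cpu"] = f"{cpu_gib}GiB"
    return {"device_map": device_map, "max_memory": max_memory}


kwargs = build_load_kwargs()
print(kwargs)
# Hypothetical usage (model class and checkpoint path are placeholders):
# model = SomeModelClass.from_pretrained("path/to/checkpoint",
#                                        torch_dtype=torch.float16, **kwargs)
```

Passing `max_memory` alongside `balanced_low_0` is optional; it is included here only to show how GPU 0 could be capped explicitly if the strategy string alone does not free enough memory.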
Thank you for the suggestions, I will try them.
I tried setting device_map to 'auto', 'balanced', 'balanced_low_0' and 'sequential' in turn. Unfortunately, every setting still runs out of memory on 3 V100s (with the ViT unfrozen). Of these, I think balanced_low_0 might work if I had enough cards; I will try it further if I get 4 V100s.