Results 23 comments of matthewdeng

The `invalid_row_handler` issue should now also be resolved on master via https://github.com/ray-project/ray/issues/28326. Note, however, that PyArrow 7.0 (the version that introduced `invalid_row_handler`) is not compatible with Datasets.

@pcmoritz the issue is being tracked in https://github.com/ray-project/ray/issues/22310, @clarkzinzow can shed more light on the forward fix for PyArrow 10.

```
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 840, in _wait_and_handle_event
    trial, result[_ExecutorEvent.KEY_FUTURE_RESULT]
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 964, in _on_training_result
    self._process_trial_results(trial, result)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1048, in _process_trial_results
    decision =...
```

TensorFlow prefetching may be sufficient.
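For context, prefetching means loading upcoming batches in the background while the current one is being consumed; in TensorFlow this is `tf.data.Dataset.prefetch(tf.data.AUTOTUNE)`. A framework-free sketch of the same idea, using only the stdlib (function names here are illustrative):

```python
import queue
import threading
import time

def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable`, loading up to `buffer_size` items
    ahead on a background thread (the same idea as tf.data prefetch)."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for item in iterable:
            q.put(item)
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not sentinel:
        yield item

def slow_batches(n):
    for i in range(n):
        time.sleep(0.01)  # simulate I/O-bound batch loading
        yield i

# Batches keep loading in the background while the consumer works.
print(list(prefetch(slow_batches(5))))  # [0, 1, 2, 3, 4]
```

The buffer size bounds how far ahead the loader can run, which caps the extra memory used while still hiding loading latency.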

This seems to be an issue with PyTorch/Windows and not Ray.

@showkeyjar do you have a repro for this? How much training data are you loading and how much disk space are you seeing consumed?

Hello @pacman100 @SunMarc, could you review this issue? Thanks so much!

Oops, sorry for including that part. The same behavior can be seen with `torchrun`.

**`script.py`:**

```python
import torch.distributed
from transformers import AutoModel, TrainingArguments

torch.distributed.init_process_group(backend="nccl")

deepspeed_config = {
    "zero_optimization": {
        "stage":...
```

@davzaman yeah that sounds great! Closing this PR and creating a new one with the remaining changes makes sense to me.