chongxiaoc

Results 37 comments of chongxiaoc

> Now the concern is how to feed the validation dataset after each training epoch and which GPU will process the validation dataset if multiple GPUs are available.

Validation...
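One common answer to "which GPU processes validation" is: every rank evaluates its own shard of the validation set, then the per-rank metrics are averaged. A minimal pure-Python sketch of that idea (all names here are illustrative, not Horovod or Keras APIs; the final mean stands in for what an allreduce-average would compute):

```python
# Hypothetical sketch: shard a validation set across ranks and average
# per-rank metrics. `NUM_RANKS`, `val_rows`, and `evaluate_shard` are
# illustrative names, not library APIs.

NUM_RANKS = 2
val_rows = list(range(10))  # toy validation dataset

def shard(rows, rank, num_ranks):
    """Round-robin shard so every rank sees a disjoint subset."""
    return rows[rank::num_ranks]

def evaluate_shard(rows):
    """Stand-in metric: mean of the shard's values."""
    return sum(rows) / len(rows)

# Each rank evaluates its own shard; the final metric is the mean of the
# per-rank metrics (roughly what an allreduce with averaging would give).
per_rank = [evaluate_shard(shard(val_rows, r, NUM_RANKS)) for r in range(NUM_RANKS)]
val_metric = sum(per_rank) / len(per_rank)
```

This avoids pinning validation to a single GPU while the others idle.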

`steps_per_epoch` is a parameter of the Keras `fit` function that tells it how many batches to use per epoch on a single GPU. Users use this parameter to explicitly force the model to be...
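A common way to pick the value is to divide the global row count by the effective global batch size, so every GPU runs the same number of steps per epoch. A small sketch with made-up numbers:

```python
# Hedged sketch: `steps_per_epoch` caps how many batches each worker
# consumes per epoch. Dividing the global row count by
# (batch_size * num_gpus) gives every GPU the same step count.
# All numbers here are illustrative.

num_rows = 1_000_000   # total rows in the training dataset
batch_size = 500       # per-GPU batch size
num_gpus = 4

# Integer division: leftover rows that don't fill a full global batch
# are simply not used that epoch.
steps_per_epoch = num_rows // (batch_size * num_gpus)
```

The resulting value would then be passed as `model.fit(..., steps_per_epoch=steps_per_epoch)` in Keras.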

> It means it is not compulsory to mention, right?

Right. But users have to make sure every GPU has an equivalent amount of data to train on per epoch (depending on the data distribution). Horovod...
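One simple way to guarantee every GPU sees the same amount of data is to truncate all shards to the shortest one, since round-robin sharding can leave shards differing by one row. A sketch with illustrative helper names (this is not a Horovod API):

```python
# Sketch: after round-robin sharding, shards can differ by one row when
# the row count is not divisible by the number of ranks. Truncating every
# shard to the shortest length keeps the per-epoch step count identical
# on all GPUs. `equal_shards` is a hypothetical helper.

def equal_shards(rows, num_ranks):
    shards = [rows[r::num_ranks] for r in range(num_ranks)]
    min_len = min(len(s) for s in shards)
    return [s[:min_len] for s in shards]

shards = equal_shards(list(range(10)), 3)  # 10 rows across 3 ranks
```

Dropping at most `num_ranks - 1` rows per epoch is usually an acceptable price for keeping the workers in lockstep.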

Attaching some benchmark results using a synthetic dataset with PyTorch:
- Experiment setup: `BatchedDataLoader + make_batch_reader()`, batch_size=50000, shuffle_buffer_size=1000000, thread_pool, 10 workers.
- Datasets from 100M rows to 1.6B rows are tested....
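The usual way to express such benchmark numbers is rows per second over the whole loader iteration. A self-contained sketch of that measurement loop, with a synthetic generator standing in for the real `BatchedDataLoader + make_batch_reader()` pipeline (all names and sizes below are illustrative):

```python
# Hedged sketch of a throughput measurement: iterate a loader-like
# generator and report rows/sec. `synthetic_loader` is a stand-in for a
# real petastorm data loader, not its API.

import time

def synthetic_loader(num_batches, batch_size):
    for _ in range(num_batches):
        yield list(range(batch_size))  # stand-in for a real batch

start = time.perf_counter()
rows = 0
for batch in synthetic_loader(num_batches=20, batch_size=50_000):
    rows += len(batch)  # a training step would consume the batch here
elapsed = time.perf_counter() - start
rows_per_sec = rows / elapsed
```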

Had a discussion offline; starting to work on moving the bottleneck (the data shuffling part) into an asynchronous thread: https://github.com/uber/petastorm/blob/cf1159dc04416ed737eec25bcecef5d5aafa805a/petastorm/pytorch.py#L373 And see how it can hide/overlap the latency.
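The idea can be sketched in a few lines: a background thread maintains the shuffle buffer and pushes random items into a bounded queue, so the consumer (the training loop) overlaps with the shuffle work instead of blocking on it. This is a toy illustration of hiding the latency, not the petastorm implementation:

```python
# Minimal sketch: move shuffling into a background thread that fills a
# bounded queue. The bounded queue provides back-pressure on the producer;
# the consumer runs concurrently with the shuffle work.

import queue
import random
import threading

_SENTINEL = object()

def async_shuffler(source, out_q, buffer_size, seed=0):
    """Maintain a shuffle buffer and emit random items into out_q."""
    rng = random.Random(seed)
    buf = []
    for item in source:
        buf.append(item)
        if len(buf) >= buffer_size:
            # Pop a random element once the buffer is full.
            out_q.put(buf.pop(rng.randrange(len(buf))))
    while buf:  # drain the remaining buffer at end of epoch
        out_q.put(buf.pop(rng.randrange(len(buf))))
    out_q.put(_SENTINEL)

q = queue.Queue(maxsize=128)
t = threading.Thread(target=async_shuffler, args=(iter(range(100)), q, 16))
t.start()

results = []
while (item := q.get()) is not _SENTINEL:
    results.append(item)  # a training loop would consume batches here
t.join()
```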

Just noticed this nice work, @jarandaf, thanks! Is it doable to support shuffling in parallel as well? I think shuffling is usually the bottleneck, and petastorm uses a single thread...
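One possible shape for a parallel shuffle, sketched purely for illustration (this is not how petastorm does it): split the source into shards, shuffle each shard in its own thread, and merge the results into one queue. The shuffle quality differs from a single global shuffle, since rows only move within their shard:

```python
# Hedged sketch of parallelizing the shuffle across threads.
# `shard_shuffler` is a hypothetical helper, not a petastorm API.

import queue
import random
import threading

def shard_shuffler(shard, out_q, seed):
    rows = list(shard)
    random.Random(seed).shuffle(rows)  # independent per-thread shuffle
    for row in rows:
        out_q.put(row)

rows = list(range(1000))
num_threads = 4
out_q = queue.Queue()
threads = [
    threading.Thread(target=shard_shuffler, args=(rows[i::num_threads], out_q, i))
    for i in range(num_threads)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

shuffled = [out_q.get() for _ in range(len(rows))]
```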

Hmm, this is databricks-specific logic for model serialization and deserialization. Not familiar with this.

This is the issue: `Tue Feb 7 15:26:13 2023[0]:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [2,3]`
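The log line looks contradictory but may just reflect CUDA's renumbering: CUDA re-indexes the GPUs listed in `CUDA_VISIBLE_DEVICES`, so local rank 0 sees logical device 0, which is physical GPU 2 here. A small sketch of that mapping (illustrative parsing only; `physical_gpu` is a hypothetical helper):

```python
# Sketch: map a local rank to the physical GPU id it actually runs on,
# given a CUDA_VISIBLE_DEVICES string. CUDA renumbers the visible GPUs
# starting from logical index 0.

def physical_gpu(local_rank, cuda_visible_devices):
    visible = [int(g) for g in cuda_visible_devices.split(",")]
    return visible[local_rank]  # logical index -> physical GPU id

gpu = physical_gpu(0, "2,3")  # LOCAL_RANK 0 lands on physical GPU 2
```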