Light-V

Reporting two issues found in Light-V.

Issue 1: When using distributed training, processes with local_rank != 0 never call torch.distributed.barrier(), so the rank-0 process blocks at its own barrier indefinitely and training deadlocks.
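
A minimal sketch of the deadlock and the usual fix, assuming a standard torch.distributed setup launched with torchrun; prepare_dataset() is a hypothetical placeholder for whatever one-time work rank 0 performs. The key point is that barrier() is a collective, so every rank must reach it:

```python
import os
import torch
import torch.distributed as dist

def prepare_dataset():
    # Hypothetical one-time preprocessing (download, tokenize, cache, ...)
    pass

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Buggy pattern: only rank 0 reaches the barrier, so it waits forever.
    # if local_rank == 0:
    #     prepare_dataset()
    #     dist.barrier()  # no matching barrier() on other ranks -> deadlock

    # Fixed pattern: every rank calls barrier(), so the collective completes.
    if local_rank != 0:
        dist.barrier()        # non-zero ranks wait while rank 0 preprocesses
    if local_rank == 0:
        prepare_dataset()     # one-time work done by rank 0 only
        dist.barrier()        # rank 0 releases the waiting ranks

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```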

Issue 2: I've made a small optimization to the data-loading process that reduces peak memory usage when handling large datasets. Previously, we were using np.array() to convert large datasets from h5py objects...
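
A sketch of the memory pattern being described, with hypothetical file and dataset names. np.array(ds) copies the entire HDF5 dataset into RAM at once, whereas slicing the h5py dataset handle reads only the requested rows per access (e.g., one batch at a time inside a Dataset's __getitem__):

```python
import h5py
import numpy as np

with h5py.File("features.h5", "r") as f:   # hypothetical file name
    ds = f["embeddings"]                    # hypothetical dataset name

    # Before: materializes the whole dataset in memory up front.
    # data = np.array(ds)                  # peak memory ~= full dataset size

    # After: keep the lazy h5py handle and read slices on demand.
    batch = ds[0:32]                        # only 32 rows are read into RAM;
                                            # h5py slicing returns an ndarray
    print(batch.shape, batch.dtype)
```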