
[QUESTION] Do we really need to call np.arange every time we restart the task?

Open zyksir opened this issue 1 year ago • 0 comments

When we first launch the task, we build the index for the dataset. On every restart after that, we just load the idx and npy files.

I noticed that in the function `_build_megatron_dataset_splits`, Megatron-LM calls `numpy.arange` every time. This code can be CPU-bound and lead to slow initialization. I don't see why `numpy.arange` needs to be called on every run; it seems the indices are only used when the index is built during the first run.
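To make the question concrete, here is a minimal sketch of the lazy pattern I have in mind. It is not Megatron-LM's actual code: `load_or_build_index`, `make_indices`, `build_index`, and the pickle cache file are hypothetical names used only for illustration. The idea is that the index range is passed as a callable and only materialized on a cache miss (in the real code the callable would presumably wrap the `numpy.arange` call).

```python
import os
import pickle
import tempfile
from typing import Callable, List, Sequence


def load_or_build_index(
    cache_path: str,
    make_indices: Callable[[], Sequence[int]],
    build_index: Callable[[Sequence[int]], List[int]],
) -> List[int]:
    """Return the cached index if it exists; otherwise build and cache it.

    Hypothetical helper: `make_indices` stands in for the numpy.arange call
    and is only invoked when the cache file is missing.
    """
    if os.path.exists(cache_path):
        # Restart path: load from disk, never materialize the index range.
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    # First-run path: materialize the range, build the index, write the cache.
    indices = make_indices()
    index = build_index(indices)
    with open(cache_path, "wb") as f:
        pickle.dump(index, f)
    return index


# Count how often the (potentially expensive) range is materialized.
calls = {"make_indices": 0}


def make_indices() -> List[int]:
    calls["make_indices"] += 1
    return list(range(1000))  # stand-in for numpy.arange(start, stop)


def reverse_index(idx: Sequence[int]) -> List[int]:
    # Stand-in for the real index-building step.
    return list(idx[::-1])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "split_index.pkl")
    first = load_or_build_index(path, make_indices, reverse_index)   # builds
    second = load_or_build_index(path, make_indices, reverse_index)  # loads

print(calls["make_indices"])
```

In this sketch the second call simulates a restart: it hits the cache, so the range is never rebuilt. Whether the same deferral is safe in Megatron-LM depends on whether anything else reads the indices after the cache is loaded, which is exactly what I am asking.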

zyksir, Sep 26 '24 04:09