[QUESTION] How to pre-build the dataset's index?

Open etiennemlb opened this issue 10 months ago • 1 comments

How can I pre-build the dataset's index?

I want to avoid using a compute node for this task:

> WARNING: could not find index map files, building the indices on rank 0 ...
> elasped time to build and save doc-idx mapping (seconds): 270.614145

etiennemlb, Apr 24 '24 13:04

You can use --data-cache-path to specify where the index files should be cached, and then precompute them ahead of time using a single node.

https://github.com/NVIDIA/Megatron-LM/blob/9de386d08770d7296263a590171ace4ae45348ad/megatron/training/arguments.py#L1349-L1350
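A minimal sketch of how that workflow could be scripted, assuming you already have the argument list for your training launch. Only the --data-cache-path flag comes from this thread; the helper names `with_data_cache_path` and `cache_is_populated` are hypothetical, and the sketch does not call into Megatron-LM itself. The first helper adds the flag to an existing argument list so the single-node pre-build run and the later full run point at the same directory; the second checks whether that directory already holds files before you submit the large job.

```python
from pathlib import Path


def with_data_cache_path(train_args, cache_dir):
    """Return a copy of train_args with --data-cache-path set to cache_dir.

    train_args is a list of command-line tokens for the training script.
    (Hypothetical helper; only the flag name comes from Megatron-LM.)
    """
    args = list(train_args)
    flag = "--data-cache-path"
    if flag in args:
        i = args.index(flag)
        if i + 1 < len(args):
            # Flag already present with a value: overwrite the value.
            args[i + 1] = str(cache_dir)
        else:
            # Flag present but missing its value: supply it.
            args.append(str(cache_dir))
    else:
        args += [flag, str(cache_dir)]
    return args


def cache_is_populated(cache_dir):
    """Return True if cache_dir exists and contains at least one file."""
    path = Path(cache_dir)
    return path.is_dir() and any(child.is_file() for child in path.iterdir())
```

One way to use it: build the argument list once, run the training script with it on a single node until the index files are written, then confirm `cache_is_populated(cache_dir)` before launching the multi-node job with the same list. This assumes the later run uses the same dataset settings as the pre-build run; if they differ, the indices may well be rebuilt instead of reused.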

ethanhe42, May 02 '24 20:05

Marking as stale. No activity in 60 days.

github-actions[bot], Jul 02 '24 18:07