Megatron-DeepSpeed icon indicating copy to clipboard operation
Megatron-DeepSpeed copied to clipboard

Calling IndexedDatasetBuilder directly with a best_fit datatype fails

Open adammoody opened this issue 2 years ago • 1 comments

I don't know whether this is intended to work or not, but I found the following program:

from megatron.data.indexed_dataset import IndexedDatasetBuilder, best_fitting_dtype

best_dtype = best_fitting_dtype(10_000)
IndexedDatasetBuilder("testfile", dtype=best_dtype)

leads to an error like:

  File "/path/to/Megatron-DeepSpeed.git/megatron/data/indexed_dataset.py", line 284, in __init__
    self.element_size = self.element_sizes[self.dtype]
KeyError: <class 'numpy.uint16'>

This shows up because best_fitting_dtype will return numpy.uint16 for small vocabs:

https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/c680954e0b13232abd5b72711f2032bea1ad65c9/megatron/data/indexed_dataset.py#L25-L27

but that particular type is missing from the element_sizes table.

https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/c680954e0b13232abd5b72711f2032bea1ad65c9/megatron/data/indexed_dataset.py#L268-L276

which is used in the IndexedDatasetBuilder constructor here:

https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/c680954e0b13232abd5b72711f2032bea1ad65c9/megatron/data/indexed_dataset.py#L284

Should something like this work?

If so, it seems like either the best_fitting_dtype function should return a different type or uint16 should be added to the element_sizes table like https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/55/commits/f706108acadb2d97dc77f066930accfbdde4942f.

adammoody avatar Aug 17 '21 18:08 adammoody

Good catch!

To automatically generate all the size for supported dtype we could have something like:

{ key: np.dtype(key).itemsize for key in dtypes.values()}

Notice also that np.float is supposed to be 8, as it's an alias for python float which is a C double ... Why that is I don't know: https://docs.python.org/3.8/library/stdtypes.html#:~:text=Floating%20point%20numbers%20are%20usually%20implemented%20using%20double%20in%20C%3B

We could even push best_fitting_dtype to go all the way to uint8 and to int64, but those are extreme cases and I don't think we're interested in them.

My guess is that IndexedDataset was never used in favor of mmap version (Even in the official repo). We should probably run this upstream also.

thomasw21 avatar Aug 18 '21 09:08 thomasw21