Megatron-DeepSpeed
Megatron-DeepSpeed copied to clipboard
Calling IndexedDatasetBuilder directly with a best_fit datatype fails
I don't know whether this is intended to work or not, but I found the following program:
from megatron.data.indexed_dataset import IndexedDatasetBuilder, best_fitting_dtype
best_dtype = best_fitting_dtype(10_000)
IndexedDatasetBuilder("testfile", dtype=best_dtype)
leads to an error like:
File "/path/to/Megatron-DeepSpeed.git/megatron/data/indexed_dataset.py", line 284, in __init__
self.element_size = self.element_sizes[self.dtype]
KeyError: <class 'numpy.uint16'>
This shows up because best_fitting_dtype
will return numpy.uint16
for small vocabs:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/c680954e0b13232abd5b72711f2032bea1ad65c9/megatron/data/indexed_dataset.py#L25-L27
but that particular type is missing from the element_sizes
table.
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/c680954e0b13232abd5b72711f2032bea1ad65c9/megatron/data/indexed_dataset.py#L268-L276
which is used in the IndexedDatasetBuilder
constructor here:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/c680954e0b13232abd5b72711f2032bea1ad65c9/megatron/data/indexed_dataset.py#L284
Should something like this work?
If so, it seems like either the best_fitting_dtype
function should return a different type or uint16
should be added to the element_sizes
table like https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/55/commits/f706108acadb2d97dc77f066930accfbdde4942f.
Good catch!
To automatically generate all the size for supported dtype we could have something like:
{ key: np.dtype(key).itemsize for key in dtypes.values()}
Notice also that np.float
is supposed to be 8, as it's an alias for python float
which is a C double
... Why that is I don't know:
https://docs.python.org/3.8/library/stdtypes.html#:~:text=Floating%20point%20numbers%20are%20usually%20implemented%20using%20double%20in%20C%3B
We could even push best_fitting_dtype
to go all the way to uint8
and to int64
, but those are extreme cases and I don't think we're interested in them.
My guess is that IndexedDataset was never used in favor of mmap version (Even in the official repo). We should probably run this upstream also.