
Last entry in the dataset is causing "Relative sample index $x is not present" error

isidentical opened this issue on May 20, 2024 · 3 comments

Environment

  • OS: Ubuntu 20.04
  • Hardware (GPU, or instance type): H100

When I try to load a big dataset with thousands of shards (each shard is ~1 GB), some of those shards produce the following error:

[rank5]:   File "/home/ubuntu/xxx/.venv/lib/python3.10/site-packages/streaming/base/array.py", line 90, in __getitem__
[rank5]:     return self.get_item(at)
[rank5]:   File "/home/ubuntu/xxx/.venv/lib/python3.10/site-packages/streaming/base/dataset.py", line 1235, in get_item
[rank5]:     sample = shard[shard_sample_id]
[rank5]:   File "/home/ubuntu/xxx/.venv/lib/python3.10/site-packages/streaming/base/array.py", line 90, in __getitem__
[rank5]:     return self.get_item(at)
[rank5]:   File "/home/ubuntu/xxx/.venv/lib/python3.10/site-packages/streaming/base/format/base/reader.py", line 319, in get_item
[rank5]:     data = self.get_sample_data(idx)
[rank5]:   File "/home/ubuntu/xxx/.venv/lib/python3.10/site-packages/streaming/base/format/mds/reader.py", line 145, in get_sample_data
[rank5]:     raise IndexError(
[rank5]: IndexError: Relative sample index 5205 is not present in the shard.01000.mds file.

But when looking into the actual data files, the shards themselves seem correct (~1 GB each, and every sample can be indexed properly except the last one). Here is that shard's index:

{'column_encodings': ['str', 'jpeg', 'str', 'np16', 'uint8', 'np16'],
 'column_names': ['caption',
                  'image',
                  'key',
                  'sscd_embeddings',
                  't5_xl_embeddings',
                  'vae_256x256_latents'],
 'column_sizes': [None, None, None, None, None, None],
 'compression': None,
 'format': 'mds',
 'hashes': [],
 'raw_data': {'basename': 'shard.01000.mds', 'bytes': 1073575555, 'hashes': {}},
 'samples': 5206,
 'size_limit': 1073741824,
 'version': 2,
 'zip_data': None}

As you can see, the index says there are 5206 samples, which makes sample index 5205 the last item. When I read the shard's offsets table manually, I see the following values:

>>> import numpy as np
>>> filename = "shard.01000.mds"
>>> offset = (1 + 5205) * 4
>>> with open(filename, 'rb', 0) as fp:
...     fp.seek(offset)
...     pair = fp.read(8)
...     begin, end = np.frombuffer(pair, np.uint32)
... 
20824
>>> begin
1073575555
>>> end
1868767867
>>> end - begin
795192312

That difference is clearly an invalid length: begin (1073575555) is exactly the size of the file (it equals raw_data['bytes']), so there is nothing after it:

>>> with open(filename, 'rb', 0) as fp:
...     fp.seek(1073575555)
...     data = fp.read()
... 
1073575555
>>> data
b''
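
Since there are thousands of shards, here is a minimal sketch I used to scan for other affected files. check_shard is my own helper; it just assumes the same layout the manual read above relies on (a leading little-endian uint32 sample count, then count + 1 absolute uint32 offsets):

import os
import sys

import numpy as np

def check_shard(path):
    """Report samples whose recorded offsets fall outside the shard file."""
    size = os.path.getsize(path)
    with open(path, 'rb') as fp:
        # Leading uint32: number of samples in this shard.
        count = int(np.frombuffer(fp.read(4), np.uint32)[0])
        # count + 1 absolute uint32 offsets; entries i and i + 1 bound sample i.
        offsets = np.frombuffer(fp.read(4 * (count + 1)), np.uint32)
    bad = [i for i in range(count)
           if offsets[i] >= size or offsets[i + 1] > size]
    if bad:
        print(f'{path}: {count} samples, {size} bytes, bad sample indices: {bad}')
    return bad

for path in sys.argv[1:]:
    check_shard(path)

Running it as python check_shards.py shard.*.mds should flag exactly the shards with this problem (for shard.01000.mds, sample 5205).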

I am assuming this happened because the sample didn't fit within the size limit but still got counted towards this index (since size_limit - 1073575555 = 166269 bytes is too small to fit anything), somehow? Either way, this makes the dataset unusable. I will try to manually fix the index, but I'm just making you aware that this is a problem.
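
In case it's useful, this is roughly the manual fix I have in mind: drop the phantom last sample by patching the leading sample count in the shard file and the matching 'samples' field in the split's index.json (which I'm assuming stores these entries under a 'shards' list). This is just a sketch under those assumptions; drop_last_sample is my own helper, and I haven't verified that nothing else depends on the stale trailing offsets entry:

import json
import os
import struct

def drop_last_sample(shard_path, index_path):
    # Decrement the leading little-endian uint32 sample count in the shard.
    with open(shard_path, 'r+b') as fp:
        count = struct.unpack('<I', fp.read(4))[0]
        fp.seek(0)
        fp.write(struct.pack('<I', count - 1))

    # Patch the matching shard entry in index.json to agree.
    with open(index_path) as fp:
        index = json.load(fp)
    basename = os.path.basename(shard_path)
    for info in index['shards']:
        if info['raw_data']['basename'] == basename:
            info['samples'] -= 1
    with open(index_path, 'w') as fp:
        json.dump(index, fp)

drop_last_sample('shard.01000.mds', 'index.json')

The offsets themselves shouldn't need touching: sample 5204's end offset (1073575555) already coincides with the end of the file, so the first 5205 samples should remain readable.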

isidentical · May 20 '24 02:05