streaming example requested: debug individual shard for poison pills

I sometimes hit "poison pills" inside an MDS dataset. Is there documentation on how to load just one shard and traverse to the pill, without the load mechanisms that might throw on error?

E.g., I get an error like:

IndexError: Relative sample index 85 is not present in the 17/shard.00085.mds file.

I'd like to be able to easily "peek" at this data without having to open the entire dataset. (I'm also not clear how to translate that error to the absolute position of my sample in the entire dataset.)

Sorry if I missed this in existing docs.

Aug 25 '23 19:08 mooreniemi

Oh, no!

A serialized streaming dataset consists of two things, which must be in agreement:

an index.json file which contains essential metadata about each shard (including how many samples they contain, which is used to determine the global sample ID space), and
each shard (e.g., 00000.mds).

If you encounter the error "Relative sample index ... is not present", at a low level this means a sample read unexpectedly seeked past the end of a shard file. At a high level, this unfortunately means the dataset is inconsistent (invalid).

There are several possible reasons for this:

The header data of the shard became corrupted
The shard file was truncated
The shard file was completely overwritten, but the corresponding metadata entry in the dataset's index was not updated -- say, you started writing a streaming dataset to the exact same path as an existing streaming dataset, but this job did not complete fully -- perhaps you were saving to remote storage, and the job encountered network errors.
(Less commonly) mistakes while manually creating or recovering an index.json file for some shards.
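For the truncation case, one rough way to look at a shard directly, without going through the normal loading path, is to read its header. The sketch below ASSUMES an uncompressed MDS shard starts with a little-endian uint32 sample count followed by (count + 1) uint32 byte offsets, with the last offset equal to the file size. That layout is an assumption, not something confirmed in this thread, and inspect_mds_shard is a hypothetical helper, so treat the result as a hint only.

```python
import os
import struct
from typing import Tuple


def inspect_mds_shard(path: str) -> Tuple[int, int, int]:
    """Return (num_samples, last_offset, file_size) for an uncompressed shard.

    ASSUMED layout (not confirmed by the thread): a little-endian uint32
    sample count, then (count + 1) uint32 byte offsets into the file.
    """
    file_size = os.path.getsize(path)
    with open(path, 'rb') as f:
        raw = f.read(4)
        if len(raw) < 4:
            raise ValueError(f'{path}: too short to hold a sample count')
        (num_samples,) = struct.unpack('<I', raw)
        table_bytes = 4 * (num_samples + 1)
        table = f.read(table_bytes)
        if len(table) < table_bytes:
            raise ValueError(f'{path}: offset table is cut off')
        offsets = struct.unpack(f'<{num_samples + 1}I', table)
    # Under the assumed layout, the last offset marks the end of the last sample.
    return num_samples, offsets[-1], file_size
```

If the last offset is larger than the file size, the shard was probably truncated; if the sample count disagrees with the index, the shard and the index were probably written by different runs.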

How to identify dataset corruption/peek at data:

The index.json, because it contains metadata about all the shards, is always the last file to be written. If any shard file has a timestamp more recent than the index, the dataset must have problems.
The index.json shard metadata contains the size in bytes of every file. Verify that the actual file sizes match the recorded sizes.
Optionally, shards can be hashed when written, and the resulting checksums saved in the shard metadata in the index.json file. Verify that the actual file hashes match the recorded hashes.
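The timestamp and size checks above could be scripted roughly as follows for a local copy of the dataset. This is a sketch, not part of the library: check_shards is a hypothetical helper, and it assumes each shard entry has raw_data and/or zip_data dicts with basename and bytes keys.

```python
import json
import os
from typing import List


def check_shards(dataset_dir: str) -> List[str]:
    """Compare local shard files against index.json and list any problems."""
    index_path = os.path.join(dataset_dir, 'index.json')
    with open(index_path) as f:
        index = json.load(f)
    index_mtime = os.path.getmtime(index_path)

    problems = []
    for shard_id, shard in enumerate(index['shards']):
        found = False
        # ASSUMPTION: file metadata lives under 'raw_data' and/or 'zip_data'.
        for key in ('raw_data', 'zip_data'):
            info = shard.get(key)
            if not info:
                continue
            path = os.path.join(dataset_dir, info['basename'])
            if not os.path.exists(path):
                continue
            found = True
            actual = os.path.getsize(path)
            if actual != info['bytes']:
                problems.append(
                    f'shard {shard_id}: {path} is {actual} bytes, '
                    f'index says {info["bytes"]}')
            if os.path.getmtime(path) > index_mtime:
                problems.append(f'shard {shard_id}: {path} is newer than index.json')
        if not found:
            problems.append(f'shard {shard_id}: no shard file found locally')
    return problems
```

Note that copying or downloading files usually resets their modification times, so the timestamp check is only meaningful on the storage the dataset was originally written to.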

Example jq commands:

jq '.shards[].raw_data.basename' index.json
jq '.shards[].samples' index.json
jq '.shards[].zip_data.hashes.sha1' index.json
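If hashes were recorded at write time, the comparison could be done in Python along these lines. This is a sketch under assumptions: check_hashes is a hypothetical helper, the hashes are assumed to be stored as hex digests keyed by algorithm name under each file's hashes dict, and only a few common hashlib algorithms are checked.

```python
import hashlib
import json
import os
from typing import List

# ASSUMPTION: only these common hashlib algorithms are verified here.
CHECKED_ALGOS = {'md5', 'sha1', 'sha256', 'sha512'}


def check_hashes(dataset_dir: str) -> List[str]:
    """Recompute recorded hashes for local shard files and list mismatches."""
    with open(os.path.join(dataset_dir, 'index.json')) as f:
        index = json.load(f)

    mismatches = []
    for shard_id, shard in enumerate(index['shards']):
        for key in ('raw_data', 'zip_data'):
            info = shard.get(key)
            if not info:
                continue
            path = os.path.join(dataset_dir, info['basename'])
            if not os.path.exists(path):
                continue
            with open(path, 'rb') as f:
                data = f.read()
            for algo, expected in (info.get('hashes') or {}).items():
                if algo not in CHECKED_ALGOS:
                    continue  # skip algorithms this sketch does not handle
                actual = hashlib.new(algo, data).hexdigest()
                if actual != expected:
                    mismatches.append(f'shard {shard_id}: {path} {algo} mismatch')
    return mismatches
```

An empty result only means that the files present locally match their recorded hashes; shards written without hashes are not checked at all.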

Aug 25 '23 20:08 knighton

Hmm, none of the possibilities you list immediately jump out at me, but I will look deeper. (I'll check the size vs. the size recorded in the index for inconsistency; the shard itself is consistent in size with the rest.) To write the MDS data, I follow the official example except that I write it directly to S3 (via Ray). Perhaps the write to S3 failed somehow. I can look into this further.

I actually also have cases where I see invalid json decode errors, so would you mind providing examples that answer the other spirit of the question I asked, even though it is not quite the question that needed to be answered wrt IndexError specifically? (Thanks for answering that one so thoroughly!)

That is, say I know shard n contains bad sample at x (or around x), how can I peek at x?

Aug 25 '23 20:08 mooreniemi

Hey, sorry I missed this.

Perhaps the write to S3 failed somehow

Yeah, my best guess is there may have been an exception during multi-threaded shard uploading (during writing MDS dataset to S3) that wasn't propagated properly or something of that nature.

I actually also have cases where I see invalid json decode errors

Oof. Where are you getting invalid JSON decode errors?

Places where Streaming uses JSON:

index.json is in JSON
JSONL shards (JSONL file paired with a corresponding index file containing the offset of each line in bytes in the JSONL for instant sample lookup)
The fields of JSONL shards
json-encoded fields of MDS shards
MDS shards themselves: MDS is a binary serialization format in which each shard embeds, as a flat JSON dict, some critical metadata needed for recovering a lost index.json
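For the JSONL case, a plain scan of the file is often enough to locate the bad line. The sketch below assumes an uncompressed JSONL shard with one JSON object per line, so that the zero-based line number corresponds to the relative sample index; find_bad_jsonl_lines is a hypothetical helper, not a library function.

```python
import json
from typing import List, Tuple


def find_bad_jsonl_lines(path: str) -> List[Tuple[int, str]]:
    """Return (zero-based line number, error message) for each undecodable line."""
    bad = []
    with open(path, 'rb') as f:
        for line_no, raw in enumerate(f):
            try:
                json.loads(raw.decode('utf-8'))
            except ValueError as err:
                # Covers both invalid UTF-8 and invalid JSON (blank lines too).
                bad.append((line_no, str(err)))
    return bad
```

Reading in binary mode and decoding each line separately means a single line with invalid UTF-8 is reported instead of aborting the whole scan.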

That is, say I know shard n contains bad sample at x (or around x), how can I peek at x?

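from streaming import StreamingDataset

shard_id = 0         # replace with n, the shard's position in the index
shard_sample_id = 0  # replace with x, the sample's position within shard n
# Pass any other StreamingDataset arguments your dataset needs.
dataset = StreamingDataset(remote='s3://path/to/remote', local='/path/to/local')
sample_id = dataset.sample_offset_per_shard[shard_id] + shard_sample_id
sample = dataset[sample_id]  # -> Dict[str, Any]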
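The same offset arithmetic can be done from index.json alone, which also answers how a relative index maps to an absolute position without opening the dataset. This is a sketch: absolute_sample_id is a hypothetical helper, and it assumes each shard entry has a top-level samples count and that the shard ID is the entry's position in the shards list.

```python
def absolute_sample_id(index: dict, shard_id: int, shard_sample_id: int) -> int:
    """Map (shard, relative sample index) to a global sample index."""
    shards = index['shards']
    # ASSUMPTION: each shard dict stores its sample count under 'samples'.
    if not 0 <= shard_sample_id < shards[shard_id]['samples']:
        raise IndexError(
            f'sample {shard_sample_id} is outside shard {shard_id}, which '
            f'has {shards[shard_id]["samples"]} samples according to the index')
    # Global ID = samples in all earlier shards + position within this shard.
    return sum(s['samples'] for s in shards[:shard_id]) + shard_sample_id


# Tiny made-up index, just to show the arithmetic.
index = {'shards': [{'samples': 100}, {'samples': 50}, {'samples': 75}]}
print(absolute_sample_id(index, shard_id=2, shard_sample_id=10))  # 100 + 50 + 10
```

For a real dataset, load the dict with json.load on the dataset's index.json instead of using the made-up one.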

Aug 30 '23 00:08 knighton

If this is helpful for debugging, note that an index.json file is essentially just a list of dicts (stored under its "shards" key), where each dict is the metadata for one shard, such as its filename and how many samples it has. You can add shards to it, remove all the shards but the one you are interested in, etc., and the resulting dataset will reflect whatever the index.json says it is.
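Cutting an index down to a single shard could look roughly like this. It is a sketch that assumes the shard entries live under a top-level shards key; keep_only_shard is a hypothetical helper, not a library function.

```python
def keep_only_shard(index: dict, shard_id: int) -> dict:
    """Return a copy of the index that lists only one shard."""
    trimmed = dict(index)  # shallow copy keeps other top-level keys as they are
    trimmed['shards'] = [index['shards'][shard_id]]
    return trimmed
```

The trimmed dict would then be saved as index.json in a scratch directory next to a copy of that one shard file, and the dataset opened with its local path pointing there. In that one-shard dataset, the relative sample index and the global sample index are the same number.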

Aug 30 '23 00:08 knighton

Hi @mooreniemi, did @knighton's suggestions help you debug your issue?

Sep 11 '23 15:09 karan6181

Closing out this issue as it has been inactive for a while.

May 29 '24 19:05 snarayan21
