example requested: debug individual shard for poison pills

Open mooreniemi opened this issue 1 year ago • 5 comments

I sometimes hit "poison pills" inside an MDS dataset. Is there documentation on how to load just one shard and traverse to the pill, without the load mechanisms that might throw on error?

E.g., I get an error like:

IndexError: Relative sample index 85 is not present in the 17/shard.00085.mds file.

I'd like to be able to easily "peek" at this data without having to open the entire dataset. (I'm also not clear how to translate that error to the absolute position of my sample in the entire dataset.)

Sorry if I missed this in existing docs.

mooreniemi avatar Aug 25 '23 19:08 mooreniemi

Oh, no!

A serialized streaming dataset consists of two things, which must be in agreement:

  1. an index.json file which contains essential metadata about each shard (including how many samples they contain, which is used to determine the global sample ID space), and
  2. each shard (e.g., 00000.mds).
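To make the "global sample ID space" concrete, here is a minimal sketch (not taken from the Streaming codebase) of how per-shard sample counts could map to global IDs. It assumes each shard entry in index.json carries a top-level "samples" count; the helper name shard_offsets and the toy numbers are hypothetical.

```python
import json


def shard_offsets(index: dict) -> list:
    """Return the global ID of the first sample in each shard.

    Assumption: every entry under "shards" has a top-level "samples" count.
    """
    offsets, total = [], 0
    for shard in index["shards"]:
        offsets.append(total)  # shard starts after all earlier shards' samples
        total += shard["samples"]
    return offsets


# Toy index with three shards (made-up numbers).
index = json.loads('{"shards": [{"samples": 100}, {"samples": 100}, {"samples": 42}]}')
offsets = shard_offsets(index)
# Global ID of relative sample 85 in the shard at position 1:
global_id = offsets[1] + 85
```

If a shard's recorded count is larger than what the file really holds, every global ID that maps past the end of that file will fail in the way described next.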

If you encounter the error Relative sample index ... is not present, at a low level it means a sample read tried to seek past the end of a shard file. At a high level, it means the dataset is unfortunately inconsistent (invalid).

There are several possible reasons for this:

  1. The header data of the shard became corrupted
  2. The shard file was truncated
  3. The shard file was completely overwritten, but the corresponding metadata entry in the dataset's index was not updated. For example, you started writing a streaming dataset to the exact same path as an existing streaming dataset, but the job did not complete, perhaps because it was saving to remote storage and encountered network errors.
  4. (Less commonly) mistakes while manually creating or recovering an index.json file for some shards.

How to identify dataset corruption/peek at data:

  1. The index.json, because it contains metadata about all the shards, is always the last file to be written. If any shard file has a timestamp more recent than the index, the dataset must have problems.
  2. The index.json shard metadata contains the size in bytes of every file. Verify that each file's actual size matches the recorded size.
  3. Optionally, shards can be hashed when written, and the resulting checksums saved in the shard metadata in the index.json file. Verify that each file's actual hash matches the recorded hash.

Example jq commands:

cat index.json | jq '.shards[].raw_data.basename'
cat index.json | jq '.shards[].samples'
cat index.json | jq '.shards[].zip_data.hashes.sha1'
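The three checks above could be automated roughly as follows. This is a sketch under assumptions, not an official tool: it assumes a complete local copy of the dataset directory, and that each shard entry holds "raw_data" (and optionally "zip_data") dicts with "basename", "bytes", and "hashes" keys. The function name check_shards is hypothetical. Timestamps are only meaningful on the machine that wrote the dataset, not on a downloaded copy.

```python
import hashlib
import json
import os


def check_shards(dataset_dir: str) -> list:
    """Compare shard files on disk with their entries in index.json."""
    index_path = os.path.join(dataset_dir, "index.json")
    with open(index_path) as f:
        index = json.load(f)
    index_mtime = os.path.getmtime(index_path)

    problems = []
    for i, shard in enumerate(index["shards"]):
        found = False
        # A shard may exist as a raw file, a compressed file, or both.
        for key in ("raw_data", "zip_data"):
            info = shard.get(key)
            if not info:
                continue
            path = os.path.join(dataset_dir, info["basename"])
            if not os.path.exists(path):
                continue
            found = True
            size = os.path.getsize(path)
            if size != info["bytes"]:
                problems.append(f"shard {i}: size {size} != recorded {info['bytes']}")
            if os.path.getmtime(path) > index_mtime:
                problems.append(f"shard {i}: modified after index.json")
            for algo, expected in (info.get("hashes") or {}).items():
                # Skip hash names that hashlib cannot compute directly.
                if algo not in hashlib.algorithms_guaranteed or algo.startswith("shake"):
                    continue
                with open(path, "rb") as f:
                    actual = hashlib.new(algo, f.read()).hexdigest()
                if actual != expected:
                    problems.append(f"shard {i}: {algo} mismatch")
        if not found:
            problems.append(f"shard {i}: no file found on disk")
    return problems
```

An empty list means none of these checks found an inconsistency; it does not prove that the sample data inside each shard is valid.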

knighton avatar Aug 25 '23 20:08 knighton

Hmm, none of the possibilities you list out immediately jump out to me but I will look deeper. (I'll check the size vs index recorded size for inconsistency; the shard itself is consistent in size with the rest.) To write the MDS data, I follow the official example except writing it directly to S3 (via Ray). Perhaps the write to S3 failed somehow. I can look into this further.

I actually also have cases where I see invalid json decode errors, so would you mind providing examples that answer the other spirit of the question I asked, even though it is not quite the question that needed to be answered wrt IndexError specifically? (Thanks for answering that one so thoroughly!)

That is, say I know shard n contains bad sample at x (or around x), how can I peek at x?

mooreniemi avatar Aug 25 '23 20:08 mooreniemi

Hey, sorry I missed this.

> Perhaps the write to S3 failed somehow

Yeah, my best guess is there may have been an exception during multi-threaded shard uploading (during writing MDS dataset to S3) that wasn't propagated properly or something of that nature.

> I actually also have cases where I see invalid json decode errors

Oof. Where are you getting invalid JSON decode errors?

Places where Streaming uses JSON:

  • index.json is in JSON
  • JSONL shards (JSONL file paired with a corresponding index file containing the offset of each line in bytes in the JSONL for instant sample lookup)
  • The fields of JSONL shards
  • json-encoded fields of MDS shards
  • MDS shards themselves: MDS is a binary serialization format, but each shard embeds, as a flat JSON dict, the critical metadata needed to recover a lost index.json

> That is, say I know shard n contains bad sample at x (or around x), how can I peek at x?

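from streaming import StreamingDataset

shard_id = 0         # placeholder: n, the shard's position in index.json
shard_sample_id = 0  # placeholder: x, the sample's position within shard n
dataset = StreamingDataset(remote='s3://path/to/remote', local='/path/to/local')  # plus any other arguments you normally pass
# Global sample ID = ID of the shard's first sample + position within the shard.
sample_id = int(dataset.sample_offset_per_shard[shard_id] + shard_sample_id)
sample = dataset[sample_id]  # -> Dict[str, Any]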
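If going through StreamingDataset is exactly what throws, one alternative is to read the shard file directly. The sketch below is not part of the Streaming API. It relies on an ASSUMED layout for an uncompressed MDS shard (a uint32 sample count, then count + 1 uint32 absolute byte offsets, then the JSON header, then the sample bytes, all little-endian), which should be checked against the installed Streaming version before trusting the result. The function name read_raw_sample is hypothetical.

```python
import os
import struct


def read_raw_sample(shard_path: str, idx: int) -> bytes:
    """Return the undecoded bytes of sample `idx` from one uncompressed shard.

    ASSUMED layout (little-endian uint32 values):
      [num_samples][offset_0 ... offset_num_samples][JSON header][sample bytes]
    where each offset is an absolute byte position within the file.
    """
    file_size = os.path.getsize(shard_path)
    with open(shard_path, "rb") as f:
        (num_samples,) = struct.unpack("<I", f.read(4))
        if not 0 <= idx < num_samples:
            raise IndexError(f"sample {idx} out of range; header says {num_samples} samples")
        # Offsets start right after the 4-byte sample count.
        f.seek(4 + 4 * idx)
        begin, end = struct.unpack("<II", f.read(8))
        if not begin <= end <= file_size:
            raise ValueError(f"sample {idx} spans bytes {begin}-{end}, but file has {file_size}")
        f.seek(begin)
        return f.read(end - begin)
```

Calling this in a loop over idx would show the first sample whose byte range falls outside the file, which is one way a truncated shard shows up. The returned bytes are still in the shard's column encoding, so this only confirms that a sample can be located and read, not that its fields decode cleanly. A compressed shard would need to be decompressed first.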

knighton avatar Aug 30 '23 00:08 knighton

If this is helpful for debugging, note that an index.json file is essentially just a list of dicts (stored under its "shards" key), where each dict is the metadata for one shard, such as its filename and how many samples it has. You can add shards to it, remove all the shards except the one you are interested in, and so on, and the resulting dataset will reflect whatever the index.json says.
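A minimal sketch of the "keep only one shard" idea, working on a copy so the original dataset is untouched. It assumes index.json is a JSON object with a "shards" list whose entries name their files under "raw_data" / "zip_data" -> "basename". The function name isolate_shard is hypothetical.

```python
import json
import os
import shutil


def isolate_shard(src_dir: str, dst_dir: str, shard_id: int) -> None:
    """Write a one-shard copy of the dataset in src_dir to dst_dir."""
    with open(os.path.join(src_dir, "index.json")) as f:
        index = json.load(f)
    shard = index["shards"][shard_id]
    index["shards"] = [shard]  # every other top-level key is kept unchanged

    os.makedirs(dst_dir, exist_ok=True)
    for key in ("raw_data", "zip_data"):
        info = shard.get(key)
        if not info:
            continue
        src = os.path.join(src_dir, info["basename"])
        if os.path.exists(src):
            dst = os.path.join(dst_dir, info["basename"])
            # Basenames may include a subdirectory, so create it if needed.
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)
    with open(os.path.join(dst_dir, "index.json"), "w") as f:
        json.dump(index, f)
```

Pointing StreamingDataset at dst_dir as its local directory should then expose only that shard, so relative sample x of the original shard becomes sample x of the small dataset.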

knighton avatar Aug 30 '23 00:08 knighton

Hi @mooreniemi, did @knighton's suggestions help you debug your issue?

karan6181 avatar Sep 11 '23 15:09 karan6181

Closing out this issue as it has been inactive for a while.

snarayan21 avatar May 29 '24 19:05 snarayan21