Is it possible to debug a DALI pipeline?
Hi,
I have a DALI pipeline with an external source reader, but it fails when reading some images. The error is related to the nvJPEG decoding process and is shown below:
```
Error when executing Mixed operator decoders__Image encountered: Error in thread 0: [/opt/dali/dali/operators/decoder/nvjpeg/nvjpeg_helper.h:228] [/opt/dali/dali/image/jpeg_mem.cc:176] Assert on "jpeg_start_decompress(&cinfo)" failed
```
I think there is a corrupted file causing this error, but so far I haven't been able to detect which file it is. I've run a script using the verify method from the PIL.Image module, but it hasn't found any corrupted files. So, I would like to know if it is possible to use my current DALI pipeline to track which file is causing it.
The code snippets of my external source and pipeline are provided below, along with a method to verify the images in the pipeline. I cannot share the dataset due to its size and confidentiality. The csv_file is a debug version in which each line has the format `image_path;label`, where label is an integer value representing an image from the dataset.
```python
import random
import numpy as np

class ExternalDebugInputIterator(object):
    def __init__(self, images_dir, batch_size, device_id, num_gpus):
        self.images_dir = images_dir
        self.batch_size = batch_size
        # each line has the format "image_path;label"; keep non-empty entries
        with open(self.images_dir, 'r') as file:
            self.files = [line.rstrip() for line in file
                          if line != '' and line.split(';')[0]]
        self.data_set_len = len(self.files)
        # shard the file list across GPUs
        self.files = self.files[self.data_set_len * device_id // num_gpus:
                                self.data_set_len * (device_id + 1) // num_gpus]
        self.n = len(self.files)

    def __iter__(self):
        self.i = 0
        random.shuffle(self.files)
        return self

    def __next__(self):
        batch = []
        labels = []
        if self.i >= self.n:
            self.__iter__()
            raise StopIteration
        for _ in range(self.batch_size):
            jpeg_filename, label = self.files[self.i % self.n].split(';')
            batch.append(np.fromfile(jpeg_filename, dtype='uint8'))
            labels.append(np.array(label, dtype='int64'))
            self.i += 1
        return (batch, labels)

    def __len__(self):
        return self.data_set_len

    next = __next__  # Python 2 compatibility
```
```python
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.fn as fn
import nvidia.dali.types as types

def ExternalDebugSourcePipeline(batch_size, num_threads, device_id, external_data):
    pipe = Pipeline(batch_size, num_threads, device_id)
    with pipe:
        jpegs, labels = fn.external_source(source=external_data, num_outputs=2)
        images = fn.decoders.image(jpegs, device="mixed")
        images = fn.resize(images, size=(224, 224), device='gpu')
        images = fn.crop_mirror_normalize(images, std=255.0)
        images = fn.cast(images, dtype=types.FLOAT)
        labels = fn.cast(labels, dtype=types.INT64)
        pipe.set_outputs(images, labels)
    return pipe
```
```python
from tqdm import tqdm
from nvidia.dali.plugin.pytorch import DALIGenericIterator, LastBatchPolicy

def verify_dali_pipeline(csv_file):
    names = ['imgs', 'labels']
    print(f"Analysing {csv_file}\n")
    external_source = ExternalDebugInputIterator(csv_file, batch_size=1, device_id=0, num_gpus=1)
    # batch_size of 1 so a failure can be pinned to a single file
    pipe = ExternalDebugSourcePipeline(batch_size=1,
                                       num_threads=1,
                                       device_id=0,
                                       external_data=external_source)
    loader = DALIGenericIterator(pipe, output_map=names,
                                 last_batch_padded=True,
                                 last_batch_policy=LastBatchPolicy.PARTIAL)
    samples = external_source.n
    print(f"{csv_file} has {samples} samples")
    steps = int(np.ceil(samples / (1 * 1)))
    t = tqdm(loader, unit='batch', total=steps)
    with open("error.log", 'a') as log:
        label = None  # label of the last successfully fetched batch
        # the decode error is raised while the iterator fetches the next
        # batch, so the try/except must wrap the iteration itself
        try:
            for d in t:
                data, label = d[0]['imgs'], d[0]['labels']
        except Exception as e:
            print(f'{e}')
            print(f'{label}')  # the failing file is the one right after this label
            log.write(f'{label}\n')
```
Thanks for your time and help!!
Hi @rgsousa88,
This error is coming from the libjpeg-turbo library. It is likely that the data arriving at the decoder is somehow corrupted. To narrow down the problem, I would try the following:
- To rule out any issues with your custom data source, I'd try using `fn.readers.file` instead of the external source and going through the whole dataset. If this works, the issue is likely in the data loader. Note: for this kind of usage (loading files and labels) you don't need the external source; `fn.readers.file` would be enough (see the first sketch after this list).
- Feed the data to the external source in a deterministic order and count the iteration number when the issue happens. Does it always happen on the same image? Is it random? If it always happens on the same sample, modify the pipeline to output the raw JPEG instead of the decoded image and save it to a file again (see the second sketch after this list). Does it match the original file? Can you open this file with an image viewer?
- Try `fn.decoders.image(..., device='cpu')` to rule out an issue in the mixed backend implementation. In any case, your error message comes from libjpeg-turbo (not nvJPEG), which suggests that nvJPEG could not decode the image and we are falling back to a CPU decoder (libjpeg-turbo based) as a last resort.
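For reference, here is a minimal sketch of the `fn.readers.file` variant from the first point. It assumes the csv has been converted to the plain file-list format the reader expects (one space-separated `image_path label` pair per line); `file_list.txt` is just a placeholder name:

```python
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.fn as fn

def FileReaderDebugPipeline(batch_size, num_threads, device_id):
    pipe = Pipeline(batch_size, num_threads, device_id)
    with pipe:
        # readers.file loads the encoded bytes and the integer label itself,
        # so no external source is needed
        jpegs, labels = fn.readers.file(file_list="file_list.txt")
        images = fn.decoders.image(jpegs, device="mixed")
        pipe.set_outputs(images, labels)
    return pipe
```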
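And a sketch of the second point: a pass-through pipeline that skips decoding entirely, so the suspect sample's raw bytes can be dumped back to disk and compared with the original file (the function and file names are illustrative):

```python
import numpy as np
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.fn as fn

def RawJpegDebugPipeline(batch_size, num_threads, device_id, external_data):
    pipe = Pipeline(batch_size, num_threads, device_id)
    with pipe:
        # output the encoded bytes untouched, together with the label
        jpegs, labels = fn.external_source(source=external_data, num_outputs=2)
        pipe.set_outputs(jpegs, labels)
    return pipe

pipe = RawJpegDebugPipeline(1, 1, 0, ExternalDebugInputIterator("debug.csv", 1, 0, 1))
pipe.build()
raw, labels = pipe.run()
np.array(raw[0]).tofile("suspect.jpg")  # compare this file with the original
```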
Let me know if any of those ideas reveal more information. Thanks
Hi @jantonguirao,
First of all, thanks for your time and suggestions. The reason I'm using an ExternalSource as a loader is that my real csv_file (not the debug version described in the post) is in a format that cannot be used with `fn.readers.file`. This "real" annotation file has the format `image_path;label_1 label_2 ... label_n`. So, this is why I'm debugging with the code provided above.
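For context, the real iterator parses those lines along these lines (a simplified sketch; the helper name is just for illustration). `fn.external_source` accepts per-sample arrays of varying length, so the label vector can be fed as the second output:

```python
import numpy as np

def parse_multilabel_line(line):
    # "image_path;label_1 label_2 ... label_n" -> (encoded bytes, int64 vector)
    path, labels = line.rstrip().split(';')
    return np.fromfile(path, dtype='uint8'), np.array(labels.split(), dtype='int64')
```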
With respect to point 2, in the training script the error occurs randomly rather than at a specific step, and I believe this is because the batches are loaded in a non-deterministic order. But I'll try debugging with those suggestions.
Again, thanks for your attention.
You can try disabling the shuffling (comment out `random.shuffle(self.files)`) and/or print the label during `__next__`, so that you can see the order of the files and easily tell whether it's always the same one failing. Also, add `prefetch_queue_depth=1` to the `Pipeline` constructor call inside `ExternalDebugSourcePipeline`, so that you see a single batch at a time (DALI won't try to prefetch more batches than requested).
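A minimal sketch of those two changes (everything else stays as in your snippets):

```python
def ExternalDebugSourcePipeline(batch_size, num_threads, device_id, external_data):
    # prefetch_queue_depth=1 disables read-ahead, so the sample being
    # decoded is always the one that was just printed
    pipe = Pipeline(batch_size, num_threads, device_id, prefetch_queue_depth=1)
    with pipe:
        jpegs, labels = fn.external_source(source=external_data, num_outputs=2)
        images = fn.decoders.image(jpegs, device="mixed")
        pipe.set_outputs(images, labels)
    return pipe

# ... and inside ExternalDebugInputIterator.__next__, with the shuffle
# commented out in __iter__:
#     jpeg_filename, label = self.files[self.i % self.n].split(';')
#     print(jpeg_filename)  # last path printed before the error is the suspect
```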