Is it possible to debug a DALI pipeline?
Hi,
I have a DALI pipeline with an external source reader, but it fails when reading some images. The error is related to the nvJPEG decoding process and is shown below:
```
Error when executing Mixed operator decoders__Image encountered: Error in thread 0: [/opt/dali/dali/operators/decoder/nvjpeg/nvjpeg_helper.h:228] [/opt/dali/dali/image/jpeg_mem.cc:176] Assert on "jpeg_start_decompress(&cinfo)" failed
```
I think there is a corrupted file causing this error, but so far I haven't been able to detect which file it is. I've run a script using the verify method from the PIL.Image module, but it hasn't found any corrupted files. So, I would like to know if it is possible to use my current DALI pipeline to track which file is causing it.
The code snippets of my external source and pipeline are provided below, along with a method to verify the images in the pipeline. I cannot share the dataset due to its size and confidentiality. The csv_file is a debug version in which each line has the format `image_path;label`, where label is an integer value representing an image from the dataset.
```python
import random
import numpy as np

class ExternalDebugInputIterator(object):
    def __init__(self, images_dir, batch_size, device_id, num_gpus):
        self.images_dir = images_dir
        self.batch_size = batch_size
        # each line has the format "image_path;label"; keep non-empty entries
        with open(self.images_dir, 'r') as file:
            self.files = [line.rstrip() for line in file
                          if line != '' and line.split(';')[0]]
        self.data_set_len = len(self.files)
        # shard the file list across GPUs
        self.files = self.files[self.data_set_len * device_id // num_gpus:
                                self.data_set_len * (device_id + 1) // num_gpus]
        self.n = len(self.files)

    def __iter__(self):
        self.i = 0
        random.shuffle(self.files)
        return self

    def __next__(self):
        batch = []
        labels = []
        if self.i >= self.n:
            self.__iter__()
            raise StopIteration
        for _ in range(self.batch_size):
            jpeg_filename, label = self.files[self.i % self.n].split(';')
            batch.append(np.fromfile(jpeg_filename, dtype='uint8'))
            labels.append(np.array(label, dtype='int64'))
            self.i += 1
        return (batch, labels)

    def __len__(self):
        return self.data_set_len

    next = __next__  # Python 2 compatibility
```
```python
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.fn as fn
import nvidia.dali.types as types

def ExternalDebugSourcePipeline(batch_size, num_threads, device_id, external_data):
    pipe = Pipeline(batch_size, num_threads, device_id)
    with pipe:
        jpegs, labels = fn.external_source(source=external_data, num_outputs=2)
        images = fn.decoders.image(jpegs, device="mixed")
        images = fn.resize(images, size=(224, 224), device='gpu')
        images = fn.crop_mirror_normalize(images, std=255.0)
        images = fn.cast(images, dtype=types.FLOAT)
        labels = fn.cast(labels, dtype=types.INT64)
        pipe.set_outputs(images, labels)
    return pipe
```
```python
from tqdm import tqdm
from nvidia.dali.plugin.pytorch import DALIGenericIterator, LastBatchPolicy

def verify_dali_pipeline(csv_file):
    names = ['imgs', 'labels']
    print(f"Analysing {csv_file}\n")
    external_source = ExternalDebugInputIterator(csv_file, batch_size=1, device_id=0, num_gpus=1)
    # batch_size of 1 so a failure can be pinned to a single file
    pipe = ExternalDebugSourcePipeline(batch_size=1,
                                       num_threads=1,
                                       device_id=0,
                                       external_data=external_source)
    loader = DALIGenericIterator(pipe, output_map=names,
                                 last_batch_padded=True,
                                 last_batch_policy=LastBatchPolicy.PARTIAL)
    samples = external_source.n
    print(f"{csv_file} has {samples} samples")
    steps = int(np.ceil(samples / (1 * 1)))
    t = tqdm(loader, unit='batch', total=steps)
    with open("error.log", 'a') as log:
        label = None  # label of the last successfully fetched batch
        # the decode error is raised while the iterator fetches the next
        # batch, so the try/except must wrap the iteration itself
        try:
            for d in t:
                data, label = d[0]['imgs'], d[0]['labels']
        except Exception as e:
            print(f'{e}')
            print(f'{label}')  # the failing file is the one right after this label
            log.write(f'{label}\n')
```
Thanks for your time and help!!
Hi @rgsousa88,
This error is coming from the libjpeg-turbo library. It is likely that the data arriving at the decoder is somehow corrupted. To narrow down the problem, I would try the following:
- To rule out any issues with your custom data source, I'd try using `fn.readers.file` instead of the external source and going through the whole dataset. If this works, the issue is likely in the data loader. Note: for this kind of usage (loading files and labels) you don't need the external source; `fn.readers.file` would be enough (see the first sketch after this list).
- Feed the data to the external source in a deterministic order and count the iteration number when the issue happens. Does it always happen on the same image? Is it random? If it always happens on the same sample, modify the pipeline to output the raw JPEG instead of the decoded image and save it to a file again (see the second sketch after this list). Does it match the original file? Can you open this file with an image viewer?
- Try `fn.decoders.image(..., device='cpu')` to rule out an issue in the mixed backend implementation. In any case, your error message comes from libjpeg-turbo (not nvJPEG), which suggests that nvJPEG could not decode the image and we are falling back to a CPU decoder (libjpeg-turbo based) as a last resort.
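For reference, here is a minimal sketch of the `fn.readers.file` variant from the first point. It assumes the csv has been converted to the plain file-list format the reader expects (one space-separated `image_path label` pair per line); `file_list.txt` is just a placeholder name:

```python
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.fn as fn

def FileReaderDebugPipeline(batch_size, num_threads, device_id):
    pipe = Pipeline(batch_size, num_threads, device_id)
    with pipe:
        # readers.file loads the encoded bytes and the integer label itself,
        # so no external source is needed
        jpegs, labels = fn.readers.file(file_list="file_list.txt")
        images = fn.decoders.image(jpegs, device="mixed")
        pipe.set_outputs(images, labels)
    return pipe
```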
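And a sketch of the second point: a pass-through pipeline that skips decoding entirely, so the suspect sample's raw bytes can be dumped back to disk and compared with the original file (the function and file names are illustrative):

```python
import numpy as np
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.fn as fn

def RawJpegDebugPipeline(batch_size, num_threads, device_id, external_data):
    pipe = Pipeline(batch_size, num_threads, device_id)
    with pipe:
        # output the encoded bytes untouched, together with the label
        jpegs, labels = fn.external_source(source=external_data, num_outputs=2)
        pipe.set_outputs(jpegs, labels)
    return pipe

pipe = RawJpegDebugPipeline(1, 1, 0, ExternalDebugInputIterator("debug.csv", 1, 0, 1))
pipe.build()
raw, labels = pipe.run()
np.array(raw[0]).tofile("suspect.jpg")  # compare this file with the original
```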
Let me know if any of those ideas reveal more information. Thanks
Hi @jantonguirao,
First of all, thanks for your time and suggestions. The reason I'm using an ExternalSource as a loader is that my real csv_file (not the debug version described in the post) is in a format that cannot be used with `fn.readers.file`. This "real" annotation file has the format `image_path;label_1 label_2 ... label_n`. So, this is why I'm debugging with the code provided above.
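For context, the real iterator parses those lines along these lines (a simplified sketch; the helper name is just for illustration). `fn.external_source` accepts per-sample arrays of varying length, so the label vector can be fed as the second output:

```python
import numpy as np

def parse_multilabel_line(line):
    # "image_path;label_1 label_2 ... label_n" -> (encoded bytes, int64 vector)
    path, labels = line.rstrip().split(';')
    return np.fromfile(path, dtype='uint8'), np.array(labels.split(), dtype='int64')
```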
With respect to point 2, in the training script the error occurs randomly rather than at a specific step, and I believe this is because the batches are loaded in a non-deterministic order. But I'll try debugging with those suggestions.
Again, thanks for your attention.
You can try disabling the shuffling (comment out `random.shuffle(self.files)`) and/or print the label during `__next__`, so that you can see the order of the files and easily tell whether it's always the same one failing. Also, add `prefetch_queue_depth=1` to the `Pipeline` constructor call inside `ExternalDebugSourcePipeline`, so that you see a single batch at a time (DALI won't try to prefetch more batches than requested).
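A minimal sketch of those two changes (everything else stays as in your snippets):

```python
def ExternalDebugSourcePipeline(batch_size, num_threads, device_id, external_data):
    # prefetch_queue_depth=1 disables read-ahead, so the sample being
    # decoded is always the one that was just printed
    pipe = Pipeline(batch_size, num_threads, device_id, prefetch_queue_depth=1)
    with pipe:
        jpegs, labels = fn.external_source(source=external_data, num_outputs=2)
        images = fn.decoders.image(jpegs, device="mixed")
        pipe.set_outputs(images, labels)
    return pipe

# ... and inside ExternalDebugInputIterator.__next__, with the shuffle
# commented out in __iter__:
#     jpeg_filename, label = self.files[self.i % self.n].split(';')
#     print(jpeg_filename)  # last path printed before the error is the suspect
```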