
text + image dataset processing

Open skull8888888 opened this issue 3 years ago • 4 comments

I am trying to finetune only the projection layers of a CLIP model, so my pipeline is very CPU-bottlenecked. I think (correct me if I am wrong) that DALI is a perfect fit for my case. However, I am struggling to understand how to include tokenization of the text input in the pipeline. I am using webdataset for data storage.

After looking through the docs I found a pipeline example.

from nvidia.dali import fn, pipeline_def, types

# batch_size and wds_data are assumed to be defined elsewhere.
@pipeline_def(batch_size=batch_size, num_threads=4, device_id=0)
def wds_pipeline(wds_data=wds_data):
    # Read the jpg/cls pairs from the webdataset tar files.
    img_raw, cls = fn.readers.webdataset(
        paths=wds_data,
        ext=["jpg", "cls"],
        missing_component_behavior="error")
    # Decode on the GPU ("mixed"), then resize and normalize there as well.
    img = fn.decoders.image(img_raw, device="mixed", output_type=types.RGB)
    resized = fn.resize(img, device="gpu", resize_shorter=256.)
    output = fn.crop_mirror_normalize(
        resized,
        dtype=types.FLOAT,
        crop=(224, 224),
        mean=[0., 0., 0.],
        std=[1., 1., 1.])
    return output, cls

In my case, cls should be replaced with json. But then I will also need to access json["caption"] and tokenize that text. I am not sure how to do that, and I can't find anything related in the docs. Any help would be greatly appreciated.

skull8888888 avatar Aug 31 '22 10:08 skull8888888

Hi @skull8888888 !

Correct, DALI would help you remove the CPU bottleneck. One question, though: is the tokenization the operation that creates the bottleneck? If so, I'm afraid we can't help you much - tokenization itself does not gain much from a GPU implementation and DALI does not offer such an operator. Moreover, it's not really possible to work with strings in DALI.

However, if it's not the tokenization, you may certainly benefit from using DALI. In that case, I'd advise performing the tokenization outside of DALI and then passing the tokenized data alongside the DALI pipeline's output in the iterator. You can find an example in this Jasper preprocessing pipeline: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/Jasper/common/dali

You can find a custom iterator there: https://github.com/NVIDIA/DeepLearningExamples/blob/475cff63464736d836e1113a9839fc176bf67b38/PyTorch/SpeechRecognition/Jasper/common/dali/iterator.py#L36

which contains a DALI iterator inside: https://github.com/NVIDIA/DeepLearningExamples/blob/475cff63464736d836e1113a9839fc176bf67b38/PyTorch/SpeechRecognition/Jasper/common/dali/iterator.py#L52

The custom iterator grabs the data from DALI as well as the normalized string and returns both: https://github.com/NVIDIA/DeepLearningExamples/blob/475cff63464736d836e1113a9839fc176bf67b38/PyTorch/SpeechRecognition/Jasper/common/dali/iterator.py#L103
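In short, the pattern boils down to something like this (a minimal sketch, not the Jasper code itself; extract_captions and tokenize are hypothetical placeholders for whatever parsing and tokenizer you use):

from nvidia.dali.plugin.pytorch import DALIGenericIterator

class TokenizingIterator:
    """Wraps a DALI iterator; images come from DALI, text is tokenized here."""

    def __init__(self, pipeline):
        # DALI does the heavy image work on the GPU...
        self.dali_it = DALIGenericIterator(pipeline, ["jpg", "json"])

    def __iter__(self):
        return self

    def __next__(self):
        batch = next(self.dali_it)
        # ...while the strings are handled on the CPU, outside of DALI.
        captions = extract_captions(batch[0]["json"])  # hypothetical helper
        return batch[0]["jpg"], tokenize(captions)     # hypothetical tokenizer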

Hopefully this is clear enough. Should you have any more questions, do not hesitate to ask.

szalpal avatar Aug 31 '22 11:08 szalpal

Thank you for such a swift and detailed response! Tokenization is not a bottleneck at all; I am just looking for ways to incorporate it into the DALI pipeline. Do you think it would still be a good idea to tokenize on the fly, i.e. include the tokenizer in the custom iterator? I think it would be a bit cumbersome to add the tokenized arrays to the webdataset itself.

skull8888888 avatar Aug 31 '22 17:08 skull8888888

It turns out the webdataset reader converts the json component into raw ASCII codes and packages them into long-type tensors. I pad these tensors, and using the links you provided, this is what I ended up doing.

import json

import clip
from nvidia.dali import fn, pipeline_def, types
from nvidia.dali.plugin.pytorch import DALIGenericIterator, LastBatchPolicy


@pipeline_def
def wds_pipeline(urls):
    img_raw, json_raw = fn.readers.webdataset(
        paths=urls,
        ext=["jpg", "json"],
        missing_component_behavior="error",
        name="Reader")
    img = fn.decoders.image(img_raw, device="mixed", output_type=types.RGB)
    resized = fn.resize(img, device="gpu", resize_shorter=256.)
    # Note: crop_mirror_normalize applies mean/std in the input's 0-255
    # range, so CLIP's normalization constants need to be scaled by 255.
    output = fn.crop_mirror_normalize(
        resized,
        dtype=types.FLOAT,
        crop=(224, 224),
        device="gpu",
        mean=[0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255],
        std=[0.26862954 * 255, 0.26130258 * 255, 0.27577711 * 255])
    # The json component comes out as raw ASCII bytes; pad so that all
    # samples in a batch have the same length.
    return output, fn.pad(json_raw, device="cpu")


class DaliCLIPIterator(object):

    def __init__(self, urls):
        pipeline = wds_pipeline(batch_size=16, num_threads=4, device_id=0, urls=urls)

        self.dali_it = DALIGenericIterator(
            pipeline,
            ["jpg", "json"],
            dynamic_shape=False,
            reader_name="Reader",
            auto_reset=True,
            prepare_first_batch=False,
            last_batch_policy=LastBatchPolicy.DROP)

    def __next__(self):
        data = self.dali_it.__next__()

        # Strip the zero padding, decode the bytes back into a json string
        # and pull out the caption.
        raw = data[0]["json"].numpy()
        captions = []
        for el in raw:
            json_str = "".join(chr(i) for i in el if i != 0)
            captions.append(json.loads(json_str)["caption"])

        # Tokenize on the CPU, outside of DALI.
        return data[0]["jpg"], clip.tokenize(captions, truncate=True)

    def next(self):
        return self.__next__()

    def __iter__(self):
        return self

Does it seem like a reasonable approach?

It seems to work as an iterator. Now I am wondering whether I can use it directly with the torch DataLoader, or whether I should write something specific, since the example you provided seems to implement its own DataLoader.

skull8888888 avatar Aug 31 '22 23:08 skull8888888

Hi @skull8888888,

Your implementation looks okay. I don't think you need a DataLoader - DaliCLIPIterator should do, and you can simply iterate over it to get the data batches.
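For example (a minimal sketch; urls and the training step are placeholders):

it = DaliCLIPIterator(urls)
for images, tokens in it:
    # images is a GPU tensor, tokens the CLIP-tokenized captions;
    # run your forward pass / optimizer step here.
    ...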

JanuszL avatar Sep 01 '22 09:09 JanuszL