How to make shard files of the Howto100M dataset?

Open Y-Haneji opened this issue 11 months ago • 5 comments

@hadjisma @lavenderrz Your code mentions a directory containing shard files formatted as tar. https://github.com/SamsungLabs/StepFormer/blob/31f62679536177e7bc8e132b5611ee596f427fab/data/tar_loader.py#L35

[Question] Could you share code to reproduce them, or at least the key-value pairs written into each sample?

Y-Haneji avatar Mar 02 '24 03:03 Y-Haneji

Hi, the video features in these tar files were created by extracting MIL-NCE features following https://github.com/ArrowLuo/VideoFeatureExtractor, followed by the UniVL model.

lavenderrz avatar Mar 05 '24 21:03 lavenderrz

Thank you for replying! I'll try it.

Y-Haneji avatar Mar 06 '24 14:03 Y-Haneji

@lavenderrz You split the train/val by shard files. https://github.com/SamsungLabs/StepFormer/blob/31f62679536177e7bc8e132b5611ee596f427fab/data/tar_loader.py#L38

What's the unit of a shard file? Does a shard file correspond to ONE video, as with wds.ShardWriter(f'shards-{video_id}.tar'), or to SEVERAL videos, as with wds.ShardWriter('shards-%05d.tar', maxsize=int(50 * 1000**2)) (a 50 MB cap)?
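
To make the two options concrete, here is a small standard-library sketch. `assign_shards` roughly mimics a size-capped writer that starts a new shard once the current one has reached `maxsize` bytes, so each shard holds several videos; `split_shards` then holds out whole shard files for validation. The function names, the sample sizes, and the hold-out rule are all assumptions for illustration, not the StepFormer authors' procedure.

```python
def assign_shards(sample_sizes, maxsize):
    """Return a shard index per sample, rolling over once a shard reaches maxsize bytes."""
    shard_ids, shard, used = [], 0, 0
    for size in sample_sizes:
        # The cap is checked before writing, so a shard may overshoot by one sample.
        if used >= maxsize:
            shard, used = shard + 1, 0
        shard_ids.append(shard)
        used += size
    return shard_ids


def split_shards(shard_paths, num_val):
    """Hold out the last num_val shard files for validation (hypothetical rule)."""
    ordered = sorted(shard_paths)
    cut = len(ordered) - num_val
    return ordered[:cut], ordered[cut:]


# Four hypothetical videos of 30 MB each under a 50 MB cap.
sizes = [30 * 1000**2] * 4
print(assign_shards(sizes, maxsize=50 * 1000**2))

# Split four hypothetical shard files, keeping one for validation.
paths = [f"shards-{i:05d}.tar" for i in range(4)]
train, val = split_shards(paths, num_val=1)
print(train, val)
```

With one video per shard, the split is effectively per video; with size-capped shards, every video in a held-out shard moves to validation together.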

Y-Haneji avatar Apr 15 '24 07:04 Y-Haneji

@Y-Haneji Hi, have you tried extracting features with UniVL? Could you share the script for that? It would help a lot, thank you!

HankKung avatar Jul 21 '24 20:07 HankKung

@HankKung No, I haven't. I used a different encoder, and I can't share the full code because it is part of ongoing research. Below is pseudo-code that I hope helps. For anything further, please ask the authors.

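import webdataset as wds

# `videos` is a placeholder: an iterable of per-video dicts holding the fields used below.
# maxsize is in bytes; a new shard is started once the current one reaches about 500 MB.
with wds.ShardWriter("shard-%06d.tar", maxsize=int(5e8)) as sink:
    for video in videos:
        sample = {
            "__key__": video["name"],  # unique key for this sample
            "pickle": {  # stored as a single pickled dict per sample
                "video_features": video["video_features"],
                "text_features": video["text_features"],
                "json": video["annotations"],
                "name": video["name"],
            },
        }
        sink.write(sample)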
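
A shard written this way is an ordinary tar archive in which each sample is stored as a member named `<__key__>.pickle`. The sketch below uses only the standard library to write one such member into an in-memory tar and read it back, which can help when checking what a shard contains. The helper names `write_sample` and `read_samples`, the feature values, and the annotation content are placeholders; the field names follow the pseudo-code above and have not been confirmed by the StepFormer authors.

```python
import io
import pickle
import tarfile


def write_sample(tar, key, payload):
    """Add one sample to an open tar as '<key>.pickle' (WebDataset-style naming)."""
    data = pickle.dumps(payload)
    info = tarfile.TarInfo(name=f"{key}.pickle")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def read_samples(fileobj):
    """Yield (key, payload) for every '.pickle' member of a tar stream."""
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".pickle"):
                key = member.name[: -len(".pickle")]
                yield key, pickle.load(tar.extractfile(member))


# Placeholder content: plain lists stand in for real feature arrays.
payload = {
    "video_features": [[0.0, 1.0], [2.0, 3.0]],
    "text_features": [[0.5, 0.5]],
    "json": {"steps": ["crack egg", "whisk"]},
    "name": "video_0001",
}

# Write a one-sample shard into memory instead of onto disk.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    write_sample(tar, payload["name"], payload)

# Read the shard back and list the fields of each sample.
buffer.seek(0)
for key, sample in read_samples(buffer):
    print(key, sorted(sample))
```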

Y-Haneji avatar Jul 31 '24 06:07 Y-Haneji