StepFormer
How to make shard files of the Howto100M dataset?
@hadjisma @lavenderrz Your code mentions a directory containing shard files formatted as tar. https://github.com/SamsungLabs/StepFormer/blob/31f62679536177e7bc8e132b5611ee596f427fab/data/tar_loader.py#L35
[Question] Could you share code to reproduce them, or at least the key-value pairs that each sample should contain?
Hi, the video features in these tar files are created by extracting MIL-NCE features with https://github.com/ArrowLuo/VideoFeatureExtractor, followed by the UniVL model.
Thank you for replying! I'll try it.
@lavenderrz You split the train/val by shard files. https://github.com/SamsungLabs/StepFormer/blob/31f62679536177e7bc8e132b5611ee596f427fab/data/tar_loader.py#L38
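What is the unit of a shard file? Does one shard file hold ONE video or SEVERAL videos? In other words, which of these is closer to how the shards were written:
wds.ShardWriter(f'shards-{video_id}.tar')  # one shard per video
wds.ShardWriter('shards-%05d.tar', maxsize=int(50 * 1000**2))  # several videos per shard, about 50 MB each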
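For readers comparing the two options, the difference is only in when a new tar file is started. Below is a minimal sketch of size-based rollover, assuming a new shard begins once the current one has already reached the size limit. The function `assign_shards` and the sample sizes are hypothetical, and this is a simplified model rather than webdataset's own code. The thread does not confirm which scheme the authors used.

```python
def assign_shards(sample_sizes, maxsize):
    """Return a shard index per sample under size-based rollover.

    Assumption: a new shard starts only when the current shard has
    already reached `maxsize`, so a shard may end up slightly larger
    than the limit.
    """
    shard_ids = []
    shard, used = 0, 0
    for size in sample_sizes:
        if used >= maxsize:
            # Current shard is full: roll over to the next one.
            shard += 1
            used = 0
        shard_ids.append(shard)
        used += size
    return shard_ids


MB = 1000 ** 2
sizes = [30 * MB] * 5  # five hypothetical videos of 30 MB each
print(assign_shards(sizes, 50 * MB))  # [0, 0, 1, 1, 2]
```

With one shard per video, every sample would get its own index instead, so five videos would give five tar files.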
@Y-Haneji Hi, have you tried extracting features with UniVL? Could you share your script for that? It would help a lot, thank you!
@HankKung No, I haven't. I used a different encoder, and I can't share the full code because it is part of ongoing research. Below is pseudo-code that I hope helps. Please ask the authors if you have more questions.
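import webdataset as wds

# `videos` is a placeholder: an iterable yielding, per video, its name,
# video features, text features, and annotations.
with wds.ShardWriter("shard-%06d.tar", maxsize=5e8) as sink:  # roll over at about 500 MB
    for name, video_features, text_features, annotations in videos:
        sample = {
            "__key__": name,  # sample key; becomes the file name stem inside the tar
            "pickle": {  # pickled and stored as <name>.pickle
                "video_features": video_features,
                "text_features": text_features,
                "json": annotations,
                "name": name,
            },
        }
        sink.write(sample)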
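To check what such a shard contains, note that it is an ordinary tar file with one `<key>.pickle` member per sample. The following is a minimal standard-library sketch that writes and reads back that layout without webdataset. The helpers `write_shard`, `read_shard`, and `roundtrip_demo` and all feature values are hypothetical, and the field names are taken from the pseudo-code above, not confirmed against `tar_loader.py`.

```python
import io
import os
import pickle
import tarfile
import tempfile


def write_shard(path, samples):
    """Write (key, payload) pairs to a tar file as `<key>.pickle` members."""
    with tarfile.open(path, "w") as tar:
        for key, payload in samples:
            data = pickle.dumps(payload)
            info = tarfile.TarInfo(name=f"{key}.pickle")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


def read_shard(path):
    """Yield (key, payload) pairs from a shard written by write_shard.

    Assumes keys contain no dots. Only unpickle shards you trust.
    """
    with tarfile.open(path, "r") as tar:
        for member in tar:
            key, _, ext = member.name.rpartition(".")
            if ext == "pickle":
                yield key, pickle.loads(tar.extractfile(member).read())


def roundtrip_demo():
    """Write one placeholder sample to a temporary shard and read it back."""
    sample = {
        "video_features": [[0.1, 0.2], [0.3, 0.4]],  # placeholder values
        "text_features": [[0.5, 0.6]],  # placeholder values
        "json": {"steps": []},  # placeholder annotations
        "name": "video_0001",
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shard-000000.tar")
        write_shard(path, [(sample["name"], sample)])
        return list(read_shard(path))


if __name__ == "__main__":
    for key, payload in roundtrip_demo():
        print(key, sorted(payload))
```

Running `read_shard` on a shard of your own is a quick way to compare its keys with what the loader expects before starting training.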