Extracting BSL-1k clips from BOBSL

hshreeshail opened this issue on Jan 23 '23 · 5 comments

This issue is in reference to extracting the video clips of individual signs from BOBSL that form the BSL-1k dataset. In mouthing_spottings.json, the global_times annotation is a single timestamp value (instead of a (start, end) range). How do I extract the corresponding clip from this? Are all the clips of the same length?

hshreeshail · Jan 23 '23, 13:01

I have just read section A.3 of the appendix. So, can I assume that the timestamped frame is the last frame of the clip and take the 24 frames before it? P.S.: I am assuming that the mouthing_spottings.json file in the BOBSL dataset corresponds to BSL-1K.

hshreeshail · Jan 23 '23, 16:01

A couple more queries: 1] Why are the global_times annotations given in seconds rather than as frame numbers? Is it to allow for different frame rates? 2] For the default setting with frame_rate = 25, if a timestamp is sss.mmm (seconds and milliseconds), shouldn't the milliseconds part be a multiple of 40 (= 1000/25)? But the values in the annotations file do not satisfy this property.

hshreeshail · Jan 24 '23, 09:01

Thank you for your questions. We've addressed them below; please let us know if anything is unclear:

(1) BOBSL vs BSL-1K - Although BOBSL and BSL-1K are constructed in a similar manner, they cover different sign-language-interpreted TV shows and therefore contain different annotations. BSL-1K is described in the paper "BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues" (ECCV'20) and is not released. BOBSL is released and is described in the paper "BOBSL: BBC-Oxford British Sign Language Dataset" (arXiv'21).

(2) The annotation is a single timestamp - Yes, we only record the point in time that gives the maximum response over a search timeline. Since these are automatically mined annotations, we do not have accurate start/end times. We experimentally determine a fixed clip length around these points based on the annotation type; see the next point.

(3) How to extract clips from the annotations? Are all the clips of the same length? - For the original annotations from spottings.tar.gz, the windows around the mouthing (M), dictionary (D) and attention (A) times should be [-15, 4], [-3, 22] and [-8, 18] frames, respectively. Newer and more numerous annotations for BOBSL can be downloaded from the ECCV'22 paper. From experimenting with different windows around the annot_time key, we find the following to work best: M* [-9, 11], D* [-3, 22], P [0, 19], E [0, 19], N [0, 19]. Please find details on these annotations in the paper at this link. We randomly sample 16 contiguous frames from these windows for training, and perform sliding-window averaging at test time. Moreover, see the helper script at misc/bsl1k/extract_clips.py, which you would need to modify by setting the --num_frames_before and --num_frames_after arguments; a rough sketch of the idea is given below.
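As an illustration only (not the official misc/bsl1k/extract_clips.py script), here is a minimal sketch of how these windows could be applied, assuming 25 fps video, the window offsets listed above, and a hypothetical helper name clip_frames:

```python
import random

# Frame windows relative to the annotation frame, as quoted in the answer above.
# M = mouthing, D = dictionary, A = attention (original spottings.tar.gz);
# M*, D*, P, E, N refer to the newer ECCV'22 annotations.
WINDOWS = {
    "M": (-15, 4), "D": (-3, 22), "A": (-8, 18),
    "M*": (-9, 11), "D*": (-3, 22), "P": (0, 19), "E": (0, 19), "N": (0, 19),
}

FPS = 25        # assumed frame rate of the videos
CLIP_LEN = 16   # contiguous frames sampled for training, per the answer above


def clip_frames(annot_time_sec, annot_type, train=True):
    """Return frame indices for one spotting annotation (illustrative only)."""
    center = round(annot_time_sec * FPS)        # timestamp (s) -> frame index
    lo, hi = WINDOWS[annot_type]
    window = list(range(center + lo, center + hi + 1))
    if train:
        # randomly sample 16 contiguous frames from the window
        start = random.randint(0, len(window) - CLIP_LEN)
        return window[start:start + CLIP_LEN]
    # at test time, slide a 16-frame window over this range and average predictions
    return window


# Example: a mouthing spotting at 101.327 s
print(clip_frames(101.327, "M"))
```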

(4) Why are the global_times annotations in seconds rather than frame numbers? Is it to allow for different frame rates? - Yes; you should be able to find the corresponding frames easily.

(5) For the default setting with frame_rate = 25, if a timestamp is sss.mmm (seconds and milliseconds), shouldn't the milliseconds part be a multiple of 40 (= 1000/25)? - We've rounded the times to 3 decimal places.
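In other words, the stored times are not snapped to the 40 ms frame grid; rounding the product of the timestamp and the frame rate recovers the frame. A minimal illustration, assuming 25 fps:

```python
FPS = 25

def time_to_frame(t_sec):
    # global_times are rounded to 3 decimal places, so t_sec * FPS is rarely
    # an exact integer; rounding snaps it back to the nearest frame.
    return round(t_sec * FPS)

print(time_to_frame(101.327))  # 101.327 * 25 = 2533.175 -> frame 2533
```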

gulvarol · Jan 27 '23, 14:01

Thanks for the clarification. Is there any estimate of if/when BSL-1K will be released? Thank you.

hshreeshail · Jan 27 '23, 19:01

BSL-1K will not be released; sorry for the outdated repository. We have released BOBSL instead, and have reproduced on it all the papers where we had used BSL-1K.

gulvarol · Jan 30 '23, 20:01