CMU-MultimodalSDK
MOSEI - transcript mis-aligned?
Hello @A2Zadeh
First of all, thank you for sharing all this work.
I am trying to use MOSEI with raw data in order to extract my own acoustic features. So I want to extract audio segments in the wav files corresponding to the sentiment annotations.
When I check the timestamps in the 'intervals' structure for the file ZZzdvUdOTww.wav, it appears that the speech is only partly transcribed (about 1 minute, starting at 8 min 50), whereas the word timestamps in the computational sequence are in the range [0.01, 59.99] seconds.
Did I miss a field related to a time offset somewhere in the data, or is there a problem with this specific content?
Thank you in advance.
Hi @wikong,
Thanks for your interest in MOSEI; I hope this issue can be easily resolved. I will have a look at this. Does the split video seem to have a different transcription?
I am not sure what "split video" refers to exactly.
In the folder Raw/Videos/Segmented/Combined there is actually no file corresponding to that ID.
In the file Raw/Transcript/Segmented/Combined/ZZzdvUdOTww.txt there are 6 segments corresponding to about 30 seconds of speech (located at 8min48 in the original video).
The file Raw/Transcript/Full/Combined/ZZzdvUdOTww.plaintext seems to contain the speech transcribed from the whole video.
And I get about 60 seconds of transcript (as described in my first message) when I use the Python package as follows:
from mmsdk import mmdatasdk
ds_dic = {}
ds_dic['labels']="mosei/labels/CMU_MOSEI_Labels.csd"
ds_dic['words']="mosei/raw/CMU_MOSEI_TimestampedWords.csd"
ds = mmdatasdk.mmdataset(ds_dic)
key = "ZZzdvUdOTww"
hds = ds.computational_sequences['labels'][key]
print(hds['features'][:,:])
print(hds['intervals'][:,:])
# The previous line returns 6 segments between -0.48752834 and 29.05759637
print(ds.computational_sequences['words'][key]['features'][:,:])
print(ds.computational_sequences['words'][key]['intervals'][:,:])
# the previous line returns 184 words timestamped from 1.24716553e-02 to 5.99961451e+01
# text : though larceny is already ... vulnerable citizens out of the process
# these words are actually located at 8min48
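For reference, here is a minimal sketch of the extraction step I am trying to do once the timestamps are sorted out: cutting one interval out of the original wav file. It uses only the standard-library `wave` module. The names `interval_to_frames`, `extract_segment` and the `offset_s` parameter are my own illustration, not part of mmsdk, and whether any such offset exists in the data is exactly the open question above. Negative start times (like the -0.4875 in the labels) are clamped to the beginning of the file.

```python
import wave


def interval_to_frames(start_s, end_s, frame_rate, n_frames, offset_s=0.0):
    """Convert an interval in seconds to a clamped (first, last) frame range.

    offset_s is a hypothetical shift added to both bounds, e.g. if the
    interval turns out to be relative to a point inside the original file.
    """
    first = int(round((start_s + offset_s) * frame_rate))
    last = int(round((end_s + offset_s) * frame_rate))
    # Clamp to the file bounds; this also handles negative start times.
    first = max(0, min(first, n_frames))
    last = max(first, min(last, n_frames))
    return first, last


def extract_segment(src_path, dst_path, start_s, end_s, offset_s=0.0):
    """Copy the audio in [start_s, end_s] from src_path into dst_path.

    Returns the number of frames written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        first, last = interval_to_frames(
            start_s, end_s, src.getframerate(), src.getnframes(), offset_s
        )
        src.setpos(first)
        frames = src.readframes(last - first)

    with wave.open(dst_path, "wb") as dst:
        # Same channels, sample width and rate as the source; the frame
        # count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)

    return last - first
```

With the label intervals printed above, a call would look like `extract_segment("ZZzdvUdOTww.wav", "seg0.wav", start, end)` for each row of `hds['intervals']`. If the intervals really are relative to 8min48 in the original video, passing `offset_s=528.0` would shift them accordingly, but that value is only my guess from listening to the file.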