CMU-MultimodalSDK MOSEI - transcript mis-aligned?

Hello @A2Zadeh

First of all, thank you for sharing all this work.

I am trying to use MOSEI with raw data in order to extract my own acoustic features. So I want to extract audio segments in the wav files corresponding to the sentiment annotations.
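For context, here is a minimal sketch of the extraction I have in mind. It assumes each interval row is [start, end] in seconds measured from the start of the wav file, which is exactly the assumption in question here. The helper names `seconds_to_frames` and `cut_segment` are hypothetical and not part of the SDK; only the standard-library `wave` module is used.

```python
import wave


def seconds_to_frames(start_s, end_s, sample_rate, n_frames):
    """Convert a [start, end] interval in seconds to frame indices.

    Indices are clamped to [0, n_frames], so a slightly negative start
    is treated as the beginning of the file.
    """
    first = min(max(0, int(round(start_s * sample_rate))), n_frames)
    last = min(max(0, int(round(end_s * sample_rate))), n_frames)
    return first, max(first, last)


def cut_segment(src_path, dst_path, start_s, end_s):
    """Copy the audio between start_s and end_s into a new wav file.

    Returns the number of frames written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        first, last = seconds_to_frames(
            start_s, end_s, src.getframerate(), src.getnframes()
        )
        src.setpos(first)
        frames = src.readframes(last - first)

    with wave.open(dst_path, "wb") as dst:
        # The frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)

    return last - first
```

If the intervals are in fact relative to some other origin, an offset would have to be added to `start_s` and `end_s` before the call.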

When I check the timestamps in the 'intervals' structure for the file ZZzdvUdOTww.wav, it appears that the speech is only partly transcribed (about 1 minute, starting at 8 min 50), whereas the word timestamps in the computational sequence are in the range [0.01, 59.99] seconds.

Did I miss a field related to a time offset somewhere in the data, or is there a problem with this specific content?

Thank you in advance.

Mar 23 '21 17:03 wikong

Hi @wikong,

Thanks for your interest in MOSEI; I hope this issue can be easily resolved. I will have a look at it. Does the split video seem to have a different transcription?

Mar 23 '21 20:03 A2Zadeh

I am not sure what "split video" refers to exactly. In the folder Raw/Videos/Segmented/Combined there is actually no file corresponding to that id. In the file Raw/Transcript/Segmented/Combined/ZZzdvUdOTww.txt there are 6 segments corresponding to about 30 seconds of speech (located at 8 min 48 in the original video). The file Raw/Transcript/Full/Combined/ZZzdvUdOTww.plaintext seems to contain the speech transcribed from the whole video. And I get about 60 seconds of transcript (as described in my first message) when I use the Python package as follows:

from mmsdk import mmdatasdk
ds_dic = {}
ds_dic['labels']="mosei/labels/CMU_MOSEI_Labels.csd"
ds_dic['words']="mosei/raw/CMU_MOSEI_TimestampedWords.csd"
ds = mmdatasdk.mmdataset(ds_dic)

key = "ZZzdvUdOTww"

hds = ds.computational_sequences['labels'][key]

print(hds['features'][:,:])
print(hds['intervals'][:,:]) 
# The previous line returns 6 segments between -0.48752834 and 29.05759637
print(ds.computational_sequences['words'][key]['features'][:,:])
print(ds.computational_sequences['words'][key]['intervals'][:,:]) 
# the previous line returns 184 words timestamped from 1.24716553e-02 to 5.99961451e+01
# text : though larceny is already ... vulnerable citizens out of the process
# these words are actually located at 8min48
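To make the mismatch easier to see, the two 'intervals' arrays can be reduced to their overall time spans. This is only a sketch: `span` and `uncovered_tail` are hypothetical helpers, and the rows below are toy placeholders in which only the outer bounds come from the output reported above.

```python
def span(intervals):
    """Return (earliest start, latest end) over rows of [start, end] seconds."""
    starts = [float(row[0]) for row in intervals]
    ends = [float(row[1]) for row in intervals]
    return min(starts), max(ends)


def uncovered_tail(label_intervals, word_intervals):
    """Seconds of timestamped words that lie after the last labelled segment."""
    _, labels_end = span(label_intervals)
    _, words_end = span(word_intervals)
    return max(0.0, words_end - labels_end)


# Toy rows: the inner boundaries are made up, only the outer bounds
# are the values printed for ZZzdvUdOTww.
labels = [[-0.48752834, 4.0], [4.0, 29.05759637]]
words = [[1.24716553e-02, 0.5], [0.5, 5.99961451e+01]]

print(span(labels))
print(span(words))
print(uncovered_tail(labels, words))
```

With the real data, the same functions can be called on `hds['intervals'][:,:]` and on the 'words' intervals directly, since they only iterate over rows.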

Mar 24 '21 09:03 wikong
