ExternSprintDataset with non-ascii speaker names
I have a corpus with segments that have
<speaker name="Іrina Gerashchenko"/>
attached to them.
When I want to use it in an ExternSprintDataset (e.g. with dump dataset), the process crashes with
EXCEPTION
Traceback (most recent call last):
File "/home/wmichel/software/returnn/returnn/datasets/sprint.py", line 895, in ExternSprintDataset._reader_thread_proc
line: data_type, args = self._read_next_raw()
locals:
data_type = <local> b'data', len = 4
args = <local> (b'corpus/Axxon.MixedMedia-batch.1/Address_PromovaPoroshenko_20190606/Address_PromovaPoroshenko_20190606_00097_1139.98-1153.9', array([[-3.4821649 , -2.9676483 , -2.2865193 , ..., 0.3463574 ,
0.13466196, 0.18409058],
[-0.52613807, -0.6804885 , -1.2733397 , ..., 1.402299 ,
..., _[0]: {len = 122}
self = <local> <ExternSprintDataset 'dataset_id139711693819408' epoch=1>
self._read_next_raw = <local> <bound method ExternSprintDataset._read_next_raw of <ExternSprintDataset 'dataset_id139711693819408' epoch=1>>
File "/home/wmichel/software/returnn/returnn/datasets/sprint.py", line 856, in ExternSprintDataset._read_next_raw
line: data_type, args = Unpickler(stream, encoding="bytes").load()
locals:
data_type = <not found>
args = <not found>
u = <local> <_pickle.Unpickler object at 0x7f4bea8ed160>
u.load = <local> <built-in method load of _pickle.Unpickler object at 0x7f4bea8ed160>
TypeError: 'bytes' object is not callable
I use Python 3.6 with RETURNN (07ed4a7d4b2b7aa17a8c1eb6965754d4cf3521a3) and RASR (32b02bdb2c80d9e38c60da40246f461766751b55, the AppTek version plus a cherry-pick of commit 843d1eb5f2137b57dcc6cf4bc9dce44e85d66f27 from GitHub), which is also compiled with Python 3.6.
$ echo "Іrina Gerashchenko" | hexdump -C
00000000 d0 86 72 69 6e 61 20 47 65 72 61 73 68 63 68 65 |..rina Gerashche|
00000010 6e 6b 6f 0a |nko.|
00000014
$ echo "Irina Gerashchenko" | hexdump -C
00000000 49 72 69 6e 61 20 47 65 72 61 73 68 63 68 65 6e |Irina Gerashchen|
00000010 6b 6f 0a |ko.|
00000013
The upper one throws the error; the lower one (or removing the speaker info) works fine.
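The two names look identical on screen but are different strings. A minimal Python sketch (independent of RETURNN and RASR; the variable names are only for illustration) that makes the difference visible by comparing character count and UTF-8 byte count:

```python
# The two speaker names from the hexdumps above.
cyrillic_name = "\u0406rina Gerashchenko"  # starts with U+0406, not Latin "I"
latin_name = "Irina Gerashchenko"

for name in (cyrillic_name, latin_name):
    raw = name.encode("utf-8")
    hex_bytes = " ".join("%02x" % b for b in raw)
    # Character count and byte count differ only for the non-ASCII name.
    print(repr(name), "chars:", len(name), "bytes:", len(raw))
    print("  ", hex_bytes)
```

Any code that measures a string in characters on one side and in bytes on the other would disagree by one for the first name. Whether that happens anywhere in this setup is not established here.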
A similar problem appears with non-ASCII segment names:
Traceback (most recent call last):
File "/home/wmichel/software/returnn/returnn/datasets/sprint.py", line 895, in ExternSprintDataset._reader_thread_proc
line: data_type, args = self._read_next_raw()
locals:
data_type = <local> b'data', len = 4
args = <local> (b'corpus/Axxon.MixedMedia-batch.1/Informational_pershy_na_seli_20191227/Informational_pershy_na_seli_20191227_00157_1435.92-1446.99', array([[-1.105067 , -0.97892445, -0.87034935, ..., 1.0514063 ,
0.63383573, 0.36358422],
[ 0.10235078, 0.53916615, 0.59086025, ..., -0.8045798..., _[0]: {len = 129}
self = <local> <ExternSprintDataset 'dataset_id139660939705984' epoch=1>
self._read_next_raw = <local> <bound method ExternSprintDataset._read_next_raw of <ExternSprintDataset 'dataset_id139660939705984' epoch=1>>
File "/home/wmichel/software/returnn/returnn/datasets/sprint.py", line 856, in ExternSprintDataset._read_next_raw
line: data_type, args = Unpickler(stream, encoding="bytes").load()
locals:
data_type = <not found>
args = <not found>
u = <local> <_pickle.Unpickler object at 0x7f0558aee160>
u.load = <local> <built-in method load of _pickle.Unpickler object at 0x7f0558aee160>
UnpicklingError: invalid load key, '4'.
with segment names containing
$ echo Informational_Рershy_na_selі_20170214 | hexdump -C
00000000 49 6e 66 6f 72 6d 61 74 69 6f 6e 61 6c 5f d0 a0 |Informational_..|
00000010 65 72 73 68 79 5f 6e 61 5f 73 65 6c d1 96 5f 32 |ershy_na_sel.._2|
00000020 30 31 37 30 32 31 34 0a |0170214.|
00000028
$ echo Informational_pershy_na_seli_20191227 | hexdump -C
00000000 49 6e 66 6f 72 6d 61 74 69 6f 6e 61 6c 5f 70 65 |Informational_pe|
00000010 72 73 68 79 5f 6e 61 5f 73 65 6c 69 5f 32 30 31 |rshy_na_seli_201|
00000020 39 31 32 32 37 0a |91227.|
00000026
The lower segment names work, while the upper ones crash.
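To find such names in a corpus before running the dataset, something like the following could be used. It is a small sketch: `non_ascii_chars` is a hypothetical helper and not part of RETURNN.

```python
import unicodedata


def non_ascii_chars(text):
    """Return (index, code point, Unicode name) for every non-ASCII character."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


# Segment name from the first hexdump above (crashes) and the second (works).
crashing = "Informational_\u0420ershy_na_sel\u0456_20170214"
working = "Informational_pershy_na_seli_20191227"

for name in (crashing, working):
    print(name, non_ascii_chars(name))
```

Running it over all segment and speaker names would list the entries that contain Cyrillic look-alikes of Latin letters.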
Okay this seems to be an issue with the pickle protocol and non-ascii content...
It might be related to the comment that is exactly above:
# Cannot use utf8 because Numpy will also encode the data as strings and there we need it as bytes.
What Python version do you use on RETURNN side, and what Python version on RASR side?
For the orth, non-ASCII characters work fine, e.g.:
seq 2143 target 'orth': array([208, 188, 208, 184, 32, 208, 189, 208, 176, 208, 191, 209, 128,
208, 184, 208, 186, 208, 187, 208, 176, 208, 180, 32, 209, 129,
208, 178, 208, 190, 209, 142, 32, 209, 129, 209, 130, 208, 176,
209, 130, 208, 184, 209, 129, 209, 130, 208, 184, 208, 186, 209,
13..., shape=(440,) ('ми наприклад свою статистику повністю бачимо вже в офіційному вигляді десь ее т близько ее березень березень квітень повністю всі показники [noise] ее світова статистика вона трішки відстає бачимо тільки те що можна прочитати в інтернеті [noise]')
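The numbers in that array are the UTF-8 bytes of the transcription. A short check in plain Python, using only the first 13 values printed above (the rest of the array is truncated in the dump):

```python
# First values of the 'orth' target as printed in the dump above.
orth_prefix = [208, 188, 208, 184, 32, 208, 189, 208, 176, 208, 191, 209, 128]

# Interpreting the values as raw bytes and decoding them as UTF-8 gives text.
text = bytes(orth_prefix).decode("utf-8")
print(text)
```

So the orthography is carried as a byte array and decoded afterwards, which is a different path from the segment and speaker names.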
Both use python3.6
Not exactly the same, but I also faced a problem during unpickling with some Vietnamese characters in the transcriptions inside the OggZipDataset. That was with Python 3.8, a recent RETURNN version, and independent of RASR.
[...]
File ".../returnn/returnn/datasets/audio.py", line 132, in OggZipDataset.__init__
line: self._data = self._collect_data()
locals:
self = <local> <OggZipDataset 'dev_ogg' epoch=None>
self._data = <local> !AttributeError: 'OggZipDataset' object has no attribute '_data'
self._collect_data = <local> <bound method OggZipDataset._collect_data of <OggZipDataset 'dev_ogg' epoch=None>>
File ".../returnn/returnn/datasets/audio.py", line 191, in OggZipDataset._collect_data
line: zip_data = self._collect_data_part(zip_index)
locals:
zip_data = <not found>
self = <local> <OggZipDataset 'dev_ogg' epoch=None>
self._collect_data_part = <local> <bound method OggZipDataset._collect_data_part of <OggZipDataset 'dev_ogg' epoch=None>>
zip_index = <local> 0
File ".../returnn/returnn/datasets/audio.py", line 164, in OggZipDataset._collect_data_part
line: data = literal_eval(self._read("%s.txt" % self._names[zip_index], zip_index)) # type: typing.List[typing.Dict[str]]
locals:
data = <not found>
literal_eval = <local> <function literal_eval at 0x7f732c6a61f0>
self = <local> <OggZipDataset 'dev_ogg' epoch=None>
self._read = <local> <bound method OggZipDataset._read of <OggZipDataset 'dev_ogg' epoch=None>>
self._names = <local> ['out.ogg'], _[0]: {len = 7}
zip_index = <local> 0
File ".../returnn/returnn/util/literal_py_to_pickle.py", line 24, in literal_eval
line: raw_pickle = py_to_pickle(s)
locals:
raw_pickle = <not found>
py_to_pickle = <global> <function py_to_pickle at 0x7f732c6a6280>
s = <local> b'[\n{\'text\': \'c\xc3\xa2n n\xe1\xba\xb7ng \xc4\x91\xc6\xb0\xe1\xbb\xa3c \xc3\xa1p d\xe1\xbb\xa5ng l\xc3\xa0 m\xe1\xbb\xa9c c\xc3\xa2n n\xe1\xba\xb7ng cao h\xc6\xa1n c\xe1\xbb\xa7a tr\xe1\xbb\x8dng l\xc6\xb0\xe1\xbb\xa3ng th\xe1\xbb\xb1c v\xc3\xa0 kh\xe1\xbb\x91i l\xc6\xb0\xe1\xbb\xa3ng quy \xc..., len = 1194265
File "...returnn/returnn/util/literal_py_to_pickle.py", line 48, in py_to_pickle
line: assert res == 0, "there was some error"
locals:
res = <local> 1
AssertionError: there was some error
@vieting This looks very much like an independent problem. Can you open a separate issue on that? Also, when it prints "there was some error", there should have been some error which you should see in the log somewhere.
Sure, I'll do that.
@michelwi This is an extremely old version. Can you test with the current version? (Please always do that.)
Just as a note: d0 86 is the UTF-8 byte sequence for the Unicode character U+0406, which is CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I ("І").
Okay this seems to be an issue with the pickle protocol and non-ascii content...
It might be related to the comment that is exactly above:
# Cannot use utf8 because Numpy will also encode the data as strings and there we need it as bytes.
This should not be related. As the comment just before that says: "encoding is for converting Python2 strings to Python3.". This is not relevant here.
But again, as I wrote before, there have been some changes to this in the last year. Before we invest any further time, this should be tested with the latest RETURNN version first.
@michelwi What is the status here?
Sorry for the delay. I just tested it with the current RETURNN HEAD, and the same error persists:
EXCEPTION
Traceback (most recent call last):
File "/nas/models/asr/wmichel/setups/debugging/returnn/returnn/datasets/sprint.py", line 902, in ExternSprintDataset._reader_thread_proc
line: data_type, args = self._read_next_raw()
locals:
data_type = <local> b'data', len = 4
args = <local> (b'corpus/Axxon.MixedMedia-batch.1/Informational_pershy_na_seli_20191227/Informational_pershy_na_seli_20191227_00117_1090.07-1092.72', array([[-1.3290231 , -1.1278104 , -1.1955391 , ..., -2.1946402 ,
-2.2877762 , -2.5685613 ],
[-1.1158917 , -1.0490893 , -1.1935912 , ..., 1.1074646..., _[0]: {len = 129}
self = <local> <ExternSprintDataset 'dataset_id140408180674512' epoch=1>
self._read_next_raw = <local> <bound method ExternSprintDataset._read_next_raw of <ExternSprintDataset 'dataset_id140408180674512' epoch=1>>
File "/nas/models/asr/wmichel/setups/debugging/returnn/returnn/datasets/sprint.py", line 864, in ExternSprintDataset._read_next_raw
line: data_type, args = Unpickler(stream, encoding="bytes").load()
locals:
data_type = <not found>
args = <not found>
Unpickler = <global> <class '_pickle.Unpickler'>
stream = <local> <_io.BytesIO object at 0x7fb353b7f0a0>
encoding = <not found>
load = <not found>
UnpicklingError: invalid load key, '9'.
I have been playing around a bit, trying to create a reproducible minimal setup, and some other errors that came up were
UnpicklingError: could not find MARK
UnpicklingError: unpickling stack underflow
all at the same position in the code.
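All of these messages, including the earlier TypeError, are what the unpickler raises when it is handed bytes that are not a well-formed pickle, for example when reading starts at the wrong offset in a stream. That would fit some framing or length mismatch between the two sides, but this is only a guess and not confirmed here. The sketch below uses hand-crafted byte strings (not data captured from RASR), and `try_unpickle` is a hypothetical helper:

```python
import pickle


def try_unpickle(raw):
    """Return the unpickled object, or the exception that was raised."""
    try:
        return pickle.loads(raw, encoding="bytes")
    except Exception as exc:
        return exc


# A well-formed message, loosely shaped like (data_type, args).
valid = pickle.dumps((b"data", (b"segment-name", 1.0)))

samples = [
    ("valid message", valid),
    ("starts at a digit", b"4abc"),  # '4' is not a pickle opcode
    ("TUPLE opcode without MARK", b"t."),
    ("POP opcode on empty stack", b"0."),
    ("REDUCE with bytes as callable", b"C\x01a)R."),
]
for label, raw in samples:
    result = try_unpickle(raw)
    print("%-30s -> %s: %r" % (label, type(result).__name__, result))
```

If the errors in the real setup vary from run to run like this, dumping the raw bytes of the failing message next to its declared size might show whether the reader and writer disagree about where a message ends.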