
ExternSprintDataset with non-ASCII speaker names

michelwi opened this issue 2 years ago • 14 comments

I have a corpus with segments that have <speaker name="Іrina Gerashchenko"/> attached to them.

When I want to use it in an ExternSprintDataset (e.g. with the dump-dataset tool), the process crashes with:

EXCEPTION
Traceback (most recent call last):
  File "/home/wmichel/software/returnn/returnn/datasets/sprint.py", line 895, in ExternSprintDataset._reader_thread_proc
line: data_type, args = self._read_next_raw()
locals:
      data_type = <local> b'data', len = 4
      args = <local> (b'corpus/Axxon.MixedMedia-batch.1/Address_PromovaPoroshenko_20190606/Address_PromovaPoroshenko_20190606_00097_1139.98-1153.9', array([[-3.4821649 , -2.9676483 , -2.2865193 , ...,  0.3463574 ,
                              0.13466196,  0.18409058],
                            [-0.52613807, -0.6804885 , -1.2733397 , ...,  1.402299  ,
                            ..., _[0]: {len = 122}
      self = <local> <ExternSprintDataset 'dataset_id139711693819408' epoch=1>
      self._read_next_raw = <local> <bound method ExternSprintDataset._read_next_raw of <ExternSprintDataset 'dataset_id139711693819408' epoch=1>>
  File "/home/wmichel/software/returnn/returnn/datasets/sprint.py", line 856, in ExternSprintDataset._read_next_raw
line: data_type, args = Unpickler(stream, encoding="bytes").load()
    locals:
      data_type = <not found>
      args = <not found>
      u = <local> <_pickle.Unpickler object at 0x7f4bea8ed160>
      u.load = <local> <built-in method load of _pickle.Unpickler object at 0x7f4bea8ed160>
TypeError: 'bytes' object is not callable

I use Python 3.6 with RETURNN (07ed4a7d4b2b7aa17a8c1eb6965754d4cf3521a3) and RASR (32b02bdb2c80d9e38c60da40246f461766751b55, the AppTek version plus a cherry-pick of commit 843d1eb5f2137b57dcc6cf4bc9dce44e85d66f27 from GitHub), which is also compiled with Python 3.6.
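
For reference, the dataset is configured roughly like this (a schematic sketch; the executable path and RASR config flags below are placeholders, not my actual setup):

train = {
    "class": "ExternSprintDataset",
    "sprintTrainerExecPath": "/path/to/rasr/nn-trainer",  # placeholder path
    "sprintConfigStr": "--config=config/training.config --*.LOGFILE=nn-trainer.log",  # placeholder flags
}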

$ echo "Іrina Gerashchenko" | hexdump -C
00000000  d0 86 72 69 6e 61 20 47  65 72 61 73 68 63 68 65  |..rina Gerashche|
00000010  6e 6b 6f 0a                                       |nko.|
00000014

$ echo "Irina Gerashchenko" | hexdump -C
00000000  49 72 69 6e 61 20 47 65  72 61 73 68 63 68 65 6e  |Irina Gerashchen|
00000010  6b 6f 0a                                          |ko.|
00000013

The upper one throws the error; the lower one (or removing the speaker info) works fine.
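
One guess (not verified): the Cyrillic "І" makes the UTF-8 byte length of the name differ from its character count, so anything that frames the pickled message by character count instead of byte count would mis-align the stream:

>>> s = "Іrina Gerashchenko"
>>> len(s), len(s.encode("utf-8"))
(18, 19)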

michelwi avatar Mar 23 '22 12:03 michelwi

A similar problem appears with non-ASCII segment names:

Traceback (most recent call last):
  File "/home/wmichel/software/returnn/returnn/datasets/sprint.py", line 895, in ExternSprintDataset._reader_thread_proc
    line: data_type, args = self._read_next_raw()
    locals:
      data_type = <local> b'data', len = 4
      args = <local> (b'corpus/Axxon.MixedMedia-batch.1/Informational_pershy_na_seli_20191227/Informational_pershy_na_seli_20191227_00157_1435.92-1446.99', array([[-1.105067  , -0.97892445, -0.87034935, ...,  1.0514063 ,
                              0.63383573,  0.36358422],
                            [ 0.10235078,  0.53916615,  0.59086025, ..., -0.8045798..., _[0]: {len = 129}
      self = <local> <ExternSprintDataset 'dataset_id139660939705984' epoch=1>
      self._read_next_raw = <local> <bound method ExternSprintDataset._read_next_raw of <ExternSprintDataset 'dataset_id139660939705984' epoch=1>>
  File "/home/wmichel/software/returnn/returnn/datasets/sprint.py", line 856, in ExternSprintDataset._read_next_raw
    line: data_type, args = Unpickler(stream, encoding="bytes").load()
    locals:
      data_type = <not found>
      args = <not found>
      u = <local> <_pickle.Unpickler object at 0x7f0558aee160>
      u.load = <local> <built-in method load of _pickle.Unpickler object at 0x7f0558aee160>
UnpicklingError: invalid load key, '4'.

with segment names like these:

$ echo Informational_Рershy_na_selі_20170214 | hexdump -C
00000000  49 6e 66 6f 72 6d 61 74  69 6f 6e 61 6c 5f d0 a0  |Informational_..|
00000010  65 72 73 68 79 5f 6e 61  5f 73 65 6c d1 96 5f 32  |ershy_na_sel.._2|
00000020  30 31 37 30 32 31 34 0a                           |0170214.|
00000028

$ echo Informational_pershy_na_seli_20191227 | hexdump -C
00000000  49 6e 66 6f 72 6d 61 74  69 6f 6e 61 6c 5f 70 65  |Informational_pe|
00000010  72 73 68 79 5f 6e 61 5f  73 65 6c 69 5f 32 30 31  |rshy_na_seli_201|
00000020  39 31 32 32 37 0a                                 |91227.|
00000026

The lower (ASCII-only) segment names work, while the upper ones crash.

michelwi avatar Mar 23 '22 17:03 michelwi

Okay, this seems to be an issue with the pickle protocol and non-ASCII content...

It might be related to the comment that is exactly above: # Cannot use utf8 because Numpy will also encode the data as strings and there we need it as bytes.

JackTemaki avatar Mar 23 '22 17:03 JackTemaki

What Python version do you use on RETURNN side, and what Python version on RASR side?

albertz avatar Mar 23 '22 18:03 albertz

Okay, this seems to be an issue with the pickle protocol and non-ASCII content...

For the orth, non-ASCII characters work fine,

e.g.

seq 2143 target 'orth': array([208, 188, 208, 184,  32, 208, 189, 208, 176, 208, 191, 209, 128,
       208, 184, 208, 186, 208, 187, 208, 176, 208, 180,  32, 209, 129,
       208, 178, 208, 190, 209, 142,  32, 209, 129, 209, 130, 208, 176,
       209, 130, 208, 184, 209, 129, 209, 130, 208, 184, 208, 186, 209,
       13..., shape=(440,) ('ми наприклад свою статистику повністю бачимо вже в офіційному вигляді десь ее т близько ее березень березень квітень повністю всі показники [noise] ее світова статистика вона трішки відстає бачимо тільки те що можна прочитати в інтернеті [noise]')
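
i.e. the orth arrives as a uint8 array of UTF-8 bytes and is only decoded to text afterwards; for the first few values:

>>> bytes([208, 188, 208, 184, 32, 208, 189, 208, 176]).decode("utf-8")
'ми на'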

What Python version do you use on RETURNN side, and what Python version on RASR side?

Both use Python 3.6.

michelwi avatar Mar 23 '22 18:03 michelwi

Not exactly the same, but I also faced a problem during unpickling with some Vietnamese characters in the transcriptions inside the OggZipDataset. That was with Python 3.8 and a recent RETURNN version, and it is independent of RASR.

[...]
  File ".../returnn/returnn/datasets/audio.py", line 132, in OggZipDataset.__init__
    line: self._data = self._collect_data()
    locals:
      self = <local> <OggZipDataset 'dev_ogg' epoch=None>
      self._data = <local> !AttributeError: 'OggZipDataset' object has no attribute '_data'
      self._collect_data = <local> <bound method OggZipDataset._collect_data of <OggZipDataset 'dev_ogg' epoch=None>>
  File ".../returnn/returnn/datasets/audio.py", line 191, in OggZipDataset._collect_data
    line: zip_data = self._collect_data_part(zip_index)
    locals:
      zip_data = <not found>
      self = <local> <OggZipDataset 'dev_ogg' epoch=None>
      self._collect_data_part = <local> <bound method OggZipDataset._collect_data_part of <OggZipDataset 'dev_ogg' epoch=None>>
      zip_index = <local> 0
  File ".../returnn/returnn/datasets/audio.py", line 164, in OggZipDataset._collect_data_part
    line: data = literal_eval(self._read("%s.txt" % self._names[zip_index], zip_index))  # type: typing.List[typing.Dict[str]]
    locals:
      data = <not found>
      literal_eval = <local> <function literal_eval at 0x7f732c6a61f0>
      self = <local> <OggZipDataset 'dev_ogg' epoch=None>
      self._read = <local> <bound method OggZipDataset._read of <OggZipDataset 'dev_ogg' epoch=None>>
      self._names = <local> ['out.ogg'], _[0]: {len = 7}
      zip_index = <local> 0
  File ".../returnn/returnn/util/literal_py_to_pickle.py", line 24, in literal_eval
    line: raw_pickle = py_to_pickle(s)
    locals:
      raw_pickle = <not found>
      py_to_pickle = <global> <function py_to_pickle at 0x7f732c6a6280>
      s = <local> b'[\n{\'text\': \'c\xc3\xa2n n\xe1\xba\xb7ng \xc4\x91\xc6\xb0\xe1\xbb\xa3c \xc3\xa1p d\xe1\xbb\xa5ng l\xc3\xa0 m\xe1\xbb\xa9c c\xc3\xa2n n\xe1\xba\xb7ng cao h\xc6\xa1n c\xe1\xbb\xa7a tr\xe1\xbb\x8dng l\xc6\xb0\xe1\xbb\xa3ng th\xe1\xbb\xb1c v\xc3\xa0 kh\xe1\xbb\x91i l\xc6\xb0\xe1\xbb\xa3ng quy \xc..., len = 1194265
  File "...returnn/returnn/util/literal_py_to_pickle.py", line 48, in py_to_pickle
    line: assert res == 0, "there was some error"
    locals:
      res = <local> 1
AssertionError: there was some error
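
A minimal reproduction attempt (not verified to fail in exactly the same way; the text is taken from the failing entry above):

>>> from returnn.util.literal_py_to_pickle import literal_eval
>>> literal_eval("[{'text': 'cân nặng được áp dụng'}]".encode("utf-8"))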

vieting avatar Mar 24 '22 08:03 vieting

@vieting This looks very much like an independent problem. Can you open a separate issue on that? Also, when it prints "there was some error", there should have been some error which you should see in the log somewhere.

albertz avatar Mar 24 '22 11:03 albertz

@vieting This looks very much like an independent problem. Can you open a separate issue on that? Also, when it prints "there was some error", there should have been some error which you should see in the log somewhere.

Sure, I'll do that.

vieting avatar Mar 24 '22 11:03 vieting

@michelwi This is an extremely old version. Can you test with the current version? (Please always do that.)

albertz avatar Mar 24 '22 11:03 albertz

Just as a note: d0 86 is the UTF-8 byte sequence for the Unicode character U+0406, which is CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I ("І").
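
This can be checked directly in Python:

>>> "\u0406".encode("utf-8")
b'\xd0\x86'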

albertz avatar Mar 25 '22 21:03 albertz

Okay, this seems to be an issue with the pickle protocol and non-ASCII content...

It might be related to the comment that is exactly above: # Cannot use utf8 because Numpy will also encode the data as strings and there we need it as bytes.

This should not be related. As the comment just before that says: "encoding is for converting Python2 strings to Python3.". This is not relevant here.
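
For example, a quick local sketch (both sides Python 3) shows that non-ASCII strings round-trip fine through Unpickler(..., encoding="bytes"):

>>> import io, pickle
>>> stream = io.BytesIO(pickle.dumps((b"data", "Іrina Gerashchenko")))
>>> pickle.Unpickler(stream, encoding="bytes").load()
(b'data', 'Іrina Gerashchenko')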

But again, as I wrote before, there have been some changes in the last year on this. Before we invest any further time, this should be tested with the latest RETURNN version first.

albertz avatar Mar 25 '22 21:03 albertz

@michelwi What is the status here?

albertz avatar May 14 '22 19:05 albertz

Sorry for the delay. I just tested it with the current RETURNN HEAD, and the same error persists:

EXCEPTION
Traceback (most recent call last):
  File "/nas/models/asr/wmichel/setups/debugging/returnn/returnn/datasets/sprint.py", line 902, in ExternSprintDataset._reader_thread_proc
    line: data_type, args = self._read_next_raw()
    locals:
      data_type = <local> b'data', len = 4
      args = <local> (b'corpus/Axxon.MixedMedia-batch.1/Informational_pershy_na_seli_20191227/Informational_pershy_na_seli_20191227_00117_1090.07-1092.72', array([[-1.3290231 , -1.1278104 , -1.1955391 , ..., -2.1946402 ,
                             -2.2877762 , -2.5685613 ],
                            [-1.1158917 , -1.0490893 , -1.1935912 , ...,  1.1074646..., _[0]: {len = 129}
      self = <local> <ExternSprintDataset 'dataset_id140408180674512' epoch=1>
      self._read_next_raw = <local> <bound method ExternSprintDataset._read_next_raw of <ExternSprintDataset 'dataset_id140408180674512' epoch=1>>
  File "/nas/models/asr/wmichel/setups/debugging/returnn/returnn/datasets/sprint.py", line 864, in ExternSprintDataset._read_next_raw
    line: data_type, args = Unpickler(stream, encoding="bytes").load()
    locals:
      data_type = <not found>
      args = <not found>
      Unpickler = <global> <class '_pickle.Unpickler'>
      stream = <local> <_io.BytesIO object at 0x7fb353b7f0a0>
      encoding = <not found>
      load = <not found>
UnpicklingError: invalid load key, '9'.

michelwi avatar May 17 '22 14:05 michelwi

I have been playing around a bit, trying to create a minimal reproducible setup, and some other errors that came up were:

UnpicklingError: could not find MARK
UnpicklingError: unpickling stack underflow

All at the same position in the code.
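
These look like the typical symptoms of unpickling a mis-framed byte stream, i.e. the reader starting at a wrong byte offset and interpreting arbitrary payload bytes as pickle opcodes (just an illustration, not the actual RASR/RETURNN pipe):

>>> import io, pickle
>>> payload = pickle.dumps((b"data", "Іrina Gerashchenko"))
>>> pickle.Unpickler(io.BytesIO(payload[3:])).load()  # start a few bytes too late, raises UnpicklingError

Depending on the offset, the message is "invalid load key", "could not find MARK", or "unpickling stack underflow".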

michelwi avatar May 17 '22 14:05 michelwi