Can't save a cutset with features in memory

Open RuABraun opened this issue 1 year ago • 5 comments

If I try to save it as is, I get: Object of type bytes is not JSON serializable.
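For illustration, a minimal sketch of how this message arises using only the standard library. The dictionary is a hypothetical stand-in for a manifest whose storage_key holds raw bytes; it is not lhotse's actual serialization code, and the base64 workaround is a generic technique rather than what lhotse does.

```python
import base64
import json

# Hypothetical stand-in for a manifest that keeps its payload in memory as bytes.
manifest = {"storage_type": "in-memory-example", "storage_key": b"\x00\x01binary-payload"}

try:
    json.dumps(manifest)
    error_message = None
except TypeError as exc:
    # json has no default encoding for bytes values, so it raises TypeError.
    error_message = str(exc)

print(error_message)

# Generic workaround (an assumption, not lhotse's approach): encode the bytes as text first.
encoded = dict(manifest, storage_key=base64.b64encode(manifest["storage_key"]).decode("ascii"))
roundtrip = json.loads(json.dumps(encoded))
restored = base64.b64decode(roundtrip["storage_key"])
```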

If I try to use copy_feats() to extract the features, I get:

    new_key = writer.write(self.storage_key, feats)
  File "lib64/python3.7/site-packages/lhotse/features/io.py", line 281, in write
    subdir = self.storage_path_ / key[:3]
  File "/usr/lib64/python3.7/pathlib.py", line 925, in __truediv__
    return self._make_child((key,))
  File "/usr/lib64/python3.7/pathlib.py", line 704, in _make_child
    drv, root, parts = self._parse_args(args)
  File "/usr/lib64/python3.7/pathlib.py", line 666, in _parse_args
    % type(a))
TypeError: argument should be a str object or an os.PathLike object returning str, not <class 'bytes'>

If I fix that, I get other errors, such as:

  File "lib/python3.10/site-packages/lhotse/features/base.py", line 543, in copy_feats
    new_key = writer.write(self.storage_key, feats)
  File "/lib/python3.10/site-packages/lhotse/features/io.py", line 286, in write
    output_features_path = (subdir / key).with_suffix(".llc")
TypeError: unsupported operand type(s) for /: 'PosixPath' and 'bytes'
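Both tracebacks come down to the same thing: pathlib refuses to join a path with a bytes object. A minimal sketch with the standard library only, assuming the key is the in-memory payload (the values here are made up for illustration):

```python
from pathlib import Path

storage_path = Path("feats")
bytes_key = b"\x93in-memory-feature-payload"  # made-up bytes standing in for storage_key

try:
    storage_path / bytes_key[:3]
    path_error = None
except TypeError as exc:
    # The exact wording differs between Python versions, but it is always a TypeError.
    path_error = type(exc).__name__

# With a str key, the same join pattern as in the traceback works.
str_key = "abcdef"
ok_path = (storage_path / str_key[:3] / str_key).with_suffix(".llc")
print(path_error, ok_path)
```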

RuABraun avatar Aug 22 '23 17:08 RuABraun

It looks like you might have some data loaded in memory. Can you share more about the context of your usage? Are you using webdataset/shar or functions such as move_to_memory?

EDIT: Just noticed the issue topic; I’ll try to repro later and get back.

pzelasko avatar Aug 22 '23 21:08 pzelasko

So the purpose was to go back from a webdataset to a cutset.

On second thought, it may not actually be necessary to keep the features around once back in cutset format (for the specific use case I have in mind). Regardless, it seems to me that copy_feats() should work with a few changes (if I understand correctly, it saves the feats to disk and sets storage_path): all that is needed is for something other than storage_key to be used as the key.
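As a rough sketch of that idea: string_key below is a hypothetical helper, not part of lhotse. It assumes the writer only needs some stable string, and that either an ID is available or a hash of the payload is acceptable.

```python
import hashlib


def string_key(storage_key, fallback_id=None):
    """Return a str usable as a file key (hypothetical helper, not lhotse API)."""
    if isinstance(storage_key, str):
        # Already a regular on-disk key: keep it unchanged.
        return storage_key
    if fallback_id is not None:
        # Prefer a human-readable ID (e.g. the cut or features ID) when one is given.
        return str(fallback_id)
    # Otherwise derive a deterministic key from the in-memory payload itself.
    return hashlib.sha1(bytes(storage_key)).hexdigest()


print(string_key("utt1-feats"))
print(string_key(b"\x00raw-bytes", fallback_id="cut-0001"))
print(string_key(b"\x00raw-bytes"))
```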

RuABraun avatar Aug 23 '23 15:08 RuABraun

I think the issue might be that you also have audio data that is in memory and can't be stored in JSONL. You can either do cuts = cuts.drop_recordings() before calling copy_feats, or you can use the new copy_data function that I just added if you want to convert webdataset -> "normal" data (here: https://github.com/lhotse-speech/lhotse/pull/1130).

You can check with the following snippet (assuming you have some small input cut set for testing):

from pathlib import Path
from tempfile import TemporaryDirectory

from lhotse import *
from lhotse.dataset import export_to_webdataset


root = Path(...)
cuts = CutSet.from_file(root / "libri-train-5.jsonl.gz").subset(first=10)

with TemporaryDirectory() as d:
    d = Path(d)
    wds = str(d / "cuts.tar")

    cuts = cuts.compute_and_store_features(Fbank(), d / "orig_feats")
    print(cuts[0])

    export_to_webdataset(cuts, wds)

    cuts_mem = CutSet.from_webdataset([wds])
    print(cuts_mem[0])

    jsl = d / "cuts.jsonl.gz"

    cuts_jsl = cuts_mem.copy_data(d / "datacopy")

    print(cuts_jsl[0])
    print(cuts_jsl[0].load_audio())
    print(cuts_jsl[0].load_features())

    jsl2 = d / "cuts2.jsonl.gz"

    with LilcomChunkyWriter(d / "feats") as w:
        cuts_jsl2 = cuts_mem.drop_recordings().copy_feats(w, jsl2)

    print(cuts_jsl2[0])
    print(cuts_jsl2[0].load_audio())
    print(cuts_jsl2[0].load_features())

pzelasko avatar Aug 24 '23 15:08 pzelasko

Actually, that CutSet had no recording info at all. If I try to drop the features and then save, I get:

Cannot detach features from a DataCut with no Recording 

I now have a cutset with recordings, which I can save by dropping the features. But copy_feats still crashes with this code:

import pickle
import tarfile

from lhotse import CutSet, LilcomFilesWriter
from lhotse.serialization import deserialize_item

# inf: path to the tar archive that holds the pickled cuts
cuts = []
with tarfile.open(inf, mode="r") as stream:
    for tarinfo in stream:
        # Each tar member is a pickled dict describing one cut
        data = pickle.loads(stream.extractfile(tarinfo).read())
        cuts.append(deserialize_item(data))
cutset = CutSet.from_cuts(cuts)
writer = LilcomFilesWriter('feats')
#cutset = cutset.drop_recordings()  <-- makes no difference
cutset = cutset.copy_feats(writer, 'feats2.jsonl.gz')
cutset.to_file('cuts_fbank.jsonl.gz')

This leads to TypeError: unsupported operand type(s) for /: 'PosixPath' and 'bytes'

But copy_data seems to have worked!

RuABraun avatar Sep 05 '23 16:09 RuABraun

In the example I posted above, copy_feats worked, so I can't really replicate your case. The only other thing I can think of right now is that you might have custom fields with Array manifests and in-memory data.
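One way to check for that is sketched below. It is a duck-typed helper, not lhotse API: it assumes, as the tracebacks suggest, that in-memory manifests carry their payload as bytes in storage_key, and that custom fields live in a dict-like custom attribute. The example objects are stand-ins built with SimpleNamespace; recordings are not inspected here.

```python
from types import SimpleNamespace


def find_in_memory_fields(cut):
    """List the fields of a cut-like object whose storage_key is raw bytes (hypothetical helper)."""
    candidates = {"features": getattr(cut, "features", None)}
    candidates.update(getattr(cut, "custom", None) or {})
    hits = []
    for name, manifest in candidates.items():
        # A bytes storage_key is what breaks the path join in the writer.
        if isinstance(getattr(manifest, "storage_key", None), bytes):
            hits.append(name)
    return hits


# Stand-in objects for illustration only.
example_cut = SimpleNamespace(
    features=SimpleNamespace(storage_key=b"\x00feats"),
    custom={
        "alignment": SimpleNamespace(storage_key="path/on/disk"),
        "embedding": SimpleNamespace(storage_key=b"\x01emb"),
    },
)
hits = find_in_memory_fields(example_cut)
print(hits)
```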

pzelasko avatar Sep 05 '23 16:09 pzelasko