lhotse
Can't save a cutset with features in memory
If I try to save it as-is I get `Object of type bytes is not JSON serializable`.
If I try to use `copy_feats()` to extract the features I get:
```
    new_key = writer.write(self.storage_key, feats)
  File "lib64/python3.7/site-packages/lhotse/features/io.py", line 281, in write
    subdir = self.storage_path_ / key[:3]
  File "/usr/lib64/python3.7/pathlib.py", line 925, in __truediv__
    return self._make_child((key,))
  File "/usr/lib64/python3.7/pathlib.py", line 704, in _make_child
    drv, root, parts = self._parse_args(args)
  File "/usr/lib64/python3.7/pathlib.py", line 666, in _parse_args
    % type(a))
TypeError: argument should be a str object or an os.PathLike object returning str, not <class 'bytes'>
```
If I fix that I get other errors like:

```
  File "lib/python3.10/site-packages/lhotse/features/base.py", line 543, in copy_feats
    new_key = writer.write(self.storage_key, feats)
  File "/lib/python3.10/site-packages/lhotse/features/io.py", line 286, in write
    output_features_path = (subdir / key).with_suffix(".llc")
TypeError: unsupported operand type(s) for /: 'PosixPath' and 'bytes'
```
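For context, both tracebacks reduce to the same `pathlib` behavior: path components must be `str` or `os.PathLike`, so joining a directory with a raw-bytes `storage_key` (which is what an in-memory feature manifest carries) fails. A minimal reproduction, independent of lhotse:

```python
from pathlib import Path

# pathlib only accepts str or os.PathLike components; an in-memory
# storage_key is raw bytes, so joining it onto a directory raises
# the same TypeError seen in the tracebacks above.
try:
    Path("feats") / b"\x00\x01"
except TypeError as e:
    print("reproduced:", e)
```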
It looks like you might have some data loaded in memory. Can you share more about the context of your usage? Are you using webdataset/shar, or functions such as `move_to_memory`?
EDIT: Just noticed the issue topic; I’ll try to repro later and get back.
So the purpose was to go back from a webdataset to a cutset.
On second thought, it may not actually be necessary to keep the features around once back in cutset format (for the specific use case I have in mind). Regardless, it seems to me that with a few changes `copy_feats()` should work (if I understand correctly, it saves the feats to disk and sets `storage_path`); all that needs to change is to use something other than `storage_key` as the key.
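A hedged sketch of that idea (the helper below is hypothetical, not lhotse API): before writing, keep string keys as-is and mint a fresh string key whenever the stored key is bytes, so the path join succeeds.

```python
import uuid

# Hypothetical helper: string keys pass through unchanged; in-memory
# (bytes) storage keys are replaced with a fresh 32-char hex string.
def ensure_str_key(storage_key):
    if isinstance(storage_key, str):
        return storage_key
    return uuid.uuid4().hex

print(ensure_str_key("cut-001"))     # unchanged
print(len(ensure_str_key(b"\x00")))  # 32
```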
I think the issue might be that you also have audio data that is in memory and can't be stored in JSONL. You can either do `cuts = cuts.drop_recordings()` before calling `copy_feats`, or you can use the new `copy_data` function that I just added if you want to convert webdataset -> "normal" data (here: https://github.com/lhotse-speech/lhotse/pull/1130).
You can check with the following snippet (assuming you have some small input cut set for testing):
```python
from pathlib import Path
from tempfile import TemporaryDirectory

from lhotse import *
from lhotse.dataset import export_to_webdataset

root = Path(...)
cuts = CutSet.from_file(root / "libri-train-5.jsonl.gz").subset(first=10)

with TemporaryDirectory() as d:
    d = Path(d)
    wds = str(d / "cuts.tar")

    cuts = cuts.compute_and_store_features(Fbank(), d / "orig_feats")
    print(cuts[0])

    export_to_webdataset(cuts, wds)
    cuts_mem = CutSet.from_webdataset([wds])
    print(cuts_mem[0])

    jsl = d / "cuts.jsonl.gz"
    cuts_jsl = cuts_mem.copy_data(d / "datacopy")
    print(cuts_jsl[0])
    print(cuts_jsl[0].load_audio())
    print(cuts_jsl[0].load_features())

    jsl2 = d / "cuts2.jsonl.gz"
    with LilcomChunkyWriter(d / "feats") as w:
        cuts_jsl2 = cuts_mem.drop_recordings().copy_feats(w, jsl2)
    print(cuts_jsl2[0])
    print(cuts_jsl2[0].load_audio())
    print(cuts_jsl2[0].load_features())
```
Actually, in that CutSet I had no info on the recording at all. If I try to drop the features and then save, I get `Cannot detach features from a DataCut with no Recording`.
Now I have a cutset with recordings. This one I can save by dropping the features. But `copy_feats` still crashes with this code:
```python
import pickle
import tarfile

from lhotse import CutSet
from lhotse.features.io import LilcomFilesWriter
from lhotse.serialization import deserialize_item

# `inf` is the path to the webdataset tar file.
stream = tarfile.open(inf, mode="r")
cuts = []
for tarinfo in stream:
    fname = tarinfo.name
    data = stream.extractfile(tarinfo).read()
    data = pickle.loads(data)
    item = deserialize_item(data)
    # breakpoint()
    cuts.append(item)
cutset = CutSet.from_cuts(cuts)

writer = LilcomFilesWriter('feats')
# cutset = cutset.drop_recordings()  # <-- makes no difference
cutset = cutset.copy_feats(writer, 'feats2.jsonl.gz')
cutset.to_file('cuts_fbank.jsonl.gz')
```

It leads to `TypeError: unsupported operand type(s) for /: 'PosixPath' and 'bytes'`.
But `copy_data` seems to have worked!
In the example I posted above, `copy_feats` worked, so I can't really replicate your case. The only other thing I can think of right now is that you might have had custom fields with `Array` manifests and in-memory data.
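If it helps with debugging, a rough diagnostic along these lines could locate which attached manifests carry bytes keys (a sketch assuming the manifests expose a `storage_key` attribute, as `Features`/`Array` manifests do; the cut below is mocked with `SimpleNamespace`, not a real lhotse cut):

```python
from types import SimpleNamespace

# Sketch: report which attached manifests hold an in-memory (bytes)
# storage_key, since those are the ones that break the path join.
def find_bytes_keys(cut):
    hits = []
    feats = getattr(cut, "features", None)
    if isinstance(getattr(feats, "storage_key", None), bytes):
        hits.append("features")
    for name, value in (getattr(cut, "custom", None) or {}).items():
        if isinstance(getattr(value, "storage_key", None), bytes):
            hits.append(f"custom:{name}")
    return hits

# Mocked cut for illustration:
cut = SimpleNamespace(
    features=SimpleNamespace(storage_key=b"\x00"),
    custom={"embedding": SimpleNamespace(storage_key=b"\x01")},
)
print(find_bytes_keys(cut))  # ['features', 'custom:embedding']
```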