
Tune lilcom compression settings

Open pzelasko opened this issue 3 years ago • 15 comments

With current tick_power of -5 we get ~70% reduction in stored features size (using HDF5), but the overall feature size is generally comparable to the size of the original recordings (e.g. 4.8GB of OPUS recordings is 4.3GB of stored features). We never really tried tweaking the compression settings, maybe there is some easy win there.
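For context, a minimal sketch of how a non-default tick_power could be plugged in when precomputing features. This assumes LilcomHdf5Writer exposes a tick_power argument and that storage_type accepts any callable that builds a writer (hence the functools.partial); the manifest path is made up.

```python
from functools import partial

from lhotse import CutSet, Fbank, LilcomHdf5Writer

cuts = CutSet.from_file("cuts.jsonl.gz")  # hypothetical manifest path
cuts = cuts.compute_and_store_features(
    extractor=Fbank(),
    storage_path="feats.h5",
    # Assumes LilcomHdf5Writer takes a tick_power argument and that
    # storage_type may be any callable returning a writer.
    storage_type=partial(LilcomHdf5Writer, tick_power=-3),
)
```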

pzelasko avatar Jul 08 '21 19:07 pzelasko

Preliminary check: it looks like we could gain a lot by making the lilcom tick coarser (i.e. moving tick_power from the current default of -5 towards 0). I stored mini_librispeech dev-clean-2 features on disk with different settings and measured the L1 distance between the compressed and uncompressed features, as well as the disk space occupied when writing with LilcomHdf5Writer.

I think I will run ASR decoding with feats compressed at different levels and measure the WER difference... that should be a very convincing criterion for choosing the right tick size.

| Tick | Rel. abs. change | Disk size |
|------|------------------|-----------|
| 0    | -3.42%           | 32M       |
| -1   | -1.71%           | 38M       |
| -2   | -0.85%           | 45M       |
| -3   | -0.43%           | 52M       |
| -4   | -0.21%           | 59M       |
| -5   | -0.11%           | 80M       |
| -6   | -0.05%           | 80M       |
| -7   | -0.03%           | 80M       |
| -8   | -0.01%           | 96M       |
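A minimal sketch of the kind of measurement above, assuming lilcom's Python interface of compress(array, tick_power=...) returning bytes and decompress(bytes) returning an array:

```python
import lilcom
import numpy as np

# Stand-in for real fbank features; the relative absolute change is computed
# as sum(|restored - original|) / sum(|original|) (assumed interpretation
# of the "Rel. abs. change" column above).
feats = np.random.randn(3000, 80).astype(np.float32)

for tick_power in range(0, -9, -1):
    compressed = lilcom.compress(feats, tick_power=tick_power)
    restored = lilcom.decompress(compressed)
    rel_l1 = np.abs(restored - feats).sum() / np.abs(feats).sum()
    print(f"tick_power={tick_power:>2}  rel. abs. change={rel_l1:.2%}  size={len(compressed)} bytes")
```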

pzelasko avatar Jul 22 '21 12:07 pzelasko

Mm, if the features are not smaller than the Opus recordings we could consider doing feature generation on the fly? Lilcom decompression is not super fast, I think.

danpovey avatar Jul 22 '21 13:07 danpovey

Yeah, this is another direction that I'm concurrently pursuing.

Still, there are scenarios where I/O is costly enough (and storage cheap enough) that it's preferable to pre-compute the features, so I'm going to check whether we can save a bit of space here.

pzelasko avatar Jul 22 '21 13:07 pzelasko

As to lilcom decompression being slower -- I didn't measure it exactly, but reading precomputed features is way faster than on-the-fly reading of audio chunks from OPUS. With HDF5 + lilcom, a 4-worker-per-GPU DataLoader for GigaSpeech didn't even break a sweat; with on-the-fly extraction, there seems to be a big overhead in reading the data from disk (likely the sequential launching of ffmpeg subprocesses that each seek to a chunk of data). With a batch size of max_duration=550s, it takes somewhere between 20 and 27 seconds to read all of the data sequentially, and about half a second to compute the features (also sequentially, on CPU).
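For reference, a rough sketch of the kind of sequential-read comparison described above (the manifest path and batch size are made up):

```python
import time

from lhotse import CutSet

cuts = CutSet.from_file("cuts.jsonl.gz")  # hypothetical manifest
batch = list(cuts)[:64]                   # stand-in for one dynamically-sized batch

t0 = time.time()
for cut in batch:
    cut.load_audio()      # on-the-fly: seeks and decodes a chunk of OPUS via ffmpeg
print(f"audio reads:   {time.time() - t0:.1f}s")

t0 = time.time()
for cut in batch:
    cut.load_features()   # precomputed: lilcom-decompresses a slice from HDF5
print(f"feature reads: {time.time() - t0:.1f}s")
```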

Regarding solutions, there are a few:

  1. Launch a crazy number of workers to compensate (I don't know yet at what point it becomes efficient; 16 workers per GPU when using 4 GPUs is not enough, there is still a 250-300ms average delay to grab a new batch);
  2. Ideally we would parallelize the audio reads, but it's not trivial -- I tried to use a ThreadPoolExecutor (see the sketch after this list), but due to the GIL the max speedup is 2x; a ProcessPoolExecutor won't work because the DataLoader spawns workers as daemon subprocesses, and daemon subprocesses can't spawn their own subprocesses. If we use num_workers=0 for the DataLoader, we can parallelize the reading easily (a pool of 16 subprocesses reads everything in ~2 seconds), but that's still 2.5-3 seconds of overhead on each batch. We would need to execute that in another subprocess to add prefetching, so effectively it amounts to writing our own DataLoader.
  3. There is a new version of the data API being created in PyTorch, called DataPipes; maybe the new design will make it simpler to address some of these issues (@mthrok any thoughts?)
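A minimal sketch of option 2 from the list above -- parallelizing the per-cut audio reads with a thread pool (GIL-limited, per the observation above):

```python
from concurrent.futures import ThreadPoolExecutor

def load_batch_audio(cuts, num_threads=16):
    # Read the audio of all cuts in a batch concurrently; with threads the
    # speedup tops out quickly because decoding holds the GIL.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(lambda cut: cut.load_audio(), cuts))
```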

Note that the reason we have this issue is that we are using dynamic batch sizes: we only know how large the batch is going to be after we've collected enough examples to exceed some threshold on their cumulative duration. The collation (concatenating cuts, mixing in noise, etc.) is actually done lazily, before performing any I/O, just based on the metadata -- and only after that do we read + mix everything. So ideally, what should happen in the "background" worker is some sort of concurrent I/O, followed by collation of audio samples (with padding/concatenation), applying audio transforms, extracting features, and applying feature transforms.
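A simplified sketch of that duration-based dynamic batching (cut.duration comes from the metadata, so no I/O is needed to form the batch):

```python
def dynamic_batches(cuts, max_duration=550.0):
    # Keep collecting cuts until their cumulative duration crosses the
    # threshold; the batch size is only known once that happens.
    batch, total = [], 0.0
    for cut in cuts:
        batch.append(cut)
        total += cut.duration
        if total >= max_duration:
            yield batch
            batch, total = [], 0.0
    if batch:
        yield batch
```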

Bright sides:

  • the whole issue of on-the-fly efficiency did not exist e.g. for LibriSpeech -- the current setup is fast enough to do on-the-fly reading + feature extraction for LibriSpeech
  • moving the feature computation to GPU in the training loop / model is straightforward with either KaldiFbank or kaldifeat (see the sketch after this list); but it's also not the bottleneck right now
  • no memory blow-ups with the new chunk-reading mechanism for OPUS I added here #339 (ffmpeg also seems very fast when I launch it for single files, even 8h long, in the CLI -- I suspect the overhead is more in launching Python subprocesses?)
  • features can be efficiently precomputed for chunks of audio (e.g. isolated utterances or fixed-size cuts); if there is enough storage, it's not an issue to use precomputed feats at the moment (although it's still slow / blows up memory if we do it in parallel on full 8h recordings -- I haven't addressed that so far, but I'd rather get on-the-fly working efficiently)
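For the GPU feature computation mentioned in the list above, a hedged sketch assuming kaldifeat's FbankOptions/Fbank interface (exact option names may differ):

```python
import torch
import kaldifeat

# Assumes kaldifeat exposes FbankOptions with a .device field and a callable
# Fbank object.
opts = kaldifeat.FbankOptions()
opts.device = torch.device("cuda", 0)
opts.mel_opts.num_bins = 80

fbank = kaldifeat.Fbank(opts)

def features_on_gpu(waveforms):
    # waveforms: list of 1-D float32 tensors already moved to the GPU;
    # returns a list of (num_frames, num_bins) feature tensors.
    return fbank(waveforms)
```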

pzelasko avatar Jul 22 '21 22:07 pzelasko

... I might have a solution in #343, but it's not tested on GigaSpeech yet.

pzelasko avatar Jul 23 '21 03:07 pzelasko

@VitalyFedyunin Do the DataLoader parallelism updates seem applicable here?

dongreenberg avatar Jul 23 '21 04:07 dongreenberg

IMHO, this is a great example of where to use modular DataPipes. You could have one sequence of DataPipes read the metadata and connect it (with some routers) to sequences of DataPipes that perform I/O using multithreading/multiprocessing. But since you are reading data based on metadata, it seems that the data reading eventually becomes blocking even with concurrent I/O.

ejguan avatar Jul 23 '21 13:07 ejguan

Sounds interesting! I'm not sure I got the point about blocking-style data reading due to metadata -- we can (fairly easily, I guess) split the metadata manifests into pieces (manifest.1.jsonl.gz, manifest.2.jsonl.gz, and so on) and read each piece in a separate DataPipe instance. Does that make more sense with the new design?
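A rough sketch of that sharding idea, assuming torch.utils.data.IterDataPipe is available (it behaves like an IterableDataset here) and hypothetical manifest piece names:

```python
import gzip
import json

from torch.utils.data import IterDataPipe

class JsonlManifestPipe(IterDataPipe):
    """Yields metadata dicts from one manifest piece (e.g. manifest.1.jsonl.gz)."""

    def __init__(self, path: str):
        self.path = path

    def __iter__(self):
        with gzip.open(self.path, "rt") as f:
            for line in f:
                yield json.loads(line)

# One DataPipe instance per manifest shard.
shards = [JsonlManifestPipe(f"manifest.{i}.jsonl.gz") for i in range(1, 5)]
```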

pzelasko avatar Jul 23 '21 14:07 pzelasko

Correct me if I am wrong: the metadata DataPipe can be treated as a Sampler that provides indices or file names. Then, the data-reading DataPipe would take them as inputs and start I/O. All of these steps are non-blocking. But, since the data yielded from the data-reading DataPipe is unordered due to concurrent I/O and multiprocessing, do you expect to yield data in the order instructed by the metadata? For example, say the metadata requests to read [file1, file2, file3], [file4, file5], .... If it takes a really long time to read file2, are you going to yield data from [file1, file3, file4] rather than wait for file2?

Anyway, even if this is the case, I still think DataPipes could provide better syntax for implementing your data loading pipeline, as they are more flexible about how the pipeline is composed and where you want multiprocessing.

ejguan avatar Jul 23 '21 15:07 ejguan

Cool, I understand now. You got our workflow right -- I think the only difference is that our metadata pipeline would read a jsonl file (either pre-load it entirely, or dynamically item-by-item) and instead of indices/filenames, it would output metadata dataclasses (or dicts/json-strings, if custom objects are not allowed).

I think you are right about the blocking bit -- we'd have to wait for file2 in your example, since at the point of performing I/O we would have already determined how the batch is formed (our examples are batched in a non-standard way -- we often append the examples together and mix them with noise/music audio, so the actual batch size is smaller than the number of examples in a batch).

For future reference, the workflow seems to be something like:

[workflow diagram attached in the original issue]

I don't understand all the details of DataPipes yet, but it seems like it's possible we could benefit from them -- also for the sampling part, as I think our custom samplers are becoming slightly too complex at this point.

pzelasko avatar Jul 23 '21 16:07 pzelasko

Also, some examples of what the batches look like in our custom setup can be seen here: https://github.com/lhotse-speech/lhotse/pull/234

pzelasko avatar Jul 23 '21 16:07 pzelasko

I will try to write an example for you of how this could look, based on my understanding of the workflow.

ejguan avatar Jul 23 '21 18:07 ejguan

Here is the example: https://gist.github.com/ejguan/e9a2ac94c276babae76f7dbd2a251180. I'd really appreciate any feedback or questions.

ejguan avatar Jul 26 '21 15:07 ejguan

Thanks @ejguan! This looks pretty cool! I think I understand the whole flow.

The GroupBy operator and matching by batch index is a cool idea -- but how does GroupBy "know" it has gathered all the possible elements? I.e. how does it know not to wait indefinitely?

Also, which process/thread is the core datapipe flow executed in? I assume that in your example the behaviour is like with DataLoader(num_workers=0), except for the I/O part, which is explicitly distributed. Is it possible for the whole pipeline to also be run concurrently with the training loop?

Is there some nightly pytorch version that I could use to try and build a draft of an actual processing pipeline with this example?

pzelasko avatar Jul 27 '21 16:07 pzelasko

@pzelasko

but how does GroupBy "know" it has gathered all the possible elements? I.e. how does it know not to wait indefinitely?

That's actually great feedback. I was thinking you could implement a custom GroupBy. But if this is a core function in the audio data-loading process, we should definitely implement a groupby for this use case.
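For illustration, a hypothetical groupby along those lines -- if each element carries its batch index and the expected batch size, the operator knows when a group is complete and never has to wait indefinitely:

```python
from collections import defaultdict

def group_by_batch(stream):
    # stream yields (batch_idx, batch_size, item) tuples produced upstream.
    buffers = defaultdict(list)
    for batch_idx, batch_size, item in stream:
        buffers[batch_idx].append(item)
        if len(buffers[batch_idx]) == batch_size:
            yield buffers.pop(batch_idx)
```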

Is it possible for the whole pipeline to also be run concurrently with the training loop?

Yes.

Is there some nightly pytorch version that I could use to try and build a draft of an actual processing pipeline with this example?

We do have several basic functional DataPipes in the core library (https://github.com/pytorch/pytorch/tree/master/torch/utils/data/datapipes/iter). We are still gathering requests from different domains about what kind of modular DataPipes they need, based on real-world data pipelines. Could you go over them and see whether any functional DataPipe is missing for revamping your Dataset functionality?

I will ask the team about how to incorporate the private repo.

ejguan avatar Jul 27 '21 21:07 ejguan