
Arbitrary Data Format for HDFDataset

JackTemaki opened this issue 4 years ago · 15 comments

It seems to me that the HDFDataset only supports data streams with a single time axis, and not data with no or multiple time axes. I know that I can still add data without a time axis by just using a sequence length of one and then removing the time axis in a layer in the network.

Are there any possibilities right now to load data with 2 time axes (e.g. images), or with 1 time axis and 1 fixed spatial dimension, into RETURNN? And if not, are there any plans yet to change that?

JackTemaki avatar Aug 11 '20 07:08 JackTemaki

Yes, I have done that a couple of times, and there are also some demos showing how to do it. You can simply flatten multiple dynamic axes into a single axis. When you dump to HDF (e.g. via HDFDumpLayer), this is done automatically, and it also saves the information needed to unflatten it later. Then there is the UnflattenNdLayer, which unflattens the input. Search the tests for unflatten_nd to see some demos.
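
For illustration, a rough config sketch of the unflatten step (the option names and the "sizes" data key follow my reading of the unflatten_nd tests; treat them as assumptions and check the tests for the exact form):

```python
# RETURNN network config sketch (assumed data keys: "data" holds the
# flattened sequence, "sizes" holds the per-sequence lengths that
# HDFDumpLayer stores alongside it).
network = {
    # Unflatten the single flattened time axis back into two dynamic axes.
    "unflattened": {
        "class": "unflatten_nd",
        "from": "data",
        "sizes": "data:sizes",
        "num_axes": 2,
    },
    "output": {"class": "copy", "from": "unflattened"},
}
```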

For fixed spatial dimensions, we do that very often, in a similar way. In that case, we merge it with the feature dim, e.g. when Sprint adds a window. Then you can simply use SplitDimsLayer to split it out again.
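
A minimal sketch of that case, assuming the window was merged into the feature dim; window_size is a placeholder:

```python
window_size = 9  # placeholder: size of the window merged into the feature dim

network = {
    # Split the feature axis of size window_size * feat_dim back into
    # a fixed spatial axis (window_size) and the remaining feature axis.
    "split": {
        "class": "split_dims",
        "from": "data",
        "axis": "F",
        "dims": (window_size, -1),
    },
    "output": {"class": "copy", "from": "split"},
}
```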

albertz avatar Aug 11 '20 08:08 albertz

Okay. I would still leave this issue open, as it would be nice for the datasets themselves to support more data formats in the future, without having to rely on extra layers.

JackTemaki avatar Aug 12 '20 15:08 JackTemaki

I agree, we should simplify it for the user as far as possible. HDFDumpLayer and HDFDataset should complement each other in this case. (We should be a bit careful to not break old setups, though, when we change the behavior now.)

Another use case which came up a couple of times is to store a search beam in the HDF, e.g. N target hypotheses for one sequence. E.g. this StackOverflow question (@npwynands @michelwi).

Also see #311, the example with _align_dump.

albertz avatar Aug 15 '20 12:08 albertz

For reference, here is some discussion relevant for this.

albertz avatar Feb 05 '21 22:02 albertz

A note: I think our Dataset API itself is not really flexible enough.

The Dataset.get_data_shape function currently implies that there is always exactly one dynamic time axis in the beginning. So this is not only about HDFDataset but also about Dataset itself.

To make this really more generic, and to support any number of dynamic axes in the dataset, we would also need something equivalent to DimensionTag (but maybe simpler), to identify which of the dynamic time axes are the same and which are different. (Note that this is tricky to get right for all our existing datasets, because we never store this information explicitly. It is always implicitly known. You only notice in framewise training that you get strange errors when it doesn't fit...)
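
To illustrate the point, a small sketch of how the current API looks from the outside (calls from memory; the file name is a placeholder):

```python
from returnn.datasets.basic import init_dataset

dataset = init_dataset({"class": "HDFDataset", "files": ["my_data.hdf"]})
dataset.init_seq_order(epoch=1)
dataset.load_seqs(0, 1)

# The API implies exactly one dynamic (time) axis per data key:
# get_data_shape() returns the shape *excluding* that leading time axis,
# and get_seq_length() gives a single length per key, with no way to
# express a second dynamic axis or to mark matching axes across keys.
print(dataset.get_data_shape("data"))     # e.g. [feature_dim]
print(dataset.get_seq_length(0)["data"])  # one scalar length T
```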

albertz avatar Mar 02 '21 11:03 albertz

I want to revive this issue, because we again ran into some trouble with the default HDFDataset (writing sparse data with the SimpleHDFWriter). After reviewing some of the code, I think it would be best to have a closer look at how the NextGenHDFDataset handles things, as it allows for a much cleaner way to create HDF files. While it certainly has some issues (e.g. the current parser naming), I think the concept of having such parsers is nice, as you can have custom parsers in your config. This would allow, for example, storing data with arbitrary compression algorithms inside the HDF and having custom decompression logic.

Also, it has a cleaner and more consistent handling of multiple streams, compared to the unintuitive data vs targets split in the normal HDF.

One issue could be speed, as the NextGenHDFDataset stores every sequence as its own HDF dataset, which might not be good in terms of performance, but I will run some tests on that.

I definitely do not want to extend/fix the code around the regular HDFDataset, as the codebase for the writer is already unnecessarily large and complex. Maybe even a clean rewrite of all the HDF logic would be best.

JackTemaki avatar Apr 13 '23 08:04 JackTemaki

Apart from the data / targets split (which is only a minor annoyance), I don't really see many problems with our current format.

But what you describe just sounds more like a minor bug in SimpleHDFWriter? Certainly writing sparse data is very much supported.

Also, you say that our code is unnecessarily large and complex. Why don't we fix that? Inventing things over and over, writing new code again and again, only for the new code to become complex as well, is not really a good solution.

Instead of inventing yet another own custom format, and having a dataset implementation for that, I would suggest that we maybe use some of the existing very common formats. E.g. what HuggingFace uses for almost all datasets. It is based on Apache Arrow. See here.

In your particular case, I would suggest to just fix the problem in SimpleHDFWriter. This sounds just like a small bug which is probably easy to fix.
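
For reference, roughly how writing sparse data with SimpleHDFWriter is supposed to look (argument names follow my reading of the code; treat the exact signature as an assumption and check against master):

```python
import numpy as np
from returnn.datasets.hdf import SimpleHDFWriter

num_classes = 100  # placeholder sparse dim
writer = SimpleHDFWriter("sparse_demo.hdf", dim=num_classes, ndim=1)

# One sequence of label indices, shape (batch=1, time).
labels = np.array([[3, 7, 7, 42, 0]], dtype="int32")
writer.insert_batch(labels, seq_len=[labels.shape[1]], seq_tag=["seq-0"])
writer.close()
```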

albertz avatar Apr 13 '23 08:04 albertz

Why don't we fix that?

Because our policy of keeping full backwards compatibility prohibits that in many cases, and the personal capacity is not there. You know I am always in favor of cleaning up. But let's not discuss the problem of refactoring here; that touches some very fundamental issues which are not suited to be discussed in this issue.

Inventing things over and over
Instead of inventing yet another own custom format

This is not what I planned here; the code already exists within RETURNN. The question is whether it is a good idea to use it and whether it needs some updates.

Instead of inventing yet another own custom format, and having a dataset implementation for that, I would suggest [...] existing very common formats [...] E.g. Apache Arrow

Now you are contradicting yourself. HDF is not a custom format of our own, and integrating Apache Arrow would also mean new code and basically a rewrite of all the stuff we have (also in the recipes).

This sounds just like a small bug which is probably easy to fix.

Maybe yes, but I do not like SimpleHDFWriter in the first place. It was never intended to be used outside of RETURNN, and I still think it was a bad decision on my part to base some recipe jobs on it.

JackTemaki avatar Apr 13 '23 08:04 JackTemaki

Because our policy of keeping full backwards compatibility prohibits [cleaning up] in many cases ...

Are you talking about HDFDataset or about SimpleHDFWriter? I thought you were just talking about SimpleHDFWriter? And that code is not really so complex or complicated.

But if you really find that too complicated, and you think we should not break code which depends on it (this is arguable, as it is more like an internal class), we could also just have a new EvenSimplerHDFWriter and could still stick to the format, so there is no need to reimplement HDFDataset or to invent a new format. Or add another helper function on SimpleHDFWriter. I actually don't really know or understand what you find difficult about it.

We should maybe be a bit more specific on what the actual problem is now in your case. I don't really think that there is such a big problem here. It's probably a very easy fix.

This is not what I planned here; the code already exists within RETURNN. The question is whether it is a good idea to use it and whether it needs some updates.

My understanding was that "some updates" basically means "inventing things over and over" here, as those "updates" probably just reflect what we already did in HDFDataset or elsewhere. I'm not sure how many changes or updates are needed.

Or even more, from your comment regarding performance, it sounds like this might need a fundamental separate new dataset implementation, again with a new format, so again "inventing things over and over".

HDF is not an own custom format

HDF is just a container, like a ZIP file, or also like Apache Arrow (I think; at least that is my understanding). The way you store things inside an HDF file can be very different, so you certainly have different formats there. That's the difference between HDFDataset and NextGenHDFDataset: they have different formats, both using HDF.

Also I did not mean to use Apache Arrow directly. My point was to use an existing format, which is defined by the HuggingFace datasets, which is based on Apache Arrow, just like HDFDataset is based on HDF.

We anyway wanted to have a RETURNN dataset wrapper for HuggingFace datasets (#1257) (which is trivial; there already exists an implementation by @dthulke). So why bother with updating NextGenHDFDataset or with inventing yet another HDF-based format, instead of just using a very widely-used dataset format?

The question is about the amount of maintenance work versus one-time work. We probably will maintain HDFDataset for some time anyway. NextGenHDFDataset is basically unmaintained. So your suggestion to update NextGenHDFDataset would mean that the total maintenance work increases, as we would then need to maintain both HDFDataset and NextGenHDFDataset. Using HuggingFace datasets, on the other hand, would not really be any maintenance work on our side, as that is all handled by HuggingFace. Or just invest a little one-time work into fixing the problem in SimpleHDFWriter.

SimpleHDFWriter is also used by our HDFDumpLayer, which is actually used quite a lot. So it anyway makes sense to fix that.

albertz avatar Apr 13 '23 11:04 albertz

And that code is not really so complex or complicated.

I disagree on that one; it is more complex than needed.

We anyway wanted to have a RETURNN dataset wrapper for HuggingFace datasets

So you prefer that we switch all code to HuggingFace datasets, okay, but this means we then have three options. Also please take into account how people will actually adapt to this: all personal tools related to HDFs would become obsolete.

But also, after reading into HuggingFace datasets, it does not seem to be what we really need here. To me it seems to be rather a convenience API to load datasets of whatever format (in the simplest case just folders) and have them accessible in a consistent way. What we need is a fully standardized, high-performance data storage file format, and here I do not see what speaks against HDF.

or with inventing yet another HDF-based format

Again, no one said that. But it would be nice to have customization options for some details of that format via the config.

JackTemaki avatar Apr 13 '23 11:04 JackTemaki

I prefer using an existing widely-used dataset format (the HuggingFace dataset format was just a suggestion or example for that) over NextGenHDFDataset (because it is unmaintained and maybe partly incomplete) or over inventing yet another custom format of our own.

I do not say that I prefer to use HuggingFace datasets over our HDFDataset. At least not for now. We first need some more experience with that.

I also do not say that we necessarily should use the HuggingFace dataset format. I just say I prefer a widely-used dataset format. That's just the only example I currently know.

Btw, what you read is probably just about the HuggingFace dataset API/Hub (everything you find here), but I mean specifically the HuggingFace dataset format based on Apache Arrow (or Apache Parquet?), which they use for many of the datasets. In most cases, they provide some scripts which load the dataset, and that gets converted into their Apache Arrow based format, which is fully standardized and high performance. There is save_to_disk/load_from_disk (Apache Arrow based) and to_parquet/from_parquet (Apache Parquet based), as far as I remember (I had some discussion on Librispeech here: https://github.com/huggingface/datasets/issues/4185). But I don't really know much about the details of their format (or formats?).

The actual HuggingFace dataset format does also not matter too much, as they provide many ways how you can convert any data into their format. So you would probably use one of the intended ways how to do that. E.g. like almost all of their datasets are implemented. See for example Librispeech.
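
For concreteness, a minimal sketch of that workflow with the datasets library (the field names are placeholders):

```python
from datasets import Dataset, load_from_disk

# Build a dataset from plain Python data (this could come from any converter).
ds = Dataset.from_dict({
    "seq_tag": ["seq-0", "seq-1"],
    "classes": [[3, 7, 42], [1, 2]],
})

ds.save_to_disk("my_dataset_dir")          # Apache Arrow based directory
reloaded = load_from_disk("my_dataset_dir")

ds.to_parquet("my_dataset.parquet")        # single Apache Parquet file
```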

albertz avatar Apr 13 '23 12:04 albertz

The actual HuggingFace dataset format does also not matter too much, as they provide many ways how you can convert any data into their format. So you would probably use one of the intended ways how to do that. E.g. like almost all of their datasets are implemented. See for example Librispeech.

This is not what we need the HDFDataset for. This is needed in our pipeline for converting arbitrary data, e.g. from RASR caches. Of course we could write a custom loader for the HuggingFace dataset to load RASR caches. Or we could write a loader for HDF files with the HuggingFace dataset. But we are talking here about binary dataset formats, so HDF, npy/npz, protobuf, etc.

And HDF is a standard, we have experience with it, and we have existing scripts for it. We want to transfer single files, so multi-file datasets are not good. So now it is only about the internal structuring, which should be fully customizable. It should be possible to use HDFs with as few assumptions as possible.

So given your statement, our options are to keep using the HDFDataset, or to spend some work on integrating HuggingFace, with so far no experience with or even knowledge about their binary formats, which potentially takes weeks to finish if no one gives it priority. Neither is an option I want to work with.

I guess for short-term convenience we will use the NextGenHDFDataset as is, and revisit this topic completely when a more general direction is clear.

JackTemaki avatar Apr 13 '23 16:04 JackTemaki

I'm not really sure I understand what you mean. We would just have a generic HuggingFaceDataset RETURNN dataset. This is https://github.com/rwth-i6/returnn/issues/1257. And this is very simple. Actually, we already have it (via @dthulke), it's just not pushed to master yet. This is for reading HuggingFace datasets. This includes both their binary formats (Apache Arrow, Apache Parquet, whatever else they have, maybe even HDF) and also any other of their datasets. It does not take weeks. We already have that. It's very simple. Maybe speak with @dthulke on how simple this is. Also, we probably will anyway have this, totally independent of the issue here, as it is very convenient to use all the existing HuggingFace datasets from their hub.

Apache Arrow and Apache Parquet are just as standard as HDF. ZIP files are also pretty widely used. The internal structure in all cases is fully customizable, so it's important to have a well-defined format inside these container formats, like the HuggingFace dataset format, or our HDFDataset.

HDF can be a single file or multiple files. I think for Apache Arrow or Apache Parquet, this is just the same. It's probably just a matter of configuration. (In HF datasets, I think this is an option of save_to_disk.)

But anyway, I'm not stopping you from using and extending NextGenHDFDataset. I just think that this will overall lead us to more maintenance work than any of the other options.

albertz avatar Apr 13 '23 18:04 albertz

This is for reading HuggingFace datasets

This is not the problem here. I do not doubt that the integration is simple, but this does not solve the problem that we need a file-based transfer format.

Please see the following links:
https://huggingface.co/docs/datasets/loading#local-and-remote-files
https://huggingface.co/docs/datasets/about_arrow#memorymapping
https://github.com/huggingface/datasets/issues/3113

All of this suggests that HuggingFace datasets do NOT solve our problem. Arrow is only used for local caching and is not even listed as an input type.

JackTemaki avatar Apr 13 '23 19:04 JackTemaki

I still don't exactly understand what problem you are actually talking about (apart from some small bug in SimpleHDFWriter). You ask how to convert any of our data into a HuggingFace dataset? But that's also what I already answered: you would simply use one of the provided tools, or you would write a HuggingFace dataset wrapper around your data and then use functions like save_to_disk or to_parquet. We could even write a HuggingFace dataset wrapper which wraps any RETURNN dataset. In any case, getting data into the HuggingFace dataset format should be just as easy.

So then you have your file-based transfer format. You just transfer whatever save_to_disk or to_parquet (or maybe other tools or functions) gives you. Where is the problem?
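
As a hedged sketch of such a wrapper, iterating any RETURNN dataset into Dataset.from_generator (make_returnn_dataset and the data keys are placeholders):

```python
from datasets import Dataset

def returnn_seqs(returnn_dataset):
    # Iterate the RETURNN dataset sequence by sequence.
    returnn_dataset.init_seq_order(epoch=1)
    seq_idx = 0
    while returnn_dataset.is_less_than_num_seqs(seq_idx):
        returnn_dataset.load_seqs(seq_idx, seq_idx + 1)
        yield {
            "seq_tag": returnn_dataset.get_tag(seq_idx),
            "data": returnn_dataset.get_data(seq_idx, "data").tolist(),
        }
        seq_idx += 1

returnn_dataset = make_returnn_dataset()  # placeholder: construct any RETURNN dataset here
hf_ds = Dataset.from_generator(returnn_seqs, gen_kwargs={"returnn_dataset": returnn_dataset})
hf_ds.save_to_disk("transfer_dir")   # or: hf_ds.to_parquet("transfer.parquet")
```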

albertz avatar Apr 13 '23 19:04 albertz