Tables.jl icon indicating copy to clipboard operation
Tables.jl copied to clipboard

Start an idea of what an "in memory" requirement would look like

Open quinnj opened this issue 3 years ago • 51 comments

This has been requested/discussed a number of times; here's one idea of what this could look like. Basically the same as Tables.rows, but Tables.indexablerows would require an "indexable" object of rows to be returned instead of just an iterator. Indexable is a little vague; to be most useful, we should probably require the return object to be AbstractVector since we get lots of fancy indexing/useful behavior that way. The bare minimum indexing interface is just getindex, firstindex, and lastindex, but it seems like people would then just be wanting to do x[[i, j, k]] like operations and have to implement their own. So I'm inclined to make the requirement that you have to return an AbstractVector of rows.

quinnj avatar Mar 12 '22 07:03 quinnj

Couple of additional thoughts:

  • I've changed RowIterator in this PR to subtype AbstractVector{ColumnsRow{T}}, which means that all column tables would "just work" automatically for Tables.indexablerows as is defined in this PR
  • Currently, if the result of Tables.rows isn't an AbstractVector, then we throw an error in the fallback definition. I wonder if it would be more useful to just call Tables.rowtable instead? Like, I'm curious what potential users of Tables.indexablerows would do if the input ends up not being indexable; do they want to surface that error back to the user so throwing is fine? Or would they rather just "coerce" the input to an indexable rows object via Tables.rowtable.

quinnj avatar Mar 12 '22 07:03 quinnj

Collecting a rowtable from an iterator can be a costly operation and I'd want to know when I'm doing it, so I'd rather get an error.

jariji avatar Mar 12 '22 07:03 jariji

As commented on Discourse I think we should require Tables.indexablerows to support AbstractVector API but it does not have to be AbstractVector. I would say in the docs that it is strongly recommended but not required (like with Tables.istable - it is recommended but not required).

What I mean is that if you subtype AbstractVector then it is easy to create such a type (there is a short and well defined list of methods you need to define). If you decide not to subtype AbstractVector then you need to implement many methods from Base and I think most of the package maintainers will have a very hard time to do it properly and keep consistency with Base Julia when it evolves.

I would also require firstindex to return 1 (i.e. I think it is safer to require 1-based indexing)

However, if @juliohm and @ablaom will find it acceptable to just require AbstractVector I think it would simplify things.

CC @nalimilan

bkamins avatar Mar 12 '22 10:03 bkamins

Thanks for working on this @quinnj! I agree with @bkamins that it doesn't seem necessary/useful to require returned objects to subtype AbstractVector, as you're unlikely to dispatch on that: implementing the interface is enough. This is more in line with the fact that the result of Tables.columns is not required to subtype AbstractColumns. And in the Discourse discussion, having to subtype AbstractVector was mentioned as problematic by @juliohm.

Currently, if the result of Tables.rows isn't an AbstractVector, then we throw an error in the fallback definition. I wonder if it would be more useful to just call Tables.rowtable instead? Like, I'm curious what potential users of Tables.indexablerows would do if the input ends up not being indexable; do they want to surface that error back to the user so throwing is fine? Or would they rather just "coerce" the input to an indexable rows object via Tables.rowtable.

Given that currently the Tables.columns fallback already makes a copy if necessary to return a suitable object, I'd say that for consistency indexablerows should do the same. But maybe a function to check whether a table is known to support indexing would also be useful, similar to rowaccess/columnaccess? Then if you don't want to make a copy you could check this trait first. It would be worth mentioning in the docstring whether types are expected to make a copy of the data if needed or whether they'd better error in that case, as this is indeed a hard decision to make, in particular for table types pointing to potentially large out of memory datasets.

nalimilan avatar Mar 12 '22 22:03 nalimilan

Thank you all for putting this proposal together. I would like to double check with you if I am understanding things correctly.

Consider the following abstract Meshes.Data type that I rely on in Meshes.jl to represent tables over geospatial domains:

https://github.com/JuliaGeometry/Meshes.jl/blob/master/src/traits/data.jl

This abstract type implements the Tables.jl interface and the main idea there is the following: if a user of the ecosystem has data that can be mapped to a geospatial domain like a mesh, image, geographic map, he/she can simply inherit from the abstract Meshes.Data type and implement two functions: domain and values. The parent type will have fallbacks for plot recipes, geospatial APIs, etc.

On the other hand, when I call Tables.rows(data::Data) I return a Meshes.DataRows object. From what I understood, the proposal consists of requesting Meshes.DataRows <: AbstractVector, is that correct? If that is the plan, I think it should be fine.

juliohm avatar Mar 13 '22 10:03 juliohm

From what I understood, the proposal consists of requesting Meshes.DataRows <: AbstractVector, is that correct?

At least this was my original proposal. Your table does not have to be AbstractVector, but the "row view" of it should and can be AbstractVector as they most likely anyway need to be different types. Still you can make Meshes.DataRows to also be a Tables.jl table if you want (most likely you do want it).

bkamins avatar Mar 13 '22 13:03 bkamins

implementing the interface is enough

@nalimilan, @bkamins, the problem IMO is that I think it's going to be too hard/complex to expect users to correctly implement all the possible indexing operations. Maybe I'm wrong here and it's not actually too hard, but from a brief glance, they'd have to implement getindex(x, ::AbstractVector{<:Integer}), bool vector indexing, etc. It feels like requiring the subtype of AbstractVector would ensure better consistency. But then again, maybe I'm over-worrying and most will subtype AbstractVector anyway and those who can't are savvy enough to correctly implement indexing.....

quinnj avatar Mar 13 '22 16:03 quinnj

I guess part of my hesitation there is the "incompleteness" of the full indexing interface; I understand the "basic" indexing interface, but that only allows single x[i]. Like, what are the full list of indexing methods I get for free when I subtype AbstractVector; because that's the list someone will have to implement manually if they don't subtype AbstractVector and I think we at least owe them what that full list that we expect to be implemented and that users can rely on.

quinnj avatar Mar 13 '22 16:03 quinnj

I think we at least owe them what that full list that we expect to be implemented and that users can rely on.

That was my point above. This is doable, but it is super hard. The users would need to additionally handle things like: view, broadcasting, similar, copy! (and related functions), keys, pairs, values, ...

In my experience if someone would implement a non-AbstractVector type that would want to mimick AbstractVector it is ~ 1 year of relatively active use of the type to catch all cases.

In summary - while we do not have to expect AbstractVector I propose to expect it for now (this was my original proposal on Discourse) and if we get reasonable requests to lift this requirement it would be a non-breaking change anyway.

BTW - we should implement a proper method for NamedTuple of vectors for this as this is the major use case that is currently problematic (preferably in this PR).

bkamins avatar Mar 13 '22 16:03 bkamins

Yes, clearly it's simpler for table implementations to subtype AbstractVector to benefit from all the fallbacks defined by Base. But This is probably what they will do anyway. We don't have to require it.

In summary - while we do not have to expect AbstractVector I propose to expect it for now (this was my original proposal on Discourse) and if we get reasonable requests to lift this requirement it would be a non-breaking change anyway.

@bkamins Actually I think that if we require an AbstractVector result now, lifting that requirement later would be breaking. It's the reverse move which wouldn't be breaking. If we require subtyping AbstractVector now, Tables.jl users will be able to rely on this type (e.g. for dispatch). Of course the breakage would be limited to new types that would not subtype AbstractVector, but still.

Anyway, since it seems that @juliohm would actually be OK with having Meshes.DataRows <: AbstractVector, maybe we don't really need this new API, and can instead tell users who want to index rows to check whether Tables.rows(table) returns an AbstractVector? This is what @bkamins originally proposed on Discourse.

nalimilan avatar Mar 13 '22 16:03 nalimilan

This is what @bkamins originally proposed on Discourse.

This was my original proposal and I am convinced to it if we are sure we are OK with requiring AbstractVector. However, if we would consider lifting this requirement in the future (even if we think we want to require it now) we would need another function.

In summary I will re-state my question from Discourse:

Does anyone object against requiring indexable rows-type to be AbstractVector subtype?

My 2cents when having a distinction could make sense (but I am not even sure if currently such table type exists so maybe it is only a hypothetical scenario).

Assume you have a table type that can serve rows via an iterator very cheaply. It potentially also can serve rows via indexable object but with some more overhead (but less than collecting the rows iterator). Then the user could choose if one wants a faster rows or a bit slower (but still faster than collect) indexablerows.

bkamins avatar Mar 13 '22 18:03 bkamins

we would need another function.

Could we add a Tables.nrows method which falls back to returning Inf or nothing, but returns the number of rows in the "row-indexable" case?

ablaom avatar Mar 13 '22 21:03 ablaom

Could we add a Tables.nrows

There are already nrow and ncol functions in DataAPI.jl for this so I do not think we should add another function that would serve the same purpose. What we could do is improve their docstrings. I have opened https://github.com/JuliaData/DataAPI.jl/issues/45 to discuss this.

Also note that what you ask for is already defined as a part of iteration protocol. If iterator does not have a defined size it should return IsInfinite() or SizeUnknown() when IteratorSize is called.

From my perspective the point of this discussion (and this was reflected in my original proposal on Discourse) is not to define new concepts/types/functions for things that already covered in the data ecosystem as if we do it then it becomes increasingly difficult to:

  • keep consistency between them (developer's perspective);
  • learn them (user's perspective).

What I would propose is to have the following priority:

  1. Base Julia (i.e. whatever Base Julia defines should not be duplicated by anything else);
  2. DataAPI.jl and StatsAPI.jl (for their respective domain of coverage);
  3. Tables.jl (for tabular data);

I think it would be great to add packages from MLJ.jl into this list (to be explicit at what level of priority they fall). @ablaom - would it be only MLUtils.jl or there are also additional interface packages we should add?

The point is that if e.g. we want to change something in Tables.jl the maintainers responsibility is to make sure we do not duplicate something that is already defined in Base Julia or DataAPI.jl or StatsAPI.jl (this is exactly the case of nrow which is defined in DataAPI.jl already) or Base Julia (IteratorSize function for iterators is already there).

When I said "we would need another function" I meant that even having nrow defined or IteratorSize to HasLength or HasShape does not guarantee that the object is indexable - it only guarantees that the object has a defined size. Note that e.g. some iterators, even if their size is known cannot support random access. Example:

x = (rand() for i in 1:10)

is an iterator, it has length (10), it has shape, but it does not (and cannot) support random access.

bkamins avatar Mar 13 '22 21:03 bkamins

This was my original proposal and I am convinced to it if we are sure we are OK with requiring AbstractVector. However, if we would consider lifting this requirement in the future (even if we think we want to require it now) we would need another function.

AFAICT we don't need to require the result of Tables.rows to be a subtype AbstractVector: tables can do that if they can return an object that supports indexing. Users can check whether that's the case, and if not call Tables.columntable or Tables.rowtable depending on which would be the most suited for the operations they want to perform.

nalimilan avatar Mar 13 '22 22:03 nalimilan

@bkamins Yes, I agree the suggestion is a poor one after all.

would it be only MLUtils.jl or there are also additional interface packages we should add?

I do think it would be good to add MLUtils.jl to the mix. (It is not part of MLJ, however, and I have little connection to it.)

DataLoaders.jl, a popular tool for handling out-of-memory datasets in the deep learning community, is based on the MLUtils.jl data container API (previously in LearnBase.jl). I'm thinking that if we can get MLUtils to play nicely with tables then MLJ to could carry out observation resampling through that interface, rather than through a mixture of the Tables and AbstractArray interface, as currently. This would be more natural, allow us to jettison the MLJ row-indexing methods for tables, and be a way for some MLJ models to support training on data that does not fit into RAM.

ablaom avatar Mar 13 '22 22:03 ablaom

if someone would implement a non-AbstractVector type that would want to mimick AbstractVector it is ~ 1 year of relatively active use of the type to catch all cases.

A test suite (living in a dedicated package) could help with this. (This is just a general point, somewhat orthogonal to this discussion though).

Does anyone object against requiring indexable rows-type to be AbstractVector subtype?

It all depends on what it would be used for. If the users would expect to use it as vectors, then it is the right choice IMO.

tpapp avatar Mar 14 '22 08:03 tpapp

@quinnj + @nalimilan I have chatted with @ablaom about MLJ.jl requirements for Tables.jl interface. Here is a summary of what I think is crucial.

@quinnj + @nalimilan => what do you think would be the best approach to support it (given the discussions we already had + maybe some additional new ideas are needed?). Thank you!

Preliminary

  • The solution should be generic and work on any table, no matter if it fits into memory or not.
  • It might be the case that some tables cannot support all the required API; then there should be a way to perform a conversion to another table type that would work (this potentially might lead to e.g. out of memory error, but I think it is acceptable as then user will see where the problem is); still probably some slow but safe fallbacks could be provided by custom out-of-core table types.

Use case 1

A typical case in ML is doing Cross Validation Here it should be possible to get a CV fold that should preserve the access type (column/row) of the original table. Ideally this operation should be non-copying, but it is possible that in some cases it must copy, so this would be acceptable.

Use case 2

Accessing rows of table in a random order (e.g. when processing data in randomly selected batches). This essentially means that MLJ.jl is able to implement the https://mldatapatternjl.readthedocs.io/en/latest/documentation/container.html#data-container API for any Tables.jl table. Essentially this requires that nobs and getobs (for a single and multiple indices) can be efficiently implemented.


"Use case 1" is a special case of "use case 2" but maybe it could support a more set of more efficient methods. "Use case 2" might be possible only for in-memory tables, but even out-of-core tables might try to support it (or at least allow for some conversion after which it would be supported, see Preliminaries).

bkamins avatar Mar 20 '22 22:03 bkamins

Thanks for the detailed summary. So it looks like the interface defined here would satisfy these use cases. What's missing is that the fallback implementation should be changed a bit and the docstring be more explicit about some guarantees:

  • getindex should be implemented not only to access a single row, but also to take a subset of rows, in which case Tables.materializer should be used to return a table of the same type as the input.
  • This means that Tables.indexablerows cannot fall back to returning the result of Tables.rows, as it's currently not guaranteed to return a table of the same type when taking a subset (though we could change this requirement in the docs).
  • Instead of throwing an error when Tables.rows does not return an AbstractVector, the fallback could call Tables.columntable or Tables.rowtable (depending on the type of table) to ensure that it always works, even using lots of memory. I'm not really sure it's a good idea though: it may be better to have it throw an error, but provide a Tables.isrowindexable trait that you could check before calling Tables.rowindexable. Then it's easy to call Tables.columntable manually if you really want to proceed.

nalimilan avatar Mar 23 '22 09:03 nalimilan

not guaranteed to return a table of the same type when taking a subset

I think it is enough that the same "mode" of the table is returned, i.e. column or row oriented.

In particular maybe we should also explicitly document that view should be supported (to avoid copying data if user prefers this, which I think will be the case most of the time).


@ablaom - can you please confirm that what @nalimilan described + my comment would fulfill your needs on high-level (then we should finalize the design in detail and submit it to another check with you).

bkamins avatar Mar 23 '22 14:03 bkamins

Thanks @bkamins and @nalimilan .

Yes, I think what is described would fit my own needs.

Tables.indexablerows on multiple-index returns a table with the same row/column access (Use Case 1). In Use Case 2, I simply call Tables.rows first, to switch the access type, yes?

I agree that returning a table of the same type doesn't need to be a hard requirement. For a start, this seems inconsistent with @bkamins sensible suggestion that views are preferred, and views generally don't have the same type as the original table.

it may be better to have it throw an error, but provide a Tables.isrowindexable trait that you could check before calling Tables.rowindexable. Then it's easy to call Tables.columntable manually if you really want to proceed.

This is probably a question for the designers. But I'm curious what precisely isrowsindexable(X) is going to guarantee. I mean, it can never be a solid guarantee that I won't run out of memory, because that is not absolutely knowable. But what other kind of error could I expect?

ablaom avatar Mar 23 '22 21:03 ablaom

Tables.indexablerows on multiple-index returns a table with the same row/column access (Use Case 1). In Use Case 2, I simply call Tables.rows first, to switch the access type, yes?

Well, no, you'd call Tables.indexablerows in both cases. The point of adding this API is that Tables.rows is (currently) not guaranteed to be indexable.

This is probably a question for the designers. But I'm curious what precisely isrowsindexable(X) is going to guarantee. I mean, it can never be a solid guarantee that I won't run out of memory, because that is not absolutely knowable. But what other kind of error could I expect?

Here's one. :-) https://github.com/JuliaData/Tables.jl/pull/278/files#diff-12abce6071269cefb726c033c3832f2e0326e3a2fe769e67323a1b5ac0772483R111

nalimilan avatar Mar 23 '22 21:03 nalimilan

Well, no, you'd call Tables.indexablerows in both cases. The point of adding this API is that Tables.rows is (currently) not guaranteed to be indexable.

Perhaps there is a misunderstanding. I mean, if I'm in use case 2 and I want fast row indexing, then I calll Tables.indexablerows(Tables.rows(X)). In case 1, I just call Tables.indexablerows(X) because I want to preserve row/column mode of the result in that case.

Here's one. :-) https://github.com/JuliaData/Tables.jl/pull/278/files#diff-12abce6071269cefb726c033c3832f2e0326e3a2fe769e67323a1b5ac0772483R111

Ha ha. You are just pointing to an error thrown by the current proposal. I mean, if instead of throwing an error, the fallback just calls collect on the result of Tables.rows (when that result is not AbstractVector) what error (apart from memory overflow) could I expect? If only that kind of error is possible, then why try to catch it? Unless it's possible that no error is thrown but unexpected behaviour could result. Or perhaps, the point is the an out-of-memory error might eventually be thrown, but there could be a long delay one would prefer to avoid? I guess that makes sense. Maybe issue a warning instead??

I'm just thinking aloud here and of course happy for you to settle this detail.

ablaom avatar Mar 24 '22 02:03 ablaom

Perhaps there is a misunderstanding. I mean, if I'm in use case 2 and I want fast row indexing, then I calll Tables.indexablerows(Tables.rows(X)). In case 1, I just call Tables.indexablerows(X) because I want to preserve row/column mode of the result in that case.

I see. But Tables.rows isn't supposed to return an object for which row indexing will be more efficient. All it guarantees is that it supports the AbstractRows API. For example, with a DataFrame object, it Tables.rows returns a DataFrameRows which is just wrapper around the DataFrame, thus providing relatively efficient row iteration, but not faster than using eachrow directly on the DataFrame.

If we want to support different performance variants, maybe that should be done via an argument to Tables.indexablerows which would allow specifying whether you want the fastest possible object, even if that implies copying or recompiling some functions to specialize on column types. For example for a DataFrame that would involve returning a type-stable NamedTuple of vectors, without any copy but with specialization (and thus some compilation overhead). Is that the kind of performance you're looking for, or would the performance of DataFrame indexing be enough for you?

Ha ha. You are just pointing to an error thrown by the current proposal. I mean, if instead of throwing an error, the fallback just calls collect on the result of Tables.rows (when that result is not AbstractVector) what error (apart from memory overflow) could I expect? If only that kind of error is possible, then why try to catch it? Unless it's possible that no error is thrown but unexpected behaviour could result. Or perhaps, the point is the an out-of-memory error might eventually be thrown, but there could be a long delay one would prefer to avoid? I guess that makes sense. Maybe issue a warning instead??

My point was that calling collect(Tables.rows(t)) (or Tables.rowtable(t) or Tables.columntable(t), which will probably give a more efficient result) is really easy to do if you know you really want it. Going out of memory can sometimes kill Julia or other processes (or make the system hand as you say). But I realize that Tables.columns(t) will already make a copy when passed a row-oriented table so I guess having Tables.rowindexable would be consistent with that. Let's hear what @quinnj thinks since he wrote that line throwing an error.

nalimilan avatar Mar 24 '22 08:03 nalimilan

For example, with a DataFrame object, it Tables.rows returns a DataFrameRows which is just wrapper around the DataFrame, thus providing relatively efficient row iteration, but not faster than using eachrow directly on the DataFrame.

Sorry, I don't understand:

julia> eachrow(df) === Tables.rows(df)
true

Maybe you mean copy.(eachrow(df)) is faster than Tables.rows(df)?

But, yes, an option to get the fastest row-indexable object would be nice. I couldn't say if it is necessary without doing a lot of benchmarking. Is this the sort of thing you'd like to see before we proceed?

ablaom avatar Mar 25 '22 04:03 ablaom

I just said "not faster", because eachrow(df) isn't super fast either if you want to access rows many times (compared with a named tuple of columns in a function specialized on its type). I don't think benchmarking is needed, I just wonder what are your general expectations. For example, what are you doing currently in MLJ? Do you simply index rows of a DataFrame?

nalimilan avatar Mar 25 '22 10:03 nalimilan

As a side comment: eachrow(df) (and currently Tables.rows which calls it) is slow on purpose (to avoid compilation of very wide tables), and if you want type-stable version Tables.rowtable should be used.

In general Tables.rowtable and Tables.columntable respectively, if I understand the design intentions correctly correctly, are added exactly because if someone has a generic "table" and wants to turn it into in-memory table that is fast (i.e. type stable) and is row- or column- oriented respectively then you have a consistent way to achieve this.

bkamins avatar Mar 25 '22 11:03 bkamins

Currently, in use case 2 (models internally consuming data observation-by-observation or in observation batches) the model materialises the table as a matrix, and one takes the adjoint to get observations as columns, which is generally what the models we are wrapping expect. So, for optimal performance, a user might supply her data as Tables.table(Xmat'), where Xmat is an p x n matrix (n observations) so that the adjoints cancel out.

However, in requesting the use-case-2 Tables API enhancements, I am thinking more about other models which MLJ does not yet interface so well (eg, neural networks). These models provide the option of consuming data delivered not as a matrix, but using a more general API like the one at MLUtils (the getobs interface) which are suitable for data that may not fit into memory. However, I want to fit the special case of in-memory tables into this framework, and for this we need some kind of random access to rows.

Here's the way this will probably work in practise: a small (typically 1 <= n < 1000) batch of observations will be requested, this short table will likely be materialised as a p x n matrix (n the batch size) for consumption by the core model, and that process will be repeated many times. For instance, one might do this for every batch in a partition of the complete set of observations, shuffle the observations, re-partition, and then repeat fifty times, say. I don't say this is always the way this would work, but something like this is probably representative.

ablaom avatar Mar 27 '22 21:03 ablaom

additional comment from https://github.com/JuliaML/MLUtils.jl/issues/67#issuecomment-1092471714

bkamins avatar Apr 08 '22 08:04 bkamins

Could we get something like this merged before the start of GSoC projects? I know some students are interested in building a panel data interface, and I think that gets significantly easier if you can do basic Split/Apply/Combine methods on Tables.

ParadaCarleton avatar Apr 16 '22 15:04 ParadaCarleton

Ok, jumping back into things here; sorry for the delay; personal life has been a bit crazy lately.

  • getindex should be implemented not only to access a single row, but also to take a subset of rows, in which case Tables.materializer should be used to return a table of the same type as the input.

Hmmm, I didn't come to this conclusion reading @bkamins summary here. To me, it sounds like it's just desirable to have a Tables.rows-like thing (i.e. Tables.indexablerows) that's as efficient as possible, but not necessarily the same type as the original input. I actually think it would be fairly hard/rare for table types to support getindex w/ multiple indices AND return the same type as the original. i.e. that's excluding view from working on a vector of namedtuples.

This is yet another reason why requiring the result of Tables.indexablerows to subtype AbstractVector, since we get these kind of efficient "views" for free and they're efficient. As defined in this PR, we also get this "for free" for any column-oriented table source, since RowIterator can easily subtype AbstractVector and views on it will be efficient/non-copying.

That said, I don't feel strongly about requiring the subtyping, so I'm fine leaving that door open for now, though strongly recommending it and noting the minimum required methods to be implemented (which would be more than the currently defined Indexable interface in Base, which only requires single-int getindex).

The final point I'm still considering is whether we throw an error by default, or call indexablerows(columntable(x)) in the fallback case if the result of rows(x) isn't a subtype of AbstractVector. I'm leaning towards not throwing an error, since it would be slightly more convenient for users, and we can try to be clear of the ramifications of that in the docs. In particular, the use of Tables.indexablerows should follow a flow of:

  1. call irows = Tables.indexablerows(x) on the input x
  2. do things with irows repeatedly; getindex, etc.

What should be avoided, is repeatedly calling Tables.indexablerows(x) over and over, since it can potentially be a costly conversion. Just call it once, then re-use that indexable rows object for subsequent operations. @ablaom, it seems like that would work well in your case? We could rely on the generally efficient columntable(x) call to convert table inputs who don't natively support Tables.indexablerows and then you would use the result of indexablerows to do repeated partitioning/CV worklfows.

If that all sounds good, I'll clean this PR up: add some tests, lots of doc details, and we can merge in the next few days.

quinnj avatar Apr 26 '22 05:04 quinnj