DataFrames.jl
Metadata on data frame and column level
This PR waits for https://github.com/JuliaData/DataAPI.jl/pull/48.
I have done an initial implementation. Now we need to discuss for which methods metadata propagation should happen. For now I have implemented it for getindex.
I stopped at hcat - if we hcat several data frames, how do you think we should handle metadata? The options are:
- drop all metadata
- use only left table metadata
- merge metadata by overwriting left table metadata with right table metadata
- merge metadata by ignoring right table metadata that is already in left table metadata
Which one do we pick (when we have this decision it will naturally propagate to other cases).
Fixes #2961 #35 #2276
Another option to consider is to use metadata only if there are no conflicts between input data frames (i.e. it's present in one but absent from others, or equal in all data frames that have it). The advantage is that it would be order-independent.
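For concreteness, here is a minimal sketch of how these merge strategies differ on plain Julia Dicts (illustration only, not the PR's implementation):

left = Dict("source" => "survey A", "month" => "January")
right = Dict("month" => "February", "note" => "raw data")
# overwrite left table metadata with right table metadata: right wins on conflicts
merge(left, right) # "month" => "February", plus "source" and "note"
# ignore right table metadata already present in left: left wins on conflicts
merge(right, left) # "month" => "January", plus "source" and "note"
# order-independent variant: keep a key only if the inputs do not conflict on it
Dict(k => v for d in (left, right) for (k, v) in d
     if all(!haskey(o, k) || o[k] == v for o in (left, right)))
# "month" is dropped, "source" and "note" are kept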
FWIW, R's rbind (for both data.frame and tibble) takes variable labels from the first table, even in case of conflict; for some reason, dplyr's bind_rows drops them. Not sure what Stata does. Maybe @pdeffebach knows.
For joining in Stata, the left data frame takes precedence. I think this is the correct default, and we should do it in DataFrames.jl as well. See this gist describing Stata's behavior.
For hcat, since we don't have an option to overwrite column names (unless I'm forgetting), I think it's fair for columns to keep their metadata even if they get renamed to something like dup_column_1.
For joining in Stata, the left data frame takes precedence.
You mean that if left and right table have the same "table level" (not column level) metadata key, then value is kept from left table?
(please keep in mind that we will have two kind of metadata: table level and column level; now we are discussing table level metadata)
Ah. Sorry for the confusion.
Just did some research. It looks like Stata does not have named dataset-level metadata, for example "Date" or "Source". It's just a vector of strings. So Stata doesn't deal with this explicitly. All the notes just get added together.
But I still think having the left one be dominant is the right way to go.
Don't you think it would be confusing or even dangerous if doing hcat(januarydf, februarydf), with inputs having meta-data "month" => "January" and "month" => "February" respectively, gave a data frame with only "month" => "January"? I'd rather have at least some conflict detection, or just drop all metadata when calling hcat for now.
EDIT: joins are different as in leftjoin the first argument has the main role, and conversely for rightjoin; things are less clear for outerjoin and innerjoin.
Good point. But still, left taking precedence seems like a consistent default that will cause fewer headaches for users than something as destructive as getting rid of metadata.
I would also agree that having the left data be dominant makes sense. It's the table for which you're keeping all keys (+ rows) and the joining table is "additional", so it feels like that would make sense to me.
@quinnj You're thinking about leftjoin, right? What about rightjoin?
@nalimilan - can you please have a look at the implementation? If it is OK for you I will go ahead and add:
- manual section on metadata
- tests
- track all places where propagating metadata would make sense
I have implemented both table and column level metadata.
The dict of dicts approach to store per-column metadata can always be improved later if needed.
We should decide on it now. The reason is that breaking internals of DataFrame breaks serialization, so we should not do such changes too often. I chose "dict of dicts" because if only a few columns have metadata it uses the least memory. What alternatives do you see? A vector of dicts or a dict of vectors?
Note: this PR needs to wait till https://github.com/JuliaData/DataFrames.jl/pull/3047 is merged. Then it needs to be rebased (the reason is that in https://github.com/JuliaData/DataFrames.jl/pull/3047 we add methods that have to handle metadata correctly)
We should decide on it now. The reason is that breaking internals of DataFrame breaks serialization, so we should not do such changes too often. I chose "dict of dicts" because if only a few columns have metadata it uses the least memory. What alternatives do you see? A vector of dicts or a dict of vectors?
Yes. More precisely a Dict{String, Vector} holding Vector{Union{Nothing, Some}} objects with one entry per column equal to nothing if no metadata is set for a column. The advantage would be to use less memory and to reduce the number of objects that the GC has to track in the case where all columns share common metadata keys (the main use case I have in mind is column labels), and there are more columns than different metadata keys. But given that the vectors wouldn't be concretely typed I'm not sure the gain would be so large, and you're right that it wouldn't be as efficient if metadata is set only for some columns. It would also be more complex as metadata(df, col) would have to return a custom lazy AbstractDict that would be a view of this structure.
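To make the comparison concrete, a rough sketch of the two layouts (variable names are mine, purely for illustration):

# "dict of dicts" (the layout used in this PR): entries exist only for
# columns that actually carry metadata, so it is cheap when metadata is sparse
colmeta_dict_of_dicts = Dict{Symbol, Dict{String, Any}}(
    :price => Dict{String, Any}("label" => "unit price", "unit" => "USD"))
# alternative: one vector per metadata key, with one slot per column;
# nothing marks columns without a value for that key. Cheaper when most
# columns share the same keys (e.g. labels), but metadata(df, col) would then
# need to return a lazy AbstractDict view over this structure
colmeta_key_to_vectors = Dict{String, Vector{Union{Nothing, Some}}}(
    "label" => [Some("unit price"), nothing, nothing]) # one slot per column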
Ah - now I see we do not need to wait for https://github.com/JuliaData/DataFrames.jl/pull/3047 as I intentionally kept there only functions that do not mutate list of columns. Problematic will be e.g. pushfirst! but I will open a PR for this later.
Looking at the PR right now, is it true that if the column :y has metadata, then
transform(df, :y => f => :y)
will destroy that metadata?
Yes. The idea is that transform(df, :y => ByRow(y -> y^2) => :y) will make metadata such as :unit => "m/s" incorrect.
Only if f were identity or copy would the metadata be retained.
Okay. I guess the equivalent in Stata is replace x = ... if .... And we don't have that feature yet.
I guess the equivalent in Stata is
replace x = ... if .... And we don't have that feature yet.
Could you please elaborate what you mean there? Thank you!
Stata has the command replace which is meant for replacing a variable. So you might say
replace x = missing if x == -999
or even
replace y = y^2
this preserves metadata.
To perform a destructive transformation, which deletes metadata like transform(df, :y => (y -> y^2) => :y), you would create a new variable and rename.
gen y2 = y^2
drop y
rename y2 y
So maybe there is room for a replace function someplace, some time in the future, for the purposes of transformations which preserve metadata (even though you can do replace y = y^2 which messes up units).
@pdeffebach - so essentially you mean that you would like to be able to give a hint if some transformation should retain metadata or not.
By default we will drop it, but I think we could, in the future, add something like source => fun => KeepMeta(:some_col) to hint that metadata in :some_col should be kept if this column is already present in a data frame and has metadata. I would leave such a feature for the future.
At least to replace only some values you would do transform!(view(df, df.x .== -999, :), :y => ByRow(y -> y^2) => :y), for which it would make sense to preserve metadata. Whether it's easy to do is another question... :-D
@nalimilan - could you please review the PR on a high level at this point (not the details of the implementation but the general approach and look&feel). I need to stop working on it and wait for https://github.com/JuliaData/DataFrames.jl/pull/3070, https://github.com/JuliaData/DataFrames.jl/pull/3071, and https://github.com/JuliaData/DataFrames.jl/pull/3072 to be merged as otherwise the number of merge conflicts will be very problematic (I already had problems with correct resolution of merge conflicts in #3070 after your threads PR). Thank you!
The plan should be:
- you do high-level review to make sure that the approach we have is OK
- then we merge #3070, #3071, #3072
- then we merge and release https://github.com/JuliaData/DataAPI.jl/pull/48 (by then we should be sure all is OK)
- then I update this PR, add tests and examples (I cannot add tests and examples before we merge https://github.com/JuliaData/DataAPI.jl/pull/48)
OK?
I want to bring up the point again about transform destroying metadata when it overwrites columns. I think operations like
@rtransform :x = coalesce(:x, 0)
are more common than
@rtransform df :x = :x^2
and I'm worried that constantly having to re-apply metadata would be a big overhead.
destroying metadata when it overwrites columns
I understand your concerns. What exact rules would you propose?
E.g. in this case:
@rtransform df :x = :y :z = :x :y = :z^2
I think we should always keep the existing metadata of the destination, and leave it up to the user to change the metadata. :x = :y: keep metadata. :x = :x^2: also keep metadata.
So we should ignore source metadata unless it is a new column? Is this what Stata does?
I have thought about @pdeffebach's request and I think it would make sense; however, we need to make a decision. We have two possible rule sets:
Rule 1 (simpler)
Rule:
- if in some operation there is a single input column and a single output column then metadata of the source column replaces metadata of the target column (with the exception of unstack - see its docstring)
- if in some operation there are multiple input columns, metadata in the output is dropped (with the exception of vcat and stack - see their docstrings)
Rule 2 (more complex)
Rule:
- if in some operation there is a single input column and a single output column then metadata of the source column replaces metadata of the target column in select[!] and transform[!], but it is dropped in all other operations (like combine and mapcols)
- if in some operation there are multiple input columns, metadata in the output is dropped (with the exception of vcat and stack - see their docstrings)
Under both rules we would need to add a clear explanation to users that if column metadata is invalidated by the transformation they perform (e.g. a unit of measure when we square a column) then they need to manually drop it. I think this is OK, as in this way we "correctly keep metadata in most common cases" and "incorrectly keep it in a few cases", while more conservative rules would mean that we "incorrectly drop metadata in many cases".
Now regarding the choice between rule 1 and rule 2:
- under rule 1 everything is consistent and users have simpler rules to learn, but we more often incorrectly propagate metadata;
- under rule 2 users need to learn more special cases, especially combine vs select/transform, but they get a correct result more often. However, "more often" here is soft - the number of cases where this matters is not that large, e.g. combine(df, :x => sum) would drop :x metadata, while select(df, :x => sum) would keep it, but probably it should drop it too, so things are not clear-cut (see the sketch below).
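A hypothetical sketch of that combine vs select distinction (assuming column-level metadata, e.g. a label, has already been attached to :x):

using DataFrames
df = DataFrame(x = 1:3)
# assume :x carries column-level metadata "label" => "measurement"
transform(df, :x => ByRow(abs) => :y)
# single input column, single output column:
# under both rules :y would inherit the metadata of :x
combine(df, :x => sum => :total)
# single input and single output, but not select/transform:
# rule 1 would keep the metadata of :x on :total, rule 2 would drop it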
I personally would go for rule 1 to have a simpler ruleset.
@nalimilan, @pdeffebach, @matthieugomez, @nilshg, @oxinabox - can you please comment what you think? I would like to close this discussion, implement what we agree on and make a 1.4 release for JuliaCon2022.
To add context to the discussion: we have been discussing adding metadata for years now. There is no single solution that will work in all cases. We need to make a decision and go with it (and I will document that metadata handling might change in the future and will not be considered a breaking change, so we can correct things if needed in future 1.x releases). From a development perspective this PR needs to be made in one shot as it affects the whole package, so I have stalled other development because otherwise it is nearly impossible to correctly make all the changes (I have already rebased this PR once and it took me several days).
I'm in favor of rule 1. I agree that keeping things simpler/more predictable for users will have the most benefit long term. And we can always tweak a case or two if they come up often as confusing/incorrect. Thanks for the detailed write up @bkamins and for all the work on this.
Could we follow rule 1, but with the additional condition that the output column has the same name as the input? You're much more likely to do :x = coalesce(:x, 0) than :x = :x^2, as the latter completely changes the meaning of the values, which would be confusing if the name is preserved.
Anyway I think we all agree that we'd better merge something relatively simple now, and continue discussing and refining the rules later if necessary.
but with the additional condition that the output column has the same name as the input?
Well - the problem is that column renaming like :y = :x would then drop metadata. The rules above were designed to make sure that they cover column renaming (and also, in consequence, transformations like :y = identity(:x) and :y = copy(:x), which I think should also keep metadata).
Of course it is not a problem to add an additional exception for column renaming to the rules, so I am not sure what is better.
Yes, then we would have two rules: one for equality of names, and one for identity/copy/rename operations. Not sure whether that's OK or not; I would need to read the docs carefully again.
Not sure whether that's OK or not
That is OK from the "logical consistency level". Let us wait for users' feedback (especially from @pdeffebach as he has a lot of experience working with metadata).
@nalimilan I have updated the description of metadata propagation rules in docs/src/lib/metadata.md to follow what we have discussed.
Sorry for the delay on this. I'm with @nalimilan (I think). I would say
:x = :y
shouldn't forward metadata, if we are keeping things simple for now. Adding a rule about when an operation is a simple copy etc. probably makes things too complex. I understand the irony of saying this because I am advocating for a more complex rule in having :x = coalesce(:x, 0) propagate.
But yes, overall I think rule 1 is good and makes sense.
I would say
:x = :y shouldn't forward metadata
So what should happen with metadata in this case? Actually there are two cases:
a) :x is already present in the source data frame
b) :x is not present yet in the source data frame
Also, what should then happen if you write rename(df, :y => :x), as this is essentially the same operation?
Our current understanding was that column renaming (copying or non-copying) should take metadata from the source column.
In Stata,
gen y = x
does not forward metadata. So that's where my intuition comes from.
Stata has
replace x = x + 1
which does preserve metadata, and is the motivation for my proposed rule in transform. We have no equivalent to replace, and writing a separate replace function doesn't make a lot of sense when transform can have multiple transformations in a single call, i.e.
@rtransform df begin
    :x = :x + 1 # preserves metadata
    :y = :x^2 # does not preserve metadata
end
Stata also has rename y x (rename y to x) which does preserve metadata.
yes, but in Stata:
gen y = x
strictly generates a new variable AFAICT (I understand that if y were already present it would error - right?)
A side question: what does
replace y = x
do?
But apart from these discussions I do not see a reason why gen y = x would not set metadata for y to be equal to that of x. What is the logic behind such behavior?
Thinking more about it - would the Stata logic be as follows?
- keep metadata of existing columns unchanged;
- new columns have no metadata.
Is this correct? Maybe indeed this is a rule that is good enough? (it is easy to understand and easy to implement)
- keep metadata of existing columns unchanged;
- new columns have no metadata.
Yes. I think this is a clear and consistent rule. Of course, it allows for :x = :x^2. But I think this is a small sacrifice and people can manage that meta-data manually.
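Spelled out on concrete calls (purely illustrative, assuming a label was attached to :x beforehand):

using DataFrames
df = DataFrame(x = [-999, 2, 3])
# assume :x carries column-level metadata "label" => "sales"
transform(df, :x => ByRow(v -> v == -999 ? missing : v) => :x)
# :x already exists, so under this rule its metadata is kept unchanged
# (the common coalesce/replace-style case)
transform(df, :x => ByRow(v -> v^2) => :y)
# :y is a new column, so it starts with no metadata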
@nalimilan (and anyone else willing to) - this PR should be ready for review (or at least friendly testing).
Apart from adding metadata, the new tests caught a rename and rename! bug, and I decided to make completecases and nonunique a bit more flexible in corner cases (so that they support rare situations rather than throw an error).
Thank you!
This PR should be ready for the next review.
What things need to be verified:
- If we feel that the API using hasmetadata/hascolmetadata/metadata/colmetadata is convenient enough; if yes, then the related PR https://github.com/JuliaData/DataAPI.jl/pull/48 should be merged and this PR updated to use the interface defined in DataAPI.jl.
- If we like the metadata propagation rules (we are in agreement with @nalimilan on what the rules should be, and I do not want to change them for the 1.4 release, but maybe some useful comments would be given).
- Some friendly testing of the PR (which would additionally confirm if we like point 1. and point 2. in practice).
Some specific mentions (if you do not have time to go through the whole PR - which is likely - it is enough to read the new material on metadata that is present in the manual):
- @pdeffebach - the propagation rules we implement are not the same as in Stata, but I think our rules are better - of course it is debatable :)
- @krynju - can you please have a look at this ecosystem, as it would be good to consistently support it in DTable (and this is the reason why I am waiting with looking at your changes to the DTable interface - this PR changes almost every part of DataFrames.jl so it would be good to finalize it before we extract out the interface)
- @quinnj - after this PR is merged it would be great to have Arrow.jl support reading/writing the metadata using the defined API, so can you please confirm that you like it?
- @ExpandingMan - after this PR is merged it would be great to have Parquet2.jl support reading/writing the metadata using the defined API, so can you please confirm that you like it?
- @quinnj - additionally, my plan is that after this PR I create a small package that is able to inject table metadata into CSVs in TOML format as comments, taking advantage of https://csv.juliadata.org/v0.5/#Commented-Rows-1. However, maybe you feel that this is functionality that makes sense as part of CSV.jl (if yes, we can discuss it separately in a CSV.jl issue after this PR is merged)
- @oxinabox - your general comments, from the user's perspective, on whether adding metadata to DataFrames.jl in the way proposed would be useful are welcome.
This PR adds metadata support; in a follow-up (probably not in the 1.4 release) we will discuss in #3076 how metadata should be handled when displaying data frames (maybe this should even be moved to PrettyTables.jl for more general support of metadata display for any table). This will be discussed later with @ronisbr.
While this looks about as reasonable as it can be, I remain convinced of the rather unpopular idea that metadata should not be part of DataFrame; I don't see it as fitting nicely with the "ordered named tuple of like-length rank-1 arrays" concept that is so natural for tables.
The ambiguities arising from row-wise operations are a case-in-point. Even operations involving a single column don't have clear semantics for the metadata. For example, with the design of this PR, transformations such as filtering preserve metadata, but there is a huge class of cases in which this is totally inappropriate. Indeed, I immediately hit an example for Parquet2: parquet metadata includes column statistics, none of which are preserved under all the operations listed here. I'm quite uncomfortable with all the other conventions listed here as well, not because I feel they were badly chosen or not well-thought-out, but because they simply don't generalize naturally and it's not at all clear (to me at least) that any of these will turn out to be a good default choice. At best, these conventions will fail spectacularly some of the time.
I remain of the opinion that metadata should not be included in DataFrames.jl but should be implemented in a separate package that depends only on Tables.jl and which is willing to make more strict guarantees about what exactly is permitted.
In the meantime, I'll comment on the compatibility of this PR with parquet. Parquet allows for Dict{String,String} metadata associated with the entire table as well as each column, much as in this PR. Additionally, it allows for some statistics which I currently handle by optionally wrapping the output in a special AbstractVector. All of this I think would be well-accommodated by this PR, but with the aforementioned caveat that literally none of it (I think) would be guaranteed to make any sense after table transformations. Since my AbstractVector types are referred to by a DataFrame if copycols=false, I would probably elect to only copy the arbitrary metadata Dict{String,String} and not include the statistics.
@ExpandingMan - your point is the reason why metadata was postponed for so long. No one is really sure what is best here.
My original opinion was to never keep metadata under any transformation (which would guarantee we do not generate "false positives"). In this approach metadata would be mostly used as a way to attach reference information to a data frame when it is saved/loaded. I guess this would be consistent with your recommendation.
However, under such an approach users complained that a lot of code using metadata would look like:
1. take some data frame
2. perform operation on it
3. copy metadata from source to target
4. repeat steps 2 and 3
so we decided to try defining some metadata propagation rules. We are aware that such rules will produce "false positive" cases, in which the user will have to manually wipe metadata.
I am personally not sure what is best. Also - to be clear - although I have invested a lot of time in this PR, I am totally fine with dropping it (or minimizing its scope) if we decide to do so. However, I would welcome a really serious discussion about what we want, and after that decision is made we should follow it and not go back to it in the future. The point of my effort was to provide a complete implementation of metadata propagation rules that are consistent under the assumption that we want to propagate as much metadata as makes sense, so that we have something concrete to discuss (as opposed to discussing what-if scenarios). In the PR it is clear what the consequences of the requirement to propagate metadata are.
your point is the reason why metadata was postponed for so long. No one is really sure what is best here.
I find that when that sort of thing happens the best thing to do is take the hint and move on.
More broadly, I think what is happening here is that the generalization of "dataframes" (which we by now have defined pretty clearly as ordered named tuples of like-size rank-1 arrays) to "dataframes with metadata" is not only non-unique but there isn't even a "natural" choice of generalization that we could reasonably expect to work for most use cases.
My recommendation would be to, rather than making an attempt at a general implementation that is likely to be plagued with issues, make more specific implementations that are more appropriate for more specific cases. The way I see it, what you are trying to do here can only be adequately achieved by developers creating implementations as needed. The great thing about Julia and DataFrames.jl is that writing such implementations is far easier than it would be with e.g. pandas. Why there are not more existing implementations I can only speculate, but perhaps the best course of action right now is to write a package that implements the overall structure you are proposing in this PR but which depends only on Tables.jl. This would avoid committing DataFrames.jl to such a speculative design and might turn out to be very revealing through the issues that get raised for it.
Anyway, as the tone of the preceding has been quite negative, I want to make sure I say that I hugely appreciate all the work you are doing here. Indeed, the main reason I'm extremely apprehensive about putting something like this in DataFrames.jl is that I consider the package as-is quite pristine, and I am worried about it getting degraded by feature creep.
Also, if I cannot talk you out of committing to this, my next suggestion would be to (at least initially) completely discard metadata when any changes are made unless explicitly set otherwise.
I could be wrong but I think it would be easier to start there and then add persistence later than to start out with a bunch of ad hoc persistence rules and later try to roll them back. (I believe it would be a lot less breaking.)
I don't think @bkamins needs to maintain the "purity" of DataFrames. Usability should take precedence, and as I've argued in this thread above, being over-eager in destroying metadata would make metadata unworkable in practice. Every single transformation would require "forwarding" and it would be a pain to work with.
If you are familiar with Stata's labeling and notes, the workflow in this PR reproduces that workflow in a way that is almost as easy to use as Stata, which is quite a feat.
A Tables.jl-focused re-write of metadata would be problematic:
- Current Tables-agnostic transformations in TableOperations.jl are extremely limited. The DataFrames.jl mini-language for transformations is expansive and useful, but is opinionated. Replacing the DataFrames.jl transform calls with Tables.jl-neutral ones, just for the purposes of propagating metadata, would be a hit to usability.
- It is true that Julia, in theory, allows for many different implementations of DataFrames. But that is not always a good thing. For one, the DataFrames API is very large, and any package that re-implemented it for a concrete type would effectively be rewriting DataFrames. The complexity of DataFrames is the reason we do not have separate implementations. This is a good thing, though. Standardization is valuable. It would be a shame if we had a fractured ecosystem just because some people disagree about a few API opinions about metadata.
The ambiguities arising from row-wise operations are a case-in-point. Even operations involving a single column don't have clear semantics for the metadata.
There are many scenarios with unclear semantics ex-ante. But we cannot limit ourselves only to semantics which are ex-ante clear. Rather, DataFrames.jl gets to document the semantics and standardize them.
FWIW, I just checked out the branch and built the docs. I think the API is great.
Here are two helper functions I quickly wrote to make things a bit easier.
julia> function label!(df, p::Pair)
name = first(p)
lab = last(p)
colmetadata(df, name)["label"] = lab
return df
end;
julia> function labels(df)
n = names(df)
out = DataFrame(Column = n)
labels = map(n) do ni
cdict = colmetadata(df, ni)
get(cdict, "label", missing)
end
out.Label = labels
out
end;
julia> df = DataFrame(sales_1w_end = rand(100));
julia> @chain df begin
@rtransform! :log_sales = log(:sales_1w_end)
label!(:log_sales => "Logarithm of sales")
end;
julia> labels(df)
2×2 DataFrame
Row │ Column Label
│ String String?
─────┼──────────────────────────────────
1 │ sales_1w_end missing
2 │ log_sales Logarithm of sales
I think a helper package to force everyone to agree on label etc. would be helpful. I would be happy to write it.
I have opened a poll on Discourse for this topic at https://discourse.julialang.org/t/dataframes-jl-metadata/84544. Anyone interested is invited to vote. The objective of the poll is to find out what the community in general prefers. I hope it will help us to reach a good decision.
In the meantime, I'll comment on the compatibility of this PR with parquet. Parquet allows for
Dict{String,String} metadata associated with the entire table as well as each column, much as in this PR. Additionally, it allows for some statistics which I currently handle by optionally wrapping the output in a special AbstractVector. All of this I think would be well-accommodated by this PR, but with the aforementioned caveat that literally none of it (I think) would be guaranteed to make any sense after table transformations. Since my AbstractVector types are referred to by a DataFrame if copycols=false, I would probably elect to only copy the arbitrary metadata Dict{String,String} and not include the statistics.
It seems to me that only the former (Dict{String,String}) should be considered as metadata in the sense of this PR (and also in the sense of R or Stata). Statistics on the data should be handled differently (like with a custom vector type as you currently do). Indeed a major reason to support metadata rather than requiring users to handle a dict separately is to be able to propagate it automatically in at least some operations, and taking a subset is the poster child for this. Adopting a definition of metadata that doesn't allow propagating it when subsetting rows/columns would be self-defeating IMO.
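For what it's worth, a minimal sketch of the "statistics live on the vector, not in metadata" approach (hypothetical type, not the actual Parquet2.jl implementation):

# a wrapper that carries per-column statistics alongside the data;
# operations that materialize a new vector naturally drop them
struct StatVector{T, V<:AbstractVector{T}} <: AbstractVector{T}
    data::V
    min::T
    max::T
end
Base.size(v::StatVector) = size(v.data)
Base.getindex(v::StatVector, i::Int) = v.data[i]
v = StatVector([3, 1, 2], 1, 3)
sum(v)  # 6 - behaves like a plain vector
v[2:3]  # indexing returns a plain Vector, so the statistics are gone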
I think it would be good to add to the documentation what happens in groupings (metadata is preserved).
@tp2750
It is documented here https://github.com/JuliaData/DataFrames.jl/pull/3055/files#diff-5ec522580bc313447378c5cab2f8fb37d72aeba71dbb98ca8d3b50a05e2f163aR10
How would you propose to improve it?
Statistics on the data should be handled differently (like with a custom vector type as you currently do). Indeed a major reason to support metadata rather than requiring users to handle a dict separately is to be able to propagate it automatically in at least some operations, and taking a subset is the poster child for this. Adopting a definition of metadata that doesn't allow propagating it when subsetting rows/columns would be self-defeating IMO.
That we are quickly led to the conclusion that statistics cannot be included in metadata seems like a red flag to me. When you exclude both statistics and the kinds of type information that are handled by the AbstractVector container, what you are left with are cases that look suspiciously application-specific. Can we even generate many good examples of use cases for this? (There may have been extensive discussion on this before that I have missed.)
@pdeffebach, actually I realize I more-or-less agree with the second half of this comment. I should not have said an implementation that requires only Tables.jl; that indeed sounds overly optimistic. I am more concerned about putting a highly dubious feature into DataFrames.jl that has to be maintained and that we are stuck with for all eternity; any experiment with metadata that does not lead to a similarly bad result would seem fine to me.
OK - the monster PR is ready for your checking.
The design follows https://github.com/JuliaData/DataAPI.jl/pull/48:
- :none-style metadata (both table and column level) is never propagated except in the DataFrame constructor and copy
- :note-style metadata is propagated following the rules discussed in this PR
@nalimilan - I am finishing applying patches. codecov misses will be gone after DataAPI.jl is released.
Maybe we should say "table-level", "column-level" and ":note-style" (with dashes) everywhere as it makes reading complex expressions like "table-level and column-level :note-style metadata" easier? I'd like a confirmation by a native speaker.
This "-" thing is more complex (I am finishing copyediting of https://www.manning.com/books/julia-for-data-analysis so I had a lot of such discussions). However, I propose to just switch to the style you outlined for improved readability (without going into the details of English grammar, as they sometimes imply "-" should be used and sometimes it should not be used in these cases AFAICT).
@nalimilan tests will fail on Julia 1.0, but let us ignore it. We will soon bump the required Julia version to at least 1.6 before the 1.4 release. But I want to do it in a separate PR.
@nalimilan - I am done with the updates after your review. metadata.jl is significantly refactored. I have resolved all comments that I believe are clear. The only open one is https://github.com/JuliaData/DataFrames.jl/pull/3055#discussion_r967794601, but I think what I do there is OK, so it can also be resolved if you agree.