DataFrames.jl
Metadata on data frame and column level
This PR waits for https://github.com/JuliaData/DataAPI.jl/pull/48.
I have done an initial implementation. Now we need to discuss for which methods metadata propagation should happen. For now I have implemented it for getindex.
I stopped at hcat - if we hcat several data frames, how do you think we should handle metadata? The options are:
- drop all metadata
- use only left table metadata
- merge metadata by overwriting left table metadata with right table metadata
- merge metadata by ignoring right table metadata that is already in left table metadata
Which one do we pick (when we have this decision it will naturally propagate to other cases).
Fixes #2961 #35 #2276
Another option to consider is to use metadata only if there are no conflicts between input data frames (i.e. it's present in one but absent from others, or equal in all data frames that have it). The advantage is that it would be order-independent.
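For concreteness, here is a minimal sketch of how these merge strategies differ on plain Julia Dicts (illustration only, not the PR's implementation):

left = Dict("source" => "survey A", "month" => "January")
right = Dict("month" => "February", "note" => "raw data")
# overwrite left table metadata with right table metadata: right wins on conflicts
merge(left, right) # "month" => "February", plus "source" and "note"
# ignore right table metadata already present in left: left wins on conflicts
merge(right, left) # "month" => "January", plus "source" and "note"
# order-independent variant: keep a key only if the inputs do not conflict on it
Dict(k => v for d in (left, right) for (k, v) in d
     if all(!haskey(o, k) || o[k] == v for o in (left, right)))
# "month" is dropped, "source" and "note" are kept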
FWIW, R's rbind (for both data.frame and tibble) takes variable labels from the first table, even in case of conflict; for some reason, dplyr's bind_rows drops them. Not sure what Stata does. Maybe @pdeffebach knows.
For joining in Stata, the left data frame takes precedence. I think this is the correct default, and we should do it in DataFrames.jl as well. See this gist describing Stata's behavior.
For hcat, since we don't have an option to overwrite column names (unless I'm forgetting), I think it's fair for columns to keep their metadata even if they get renamed to something like dup_column_1.
For joining in Stata, the left data frame takes precedence.
You mean that if left and right table have the same "table level" (not column level) metadata key, then value is kept from left table?
(please keep in mind that we will have two kind of metadata: table level and column level; now we are discussing table level metadata)
Ah. Sorry for the confusion.
Just did some research. It looks like Stata does not have named dataset-level metadata, for example "Date" or "Source". It's just a vector of strings. So Stata doesn't deal with this explicitly. All the notes just get added together.
But I still think having the left one be dominant is the right way to go.
Don't you think it would be confusing or even dangerous if doing hcat(januarydf, februarydf), with inputs having meta-data "month" => "January" and "month" => "February" respectively, gave a data frame with only "month" => "January"? I'd rather have at least some conflict detection, or just drop all metadata when calling hcat for now.
EDIT: joins are different as in leftjoin the first argument has the main role, and conversely for rightjoin; things are less clear for outerjoin and innerjoin.
Good point. But still, left taking precedence seems like a consistent default that will cause fewer headaches for users than something as destructive as getting rid of metadata.
I would also agree that having the left data be dominant makes sense. It's the table for which you're keeping all keys (+ rows) and the joining table is "additional", so it feels like that would make sense to me.
@quinnj You're thinking about leftjoin, right? What about rightjoin?
@nalimilan - can you please have a look at the implementation? If it is OK for you I will go ahead and add:
- manual section on metadata
- tests
- track all places where propagating metadata would make sense
I have implemented both table and column level metadata.
The dict of dicts approach to store per-column metadata can always be improved later if needed.
We should decide on it now. The reason is that breaking internals of DataFrame breaks serialization, so we should not do such changes too often. I chose "dict of dicts" because if only a few columns have metadata it uses the least memory. What alternatives do you see? A vector of dicts or a dict of vectors?
Note: this PR needs to wait till https://github.com/JuliaData/DataFrames.jl/pull/3047 is merged. Then it needs to be rebased (the reason is that in https://github.com/JuliaData/DataFrames.jl/pull/3047 we add methods that have to handle metadata correctly)
We should decide on it now. The reason is that breaking internals of DataFrame breaks serialization, so we should not do such changes too often. I chose "dict of dicts" because if only a few columns have metadata it uses the least memory. What alternatives do you see? A vector of dicts or a dict of vectors?
Yes. More precisely a Dict{String, Vector} holding Vector{Union{Nothing, Some}} objects with one entry per column equal to nothing if no metadata is set for a column. The advantage would be to use less memory and to reduce the number of objects that the GC has to track in the case where all columns share common metadata keys (the main use case I have in mind is column labels), and there are more columns than different metadata keys. But given that the vectors wouldn't be concretely typed I'm not sure the gain would be so large, and you're right that it wouldn't be as efficient if metadata is set only for some columns. It would also be more complex as metadata(df, col) would have to return a custom lazy AbstractDict that would be a view of this structure.
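To make the comparison concrete, a rough sketch of the two layouts (variable names are mine, purely for illustration):

# "dict of dicts" (the layout used in this PR): entries exist only for
# columns that actually carry metadata, so it is cheap when metadata is sparse
colmeta_dict_of_dicts = Dict{Symbol, Dict{String, Any}}(
    :price => Dict{String, Any}("label" => "unit price", "unit" => "USD"))
# alternative: one vector per metadata key, with one slot per column;
# nothing marks columns without a value for that key. Cheaper when most
# columns share the same keys (e.g. labels), but metadata(df, col) would then
# need to return a lazy AbstractDict view over this structure
colmeta_key_to_vectors = Dict{String, Vector{Union{Nothing, Some}}}(
    "label" => [Some("unit price"), nothing, nothing]) # one slot per column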
Ah - now I see we do not need to wait for https://github.com/JuliaData/DataFrames.jl/pull/3047 as I intentionally kept there only functions that do not mutate list of columns. Problematic will be e.g. pushfirst! but I will open a PR for this later.
Looking at the PR right now, is it true that if the column :y has metadata, then
transform(df, :y => f => :y)
will destroy that metadata?
Yes. The idea is that transform(df, :y => ByRow(y -> y^2) => :y) will make metadata such as :unit => "m/s" incorrect.
Only if f were identity or copy would the metadata be retained.
Okay. I guess the equivalent in Stata is replace x = ... if .... And we don't have that feature yet.
I guess the equivalent in Stata is
replace x = ... if .... And we don't have that feature yet.
Could you please elaborate what you mean there? Thank you!
Stata has the command replace which is meant for replacing a variable. So you might say
replace x = missing if x == -999
or even
replace y = y^2
this preserves metadata.
To perform a destructive transformation, which deletes metadata like transform(df, :y => (y -> y^2) => :y), you would create a new variable and rename.
gen y2 = y^2
drop y
rename y2 y
So maybe there is room for a replace function someplace, some time in the future, for the purposes of transformations which preserve metadata (even though you can do replace y = y^2 which messes up units).
@pdeffebach - so essentially you mean that you would like to be able to give a hint if some transformation should retain metadata or not.
By default we will drop it, but I think we could, in the future, add something like source => fun => KeepMeta(:some_col) to hint that metadata in :some_col should be kept if this column is already present in a data frame and has metadata. I would leave such a feature for the future.
At least to replace only some values you would do transform!(view(df, df.x .== -999, :), :y => ByRow(y -> y^2) => :y), for which it would make sense to preserve metadata. Whether it's easy to do is another question... :-D
@nalimilan - could you please review the PR on a high level at this point (not the details of the implementation but the general approach and look&feel). I need to stop working on it and wait for https://github.com/JuliaData/DataFrames.jl/pull/3070, https://github.com/JuliaData/DataFrames.jl/pull/3071, and https://github.com/JuliaData/DataFrames.jl/pull/3072 to be merged as otherwise the number of merge conflicts will be very problematic (I already had problems with correct resolution of merge conflicts in #3070 after your threads PR). Thank you!
The plan should be:
- you do high-level review to make sure that the approach we have is OK
- then we merge #3070, #3071, #3072
- then we merge and release https://github.com/JuliaData/DataAPI.jl/pull/48 (by then we should be sure all is OK)
- then I update this PR, add tests and examples (I cannot add tests and examples before we merge https://github.com/JuliaData/DataAPI.jl/pull/48)
OK?
I want to bring up the point again about transform destroying metadata when it overwrites columns. I think operations like
@rtransform :x = coalesce(:x, 0)
are more common than
@rtransform df :x = :x^2
and I'm worried that constantly having to re-apply metadata would be a big overhead.
destroying metadata when it overwrites columns
I understand your concerns. What exact rules would you propose?
E.g. in this case:
@rtransform df :x = :y :z = :x :y = :z^2
I think we should always keep the existing metadata of the destination, and leave it up to the user to change the metadata. :x = :y: keep metadata. :x = :x^2: also keep metadata.
So we should ignore source metadata unless it is a new column? Is this what Stata does?
I have thought about @pdeffebach's request and I think it would make sense; however, we need to make a decision. We have two possible rule sets:
Rule 1 (simpler)
Rule:
- if in some operation there is a single input column and a single output column then metadata of the source column replaces metadata of the target column (with the exception of unstack - see its docstring)
- if in some operation there are multiple input columns, metadata in the output is dropped (with the exception of vcat and stack - see their docstrings)
Rule 2 (more complex)
Rule:
- if in some operation there is a single input column and a single output column then metadata of the source column replaces metadata of the target column in select[!] and transform[!], but it is dropped in all other operations (like combine and mapcols)
- if in some operation there are multiple input columns, metadata in the output is dropped (with the exception of vcat and stack - see their docstrings)
Under both rules we would need to add a clear explanation to users that if column metadata is invalidated by the transformation they perform (e.g. a unit of measure when we square a column) then they need to manually drop it. I think this is OK, as in this way we "correctly keep metadata in most common cases" and "incorrectly keep it in a few cases", while more conservative rules would mean that we "incorrectly drop metadata in many cases".
Now regarding the choice between rule 1 and rule 2:
- under rule 1 everything is consistent and users have simpler rules to learn, but we more often incorrectly propagate metadata;
- under rule 2 users need to learn more special cases, especially combine vs select/transform, but they get a correct result more often. However, "more often" here is soft - the number of cases where this matters is not that large, e.g. combine(df, :x => sum) would drop :x metadata, while select(df, :x => sum) would keep it, but probably it should drop it too, so things are not clear-cut (see the sketch below).
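A hypothetical sketch of that combine vs select distinction (assuming column-level metadata, e.g. a label, has already been attached to :x):

using DataFrames
df = DataFrame(x = 1:3)
# assume :x carries column-level metadata "label" => "measurement"
transform(df, :x => ByRow(abs) => :y)
# single input column, single output column:
# under both rules :y would inherit the metadata of :x
combine(df, :x => sum => :total)
# single input and single output, but not select/transform:
# rule 1 would keep the metadata of :x on :total, rule 2 would drop it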
I personally would go for rule 1 to have a simpler ruleset.
@nalimilan, @pdeffebach, @matthieugomez, @nilshg, @oxinabox - can you please comment what you think? I would like to close this discussion, implement what we agree on and make a 1.4 release for JuliaCon2022.
To add context to the discussion: we have been discussing adding metadata for years now. There is no single solution that will work in all cases. We need to make a decision and go with it (and I will document that metadata handling might change in the future and will not be considered a breaking change, so we can correct things if needed in future 1.x releases). From a development perspective this PR needs to be made in one shot as it affects the whole package, so I have stalled other development because otherwise it is nearly impossible to correctly make all the changes (I have already rebased this PR once and it took me several days).
I'm in favor of rule 1. I agree that keeping things simpler/more predictable for users will have the most benefit long term. And we can always tweak a case or two if they come up often as confusing/incorrect. Thanks for the detailed write up @bkamins and for all the work on this.
Could we follow rule 1, but with the additional condition that the output column has the same name as the input? You're much more likely to do :x = coalesce(:x, 0) than :x = :x^2, as the latter completely changes the meaning of the values, which would be confusing if the name is preserved.
Anyway I think we all agree that we'd better merge something relatively simple now, and continue discussing and refining the rules later if necessary.
but with the additional condition that the output column has the same name as the input?
Well - the problem is that column renaming like :y = :x would then drop metadata. The rules above were designed to make sure that they cover column renaming (and also, in consequence, transformations like :y = identity(:x) and :y = copy(:x), which I think should also keep metadata).
Of course it is not a problem to add an additional exception for column renaming to the rules, so I am not sure what is better.
Yes, then we would have two rules: one for equality of names, and one for identity/copy/rename operations. Not sure whether that's OK or not; I would need to read the docs carefully again.
Not sure whether that's OK or not
That is OK from the "logical consistency level". Let us wait for users' feedback (especially from @pdeffebach as he has a lot of experience working with metadata).
@nalimilan I have updated the description of metadata propagation rules in docs/src/lib/metadata.md to follow what we have discussed.
Sorry for the delay on this. I'm with @nalimilan (I think). I would say
:x = :y
shouldn't forward metadata, if we are keeping things simple for now. Adding a rule about when an operation is a simple copy etc. probably makes things too complex. I understand the irony of saying this because I am advocating for a more complex rule in having :x = coalesce(:x, 0) propagate.
But yes, overall I think rule 1 is good and makes sense.
I would say
:x = :y shouldn't forward metadata
So what should happen with metadata in this case? Actually there are two cases:
a) :x is already present in the source data frame
b) :x is not present yet in the source data frame
Also, what should then happen if you write rename(df, :y => :x), as this is essentially the same operation?
Our current understanding was that column renaming (copying or non-copying) should take metadata from the source column.
In Stata,
gen y = x
does not forward metadata. So that's where my intuition comes from.
Stata has
replace x = x + 1
which does preserve metadata, and is the motivation for my proposed rule in transform. We have no equivalent to replace, and writing a separate replace function doesn't make a lot of sense when transform can have multiple transformations in a single call, i.e.
@rtransform df begin
    :x = :x + 1 # preserves metadata
    :y = :x^2 # does not preserve metadata
end
Stata also has rename y x (rename y to x) which does preserve metadata.
yes, but in Stata:
gen y = x
strictly generates a new variable AFAICT (I understand that if y were already present it would error - right?)
A side question: what does
replace y = x
do?
But apart from these discussions I do not see a reason why gen y = x would not set metadata for y to be equal to that of x. What is the logic behind such behavior?
Thinking more about it - would the Stata logic be as follows?
- keep metadata of existing columns unchanged;
- new columns have no metadata.
Is this correct? Maybe indeed this is a rule that is good enough? (it is easy to understand and easy to implement)
- keep metadata of existing columns unchanged;
- new columns have no metadata.
Yes. I think this is a clear and consistent rule. Of course, it allows for :x = :x^2. But I think this is a small sacrifice and people can manage that meta-data manually.
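Spelled out on concrete calls (purely illustrative, assuming a label was attached to :x beforehand):

using DataFrames
df = DataFrame(x = [-999, 2, 3])
# assume :x carries column-level metadata "label" => "sales"
transform(df, :x => ByRow(v -> v == -999 ? missing : v) => :x)
# :x already exists, so under this rule its metadata is kept unchanged
# (the common coalesce/replace-style case)
transform(df, :x => ByRow(v -> v^2) => :y)
# :y is a new column, so it starts with no metadata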
@nalimilan (and anyone else willing to) - this PR should be ready for review (or at least friendly testing).
Apart from adding metadata, the new tests caught a rename and rename! bug, and I decided to make completecases and nonunique a bit more flexible in corner cases (so that they support rare situations rather than throw an error).
Thank you!
This PR should be ready for the next review.
What things need to be verified:
- If we feel that the API using hasmetadata/hascolmetadata/metadata/colmetadata is convenient enough; if yes, then the related PR https://github.com/JuliaData/DataAPI.jl/pull/48 should be merged and this PR updated to use the interface defined in DataAPI.jl.
- If we like the metadata propagation rules (we are in agreement with @nalimilan on what the rules should be, and I do not want to change them for the 1.4 release, but maybe some useful comments would be given).
- Some friendly testing of the PR (which would additionally confirm if we like point 1. and point 2. in practice).
Some specific mentions (if you do not have time to go through the whole PR - which is likely - it is enough to read the new material on metadata that is present in the manual):
- @pdeffebach - the propagation rules we implement are not the same as in Stata, but I think our rules are better - of course it is debatable :)
- @krynju - can you please have a look at this ecosystem, as it would be good to consistently support it in DTable (and this is the reason why I am waiting with looking at your changes to the DTable interface - this PR changes almost every part of DataFrames.jl so it would be good to finalize it before we extract out the interface)
- @quinnj - after this PR is merged it would be great to have Arrow.jl support reading/writing the metadata using the defined API, so can you please confirm that you like it?
- @ExpandingMan - after this PR is merged it would be great to have Parquet2.jl support reading/writing the metadata using the defined API, so can you please confirm that you like it?
- @quinnj - additionally, my plan is that after this PR I create a small package that is able to inject table metadata into CSVs in TOML format as comments, taking advantage of https://csv.juliadata.org/v0.5/#Commented-Rows-1. However, maybe you feel that this is functionality that makes sense as part of CSV.jl (if yes, we can discuss it separately in a CSV.jl issue after this PR is merged)
- @oxinabox - your general comments, from the user's perspective, on whether adding metadata to DataFrames.jl in the way proposed would be useful are welcome.
This PR adds metadata support; in a follow-up (probably not in the 1.4 release) we will discuss in #3076 how metadata should be handled when displaying data frames (maybe this should even be moved to PrettyTables.jl for more general support of metadata display for any table). This will be discussed later with @ronisbr.
While this looks about as reasonable as it can be, I remain convinced of the rather unpopular idea that metadata should not be part of DataFrame; I don't see it as fitting nicely with the "ordered named tuple of like-length rank-1 arrays" concept that is so natural for tables.
The ambiguities arising from row-wise operations are a case-in-point. Even operations involving a single column don't have clear semantics for the metadata. For example, with the design of this PR, transformations such as filtering preserve metadata, but there is a huge class of cases in which this is totally inappropriate. Indeed, I immediately hit an example for Parquet2: parquet metadata includes column statistics, none of which are preserved under all the operations listed here. I'm quite uncomfortable with all the other conventions listed here as well, not because I feel they were badly chosen or not well-thought-out, but because they simply don't generalize naturally and it's not at all clear (to me at least) that any of these will turn out to be a good default choice. At best, these conventions will fail spectacularly some of the time.
I remain of the opinion that metadata should not be included in DataFrames.jl but should be implemented in a separate package that depends only on Tables.jl and which is willing to make more strict guarantees about what exactly is permitted.
In the meantime, I'll comment on the compatibility of this PR with parquet. Parquet allows for Dict{String,String} metadata associated with the entire table as well as each column, much as in this PR. Additionally, it allows for some statistics which I currently handle by optionally wrapping the output in a special AbstractVector. All of this I think would be well-accommodated by this PR, but with the aforementioned caveat that literally none of it (I think) would be guaranteed to make any sense after table transformations. Since my AbstractVector types are referred to by a DataFrame if copycols=false, I would probably elect to only copy the arbitrary metadata Dict{String,String} and not include the statistics.
@ExpandingMan - your point is the reason why metadata was postponed for so long. No one is really sure what is best here.
My original opinion was to never keep metadata under any transformation (which would guarantee we do not generate "false positives"). In this approach metadata would be mostly used as a way to attach reference information to a data frame when it is saved/loaded. I guess this would be consistent with your recommendation.
However, under such an approach users complained that a lot of code using metadata would look like:
1. take some data frame
2. perform operation on it
3. copy metadata from source to target
4. repeat steps 2 and 3
so we decided to try defining some metadata propagation rules. We are aware that such rules will produce "false positive" cases, in which the user will have to manually wipe metadata.
I am personally not sure what is best. Also - to be clear - although I have invested a lot of time in this PR, I am totally fine with dropping it (or minimizing its scope) if we decide to do so. However, I would welcome a really serious discussion about what we want, and after that decision is made we should follow it and not go back to it in the future. The point of my effort was to provide a complete implementation of metadata propagation rules that are consistent under the assumption that we want to propagate as much metadata as makes sense, so that we have something concrete to discuss (as opposed to discussing what-if scenarios). In the PR it is clear what the consequences of the requirement to propagate metadata are.
your point is the reason why metadata was postponed for so long. No one is really sure what is best here.
I find that when that sort of thing happens the best thing to do is take the hint and move on.
More broadly, I think what is happening here is that the generalization of "dataframes" (which we by now have defined pretty clearly as ordered named tuples of like-size rank-1 arrays) to "dataframes with metadata" is not only non-unique but there isn't even a "natural" choice of generalization that we could reasonably expect to work for most use cases.
My recommendation would be to, rather than making an attempt at a general implementation that is likely to be plagued with issues, make more specific implementations that are more appropriate for more specific cases. The way I see it, what you are trying to do here can only be adequately achieved by developers creating implementations as needed. The great thing about Julia and DataFrames.jl is that writing such implementations is far easier than it would be with e.g. pandas. Why there are not more existing implementations I can only speculate, but perhaps the best course of action right now is to write a package that implements the overall structure you are proposing in this PR but which depends only on Tables.jl. This would avoid committing DataFrames.jl to such a speculative design and might turn out to be very revealing through the issues that get raised for it.
Anyway, as the tone of the preceding has been quite negative, I want to make sure I say that I hugely appreciate all the work you are doing here. Indeed, the main reason I'm extremely apprehensive about putting something like this in DataFrames.jl is that I consider the package as-is quite pristine, and I am worried about it getting degraded by feature creep.
Also, if I cannot talk you out of committing to this, my next suggestion would be to (at least initially) completely discard metadata when any changes are made unless explicitly set otherwise.
I could be wrong but I think it would be easier to start there and then add persistence later than to start out with a bunch of ad hoc persistence rules and later try to roll them back. (I believe it would be a lot less breaking.)
I don't think @bkamins needs to maintain the "purity" of DataFrames. Usability should take precedence, and as I've argued in this thread above, being over-eager in destroying metadata would make metadata unworkable in practice. Every single transformation would require "forwarding" and it would be a pain to work with.
If you are familiar with Stata's labeling and notes, the workflow in this PR reproduces that workflow in a way that is almost as easy to use as Stata, which is quite a feat.
A Tables.jl-focused re-write of metadata would be problematic:
- Current Tables-agnostic transformations in TableOperations.jl are extremely limited. The DataFrames.jl mini-language for transformations is expansive and useful, but is opinionated. Replacing the DataFrames.jl transform calls with Tables.jl-neutral ones, just for the purposes of propagating metadata, would be a hit to usability.
- It is true that Julia, in theory, allows for many different implementations of DataFrames. But that is not always a good thing. For one, the DataFrames API is very large, and any package that re-implemented it for a concrete type would effectively be rewriting DataFrames. The complexity of DataFrames is the reason we do not have separate implementations. This is a good thing, though. Standardization is valuable. It would be a shame if we had a fractured ecosystem just because some people disagree about a few API opinions about metadata.
The ambiguities arising from row-wise operations are a case-in-point. Even operations involving a single column don't have clear semantics for the metadata.
There are many scenarios with unclear semantics ex-ante. But we cannot limit ourselves only to semantics which are ex-ante clear. Rather, DataFrames.jl gets to document the semantics and standardize them.
FWIW, I just checked out the branch and built the docs. I think the API is great.
Here are two helper functions I quickly wrote to make things a bit easier.
julia> function label!(df, p::Pair)
name = first(p)
lab = last(p)
colmetadata(df, name)["label"] = lab
return df
end;
julia> function labels(df)
n = names(df)
out = DataFrame(Column = n)
labels = map(n) do ni
cdict = colmetadata(df, ni)
get(cdict, "label", missing)
end
out.Label = labels
out
end;
julia> df = DataFrame(sales_1w_end = rand(100));
julia> @chain df begin
@rtransform! :log_sales = log(:sales_1w_end)
label!(:log_sales => "Logarithm of sales")
end;
julia> labels(df)
2×2 DataFrame
Row │ Column Label
│ String String?
─────┼──────────────────────────────────
1 │ sales_1w_end missing
2 │ log_sales Logarithm of sales
I think a helper package to force everyone to agree on label etc. would be helpful. I would be happy to write it.
I have opened a poll on Discourse for this topic at https://discourse.julialang.org/t/dataframes-jl-metadata/84544. Anyone interested is invited to vote. The objective of the poll is to find out what the community in general prefers. I hope it will help us to reach a good decision.
In the meantime, I'll comment on the compatibility of this PR with parquet. Parquet allows for
Dict{String,String} metadata associated with the entire table as well as each column, much as in this PR. Additionally, it allows for some statistics which I currently handle by optionally wrapping the output in a special AbstractVector. All of this I think would be well-accommodated by this PR, but with the aforementioned caveat that literally none of it (I think) would be guaranteed to make any sense after table transformations. Since my AbstractVector types are referred to by a DataFrame if copycols=false, I would probably elect to only copy the arbitrary metadata Dict{String,String} and not include the statistics.
It seems to me that only the former (Dict{String,String}) should be considered as metadata in the sense of this PR (and also in the sense of R or Stata). Statistics on the data should be handled differently (like with a custom vector type as you currently do). Indeed a major reason to support metadata rather than requiring users to handle a dict separately is to be able to propagate it automatically in at least some operations, and taking a subset is the poster child for this. Adopting a definition of metadata that doesn't allow propagating it when subsetting rows/columns would be self-defeating IMO.
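For what it's worth, a minimal sketch of the "statistics live on the vector, not in metadata" approach (hypothetical type, not the actual Parquet2.jl implementation):

# a wrapper that carries per-column statistics alongside the data;
# operations that materialize a new vector naturally drop them
struct StatVector{T, V<:AbstractVector{T}} <: AbstractVector{T}
    data::V
    min::T
    max::T
end
Base.size(v::StatVector) = size(v.data)
Base.getindex(v::StatVector, i::Int) = v.data[i]
v = StatVector([3, 1, 2], 1, 3)
sum(v)  # 6 - behaves like a plain vector
v[2:3]  # indexing returns a plain Vector, so the statistics are gone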
I think it would be good to add to the documentation what happens in groupings (metadata is preserved).
@tp2750
It is documented here https://github.com/JuliaData/DataFrames.jl/pull/3055/files#diff-5ec522580bc313447378c5cab2f8fb37d72aeba71dbb98ca8d3b50a05e2f163aR10
How would you propose to improve it?
Statistics on the data should be handled differently (like with a custom vector type as you currently do). Indeed a major reason to support metadata rather than requiring users to handle a dict separately is to be able to propagate it automatically in at least some operations, and taking a subset is the poster child for this. Adopting a definition of metadata that doesn't allow propagating it when subsetting rows/columns would be self-defeating IMO.
That we are quickly led to the conclusion that statistics cannot be included in metadata seems like a red flag to me. When you exclude both statistics and the kinds of type information that are handled by the AbstractVector container, what you are left with are cases that look suspiciously application-specific. Can we even generate many good examples of use cases for this? (There may have been extensive discussion on this before that I have missed.)
@pdeffebach, actually I realize I more-or-less agree with the second half of this comment. I should not have said an implementation that requires only Tables.jl; that indeed sounds overly optimistic. I am more concerned about putting a highly dubious feature into DataFrames.jl that has to be maintained and that we are stuck with for all eternity; any experiment with metadata that does not lead to a similarly bad result would seem fine to me.
OK - the monster PR is ready for your checking.
The design follows https://github.com/JuliaData/DataAPI.jl/pull/48:
- :none-style metadata (both table and column level) is never propagated except in the DataFrame constructor and copy
- :note-style metadata is propagated following the rules discussed in this PR
@nalimilan - I am finishing applying patches. codecov misses will be gone after DataAPI.jl is released.
Maybe we should say "table-level", "column-level" and ":note-style" (with dashes) everywhere as it makes reading complex expressions like "table-level and column-level :note-style metadata" easier? I'd like a confirmation by a native speaker.
This "-" thing is more complex (I am finishing copyediting of https://www.manning.com/books/julia-for-data-analysis so I had a lot of such discussions). However, I propose to just switch to the style you outlined for improved readability (without going into the details of English grammar, as they sometimes imply "-" should be used and sometimes it should not be used in these cases AFAICT).
@nalimilan tests will fail on Julia 1.0, but let us ignore it. We will soon bump the required Julia version to at least 1.6 before the 1.4 release. But I want to do it in a separate PR.
@nalimilan - I am done with the updates after your review. metadata.jl is significantly refactored. I have resolved all comments that I believe are clear. The only open one is https://github.com/JuliaData/DataFrames.jl/pull/3055#discussion_r967794601, but I think what I do there is OK, so it can also be resolved if you agree.