DataFrames.jl
using propertynames on GroupedDataFrame
The documentation notes that one can use either names or propertynames to get the variable names of a DataFrame: the first function returns a vector of strings and the second returns a vector of symbols. However, for a GroupedDataFrame, propertynames returns (:parent, :cols, :groups, :idx, :starts, :ends, :ngroups, :keymap, :lazy_lock). I found this confusing. To solve it, I think it would be good to define:
Base.propertynames(df::GroupedDataFrame) = propertynames(parent(df))
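For context, the behaviour described above can be reproduced with a small sketch. The table and its column names g and x are made up for illustration. The exact result of propertynames(gdf) may depend on the DataFrames.jl version, so it is printed rather than asserted.

```julia
using DataFrames

# Hypothetical example table; the column names :g and :x are illustrative only.
df = DataFrame(g = [1, 1, 2], x = [10, 20, 30])
gdf = groupby(df, :g)

names(df)          # column names as a vector of strings
propertynames(df)  # column names as a vector of symbols

names(gdf)         # column names of the parent table, as strings

# On the version discussed in this issue, this returns the internal fields
# of the GroupedDataFrame struct (:parent, :cols, ...), not the column names.
@show propertynames(gdf)
```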
I understand you have an issue about the inconsistency between names and propertynames.
The current design is based on the following reasoning:
- the names function promises nothing beyond returning the names of columns. Therefore it can be defined on GroupedDataFrame.
- propertynames, on the other hand, does make a promise. It promises that if you write gdf.x when :x was returned as a property of the gdf object, you get a meaningful result. The problem is that it is not clear what gdf.x should return if gdf is a GroupedDataFrame having a column :x. We cannot just return parent(gdf).x, as it is not the same thing. For this reason propertynames and getproperty were left untouched from the default.
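The ambiguity can be made concrete with a short sketch. The table is hypothetical, and the two "candidate meanings" below are only illustrations of why gdf.x has no obvious definition. They are not proposals made in this thread.

```julia
using DataFrames

# Hypothetical example table; the column names :g and :x are illustrative only.
df = DataFrame(g = [1, 1, 2], x = [10, 20, 30])
gdf = groupby(df, :g)

# The contract on a DataFrame: every name returned by propertynames(df)
# can be used with property access (df.g, df.x) and yields a column.
for name in propertynames(df)
    col = getproperty(df, name)
    @assert length(col) == nrow(df)
end

# Candidate meaning 1: the whole parent column, ignoring the grouping.
whole_column = parent(gdf).x

# Candidate meaning 2: one vector per group, taken from each sub-data-frame.
per_group = [sdf.x for sdf in gdf]
```

The two candidates have different shapes (one flat vector versus one vector per group), which is the sense in which parent(gdf).x "is not the same".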
However, for the future we could discuss changing it provided that we find some meaningful value that would be returned by gdf.x.
Maybe an alternative would be to have a better suggestion than propertynames for returning the column names as symbols? Does Symbol.(names(df)) always return the right answer? If so, I would suggest using that syntax in the docs instead of propertynames.
You can use propertynames(parent(gdf)). I think this is the preferred pattern, as it falls back to propertynames and makes the context clear: gdf is a view of its parent. Do I understand correctly that you would put this suggestion in the names docstring?
Yes, that would improve things.
That being said, I think this syntax is quite complicated to remember (and, tbh, the name propertynames is a bit weird in the context of DataFrames). Would it be possible to create or suggest an alternative syntax? For instance, could you not recommend that people use Symbol.(names(df)) instead of propertynames? If this gives an incorrect result in some situations (which would be good to flag so that people can avoid bugs), how about mimicking CSV.read by adding an optional argument to the names function, such as names(df, Symbol)?
propertynames is a bit weird in the context of DataFrames
This is not a function introduced by DataFrames.jl. It is a Base function that returns the names of the properties of an object, and column names as symbols are the names of the properties of a DataFrame.
In general, the current design is to use strings as column names, as people found them easier to work with. The use of Symbol is a bit of a legacy (it is indeed a bit faster, but the difference is minimal).
We could add a suggestion to use Symbol.(names(df)).
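As a minimal sketch of the two patterns mentioned in this thread, using a hypothetical table: in this small example both give the same symbols for a GroupedDataFrame. This is an illustration only, not a proof that they agree in every situation.

```julia
using DataFrames

# Hypothetical example table; the column names :g and :x are illustrative only.
df = DataFrame(g = [1, 1, 2], x = [10, 20, 30])
gdf = groupby(df, :g)

# Pattern 1: go through the parent table explicitly.
via_parent = propertynames(parent(gdf))

# Pattern 2: convert the strings returned by names into symbols.
via_names = Symbol.(names(gdf))

# The same conversion also works on a plain DataFrame.
df_symbols = Symbol.(names(df))
```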
Great, I think suggesting Symbol.(names(df)) is the best option.