StatsModels.jl
Disruptive Variable Names
using Feather
using DataFrames
using StatsModels
df = Feather.read(joinpath(Pkg.dir("Feather"), "test/newdata/iris.feather"),
use_mmap = false)
names(df)
The names are: [Symbol("Sepal.Length"), Symbol("Sepal.Width"), Symbol("Petal.Length"), Symbol("Petal.Width"), :Species]
Variable names containing . are parsed inside a formula as an Expr with ex.head == :(.), which causes errors in:
- dospecials
- Terms
The following fix works:
function dospecials(ex::Expr)
    if ex.head == :(.) return Symbol(ex) end
    ...
end
function Terms(f::Formula)
    ...
    if haslhs
        unshift!(etrms, Any[Symbol(f.lhs)]) # Shouldn't typeof(etrms) == Vector{Symbol}?
    end
    ...
end
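To see why the check on ex.head matters, here is a minimal sketch in plain Julia (no StatsModels needed; it assumes a Julia 1.x session) of how a dotted name is parsed and what Symbol(ex) gives back:

```julia
# A dotted name in quoted code is parsed as a field access, not as a Symbol.
ex = :(Sepal.Length)

is_expr = ex isa Expr               # the parser produced an Expr
is_dot  = ex.head == Symbol(".")    # its head is the field-access operator

# The conversion used in the fix: turn the Expr back into a single Symbol.
name = Symbol(ex)

# The same name built directly, which is how it appears in names(df).
same = name == Symbol("Sepal.Length")
```

Going through Symbol(ex) relies on how Julia prints the expression, so it is only meant for simple a.b names like the ones in this dataset.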
While this fix only covers variable names that contain ., it might be wise to decide against supporting names that cannot be written as a :name literal, such as names containing StatsModels operators (e.g., Symbol("A*B")), Julia operators (e.g., Symbol("A.B")), spaces (e.g., Symbol("A B")), or others such as Symbol(""). As long as this is documented I think it would be fine. The proposed approach would still allow _ as a separator for compound words.
This seems like an issue with Feather rather than StatsModels. DataFrames provides an option to normalize column names to valid Julia identifiers upon reading. Feather should probably have an option like that as well.
I verified that with CSV the default is to keep the variable names as they are in the original file. For example:
This is a var,Value
1,0
will be imported by CSV.read with names [Symbol("This is a var"), :Value]
DataFrames deprecated readtable in favor of using I/O packages directly. Do you remember which function normalized column names? If it is still exported, it could be used and that step simply documented.
It wasn't a separate function but rather an argument to readtable in DataFrames. I'd be surprised if that functionality isn't available in CSV or Feather. cc @quinnj
Neither CSV nor Feather seems to have it. I found it in DataFrames' other/utils.jl.
The function is identifier(s::AbstractString) which isn't exported. It might be worthwhile to add this to the documentation as a line,
names!(df, Symbol.(DataFrames.identifier.(string.(names(df)))))
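Since DataFrames.identifier is not exported, a standalone version may be safer to document. The following is only a rough sketch of the idea in Julia 1.x syntax: normalize_name is a hypothetical helper, not the DataFrames implementation, and its rules (ASCII only, "x" prefix) are assumptions.

```julia
# Hypothetical helper (not the DataFrames implementation): map an arbitrary
# column name to something usable as a plain Julia identifier.
function normalize_name(s::AbstractString)
    # Replace every run of characters outside [A-Za-z0-9_] with one underscore.
    cleaned = replace(strip(s), r"[^A-Za-z0-9_]+" => "_")
    # An identifier cannot be empty or start with a digit, so prefix those.
    if isempty(cleaned) || isdigit(first(cleaned))
        cleaned = "x" * cleaned
    end
    return Symbol(cleaned)
end

raw = [Symbol("Sepal.Length"), Symbol("This is a var"), Symbol("2nd"), :Species]
normalized = normalize_name.(string.(raw))
```

Note that distinct raw names can collapse to the same identifier (Symbol("A.B") and :A_B both become :A_B), so a real implementation would probably need a uniqueness pass as well.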
Even if normalizing names by default is a good idea, it would be nice to support non-standard names if that's possible, as that makes generic code more robust.
I believe that for handling non-normalized names the minimum requirements might be:
- Names can't have leading or trailing white spaces.
- Names must have a minimum character length of one (i.e., Symbol("") is not allowed).
- Operators in a formula must have a leading and trailing space (i.e., @formula(y ~ a & b) rather than @formula(y ~ a&b)).
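The first two requirements are easy to check up front. A minimal sketch, where is_supported_name is a hypothetical name and nothing like it exists in StatsModels as far as I know:

```julia
# Hypothetical check for the first two requirements listed above.
function is_supported_name(name::Symbol)
    s = string(name)
    isempty(s) && return false    # Symbol("") is not allowed
    return s == strip(s)          # no leading or trailing whitespace
end

candidates = [Symbol("Sepal.Length"), Symbol("A B"), Symbol(" A"), Symbol("")]
supported = filter(is_supported_name, candidates)
```

Here Symbol("Sepal.Length") and Symbol("A B") pass, while Symbol(" A") and Symbol("") are rejected. The third requirement is about how the formula is written rather than about the names, so it is not covered by this check.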
Then Terms could potentially be used with colnames::Vector{Symbol} = names(df) in order to make a mapping and operate by column position rather than names.
I believe the requirements are not unreasonable. There might be some caveats, though. For example, a name such as Symbol("ID & Year") is ambiguous for a parser, which would have to match greedily against the column names to parse the terms correctly in the off chance that names(df) == Symbol.(["ID", "Year", "ID & Year"]). The example might indicate a need to fall back on normalizing the names first and then mapping back as an alternative.
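To make the greedy idea concrete, here is a rough sketch that resolves the text of a single term to column positions. resolve_term is a hypothetical function, and the sketch assumes that only the spaced " & " operator needs handling; it is not how Terms currently works.

```julia
# Hypothetical resolver: map the text of one formula term to column positions.
# A full-name match is tried first (the "greedy" step); only if that fails is
# the text split on the spaced interaction operator " & ".
function resolve_term(term::AbstractString, colnames::Vector{Symbol})
    positions = Dict(name => i for (i, name) in enumerate(colnames))
    whole = Symbol(strip(term))
    haskey(positions, whole) && return [positions[whole]]
    parts = Symbol.(strip.(split(term, " & ")))
    missing_cols = filter(p -> !haskey(positions, p), parts)
    isempty(missing_cols) || error("unknown column(s): ", join(missing_cols, ", "))
    return [positions[p] for p in parts]
end

colnames = [:ID, :Year, Symbol("ID & Year")]
greedy = resolve_term("ID & Year", colnames)             # whole-name match
split_terms = resolve_term("ID & Year", colnames[1:2])   # falls back to ID and Year
```

With all three columns present, the whole-name match wins, so the interaction of ID and Year can no longer be expressed. That is the ambiguity that normalizing first and mapping back would avoid.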