Query.jl
Feature Request - Normalize Names
CSV.jl has a normalizenames argument that can be set to true when reading a file. This option replaces invalid identifier characters (such as spaces) with underscores. I think it would be nice to add this kind of feature to Query, but I would take it a step further and remove all leading/trailing whitespace (rather than replacing it with underscores). From the CSV.jl documentation:
"When a column name is not a single atom Julia identifier, this is inconvenient, because f.column one is not valid, so I would have to manually call getproperty(f, Symbol("column one"))"
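To make the inconvenience concrete, here is a minimal sketch using plain named tuples rather than a CSV file. The variable names f and g are only illustrative.

```julia
# A row whose first name contains a space, so it is not a valid identifier.
f = NamedTuple{(Symbol("column one"), :b)}((1, 2))

# Dot access cannot be written for this name; the explicit call is required.
first_value = getproperty(f, Symbol("column one"))

# With a normalized name, ordinary dot access works.
g = NamedTuple{(:column_one, :b)}((1, 2))
same_value = g.column_one
```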
Julia's built-in strip and replace functions should do the job. I'd love to make an attempt to write this myself if you can provide a basic roadmap for me to get started.
Thanks!!
Alright, here is how I think we should do this:
First, we add a function like normalize_names to here. It should simply take a named tuple as an input argument and return a named tuple with the same values in which all the names are normalized. This function will have to be type stable, and to achieve that it will have to be a generated function.
With that function alone, one should already be able to normalize names a la source |> @map(normalize_names(_)) |> ...
And then we simply add a new macro to here that just translates @normalize_names() into @map(normalize_names(_)).
I think you would start with the first part, get that all sorted out, and then in a second step we can add the macro.
If you haven't written a generated function before, let me know, and I'll give more tips on how to do that. You can also take a look at some of the other functions in that file for patterns.
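The two steps in this roadmap can be sketched without Query itself. Everything below is a stand-in under stated assumptions: the replacement rule in normalize_name is a placeholder, and @normalize_names_sketch is a hypothetical macro that expands to a plain map over the rows. It is not Query's @map and not the macro that would eventually be added.

```julia
# Step 1: a per-name rule (placeholder: strip, then spaces to underscores).
normalize_name(x::AbstractString) = replace(strip(x), " " => "_")

# The names are part of the NamedTuple type, so the new names can be computed
# once per type inside a generated function.
@generated function normalize_names(x::NamedTuple{NAMES}) where {NAMES}
    new_names = Tuple(Symbol(normalize_name(string(n))) for n in NAMES)
    return :(NamedTuple{$new_names}(values(x)))
end

# Step 2: a hypothetical macro that expands to "apply normalize_names to every row".
macro normalize_names_sketch()
    return :(rows -> map(normalize_names, rows))
end

rows = [NamedTuple{(Symbol("column one"), :b)}((1, 2.0))]
result = rows |> @normalize_names_sketch()
```

Note that normalize_name is defined before the generated function that calls it; generated functions may only call functions that were defined before them.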
I took a look at the NamedTupleUtilities module and there is quite a bit of syntax that looks foreign to me 😨. I have a lot of researching/learning to do before I'll be able to write a function that can accomplish the task at hand. Also, I checked out the CSV.jl source code and there is quite a bit going on there that I also don't understand (it's much more than simply replacing spaces, they are replacing other kinds of characters as well).
I didn't get far at all before getting completely stuck 😳:
function normalize_names(a::NamedTuple)
    normalize(name) = replace(strip(string(name)), " " => "_")
    names_normalized = normalize.(keys(a))
    return names_normalized
end
mynames = (Symbol(" queryverse rocks"), :b, :c)
myvalues = [1, 2.0, "hello world"]
my_namedtuple = NamedTuple{mynames}(myvalues)
normalize_names(my_namedtuple)
Output:
("queryverse_rocks", "b", "c")
This function works, but it obviously doesn't do anything close to what we want it to : )
I haven't written generated functions before and I actually haven't written any code that deals with NamedTuples. In looking at the select function, for example:
@generated function select(a::NamedTuple{an}, ::Val{bn}) where {an, bn}
    names = ((i for i in an if i == bn)...,)
    types = Tuple{(fieldtype(a, n) for n in names)...}
    vals = Expr[:(getfield(a, $(QuoteNode(n)))) for n in names]
    return :(NamedTuple{$names,$types}(($(vals...),)))
end
I don't really understand what an and bn are (or how they work), and I am also unfamiliar with the comma after the splat operator in ..., and again in ($(vals...),).
My (mis)interpretation of the select function goes something like this:
The a argument that the function accepts is a NamedTuple with the names being stored in the tuple an. I guess that ::Val{bn} is a name provided by the user, but I'm really not sure what it is. I'm also unsure about what the where {an,bn} bit does. From there, I think you are creating a names variable that checks to see if the name(s) bn exist(s) in the NamedTuple a. You then create a variable to store the types of each field and another variable vals for storing an array of the NamedTuple's values.
Lastly, the function returns a NamedTuple of just the names/values that the user wanted, as specified by ::Val{bn}. Unfortunately, it all appears very cryptic to a novice like myself!!
I'll keep researching and tinkering with my code to see how much closer I can get to the desired outcome.
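For readers with the same questions, here is a small sketch of the three pieces of syntax that select relies on. It is not code from the NamedTupleUtilities module; the helper names names_of and unwrap are made up for illustration.

```julia
nt = (a = 1, b = 2.0)

# `where {an}` introduces a type parameter. For a NamedTuple, that first
# parameter is the tuple of field names, so it is available from the type alone.
names_of(::NamedTuple{an}) where {an} = an
field_names = names_of(nt)

# Val lifts a value (here a Symbol) into a type parameter, so a method can
# read it through `where {bn}` in the same way.
unwrap(::Val{bn}) where {bn} = bn
wanted = unwrap(Val(:b))

# `(xs...,)` splats an iterable into a tuple literal. The trailing comma is
# what makes the parentheses build a tuple rather than just group an expression.
picked = ((i for i in field_names if i == wanted)...,)
```

In select, an and bn play the roles of field_names and wanted, and ($(vals...),) uses the same trailing-comma form to build the tuple of field values inside the returned expression.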
Yeah, this stuff is not easy :) The following code might be a useful template:
function normalize_name(x)
    return uppercase(x)
end
@generated function normalize_names(x::NamedTuple{NAMES,TYPES}) where {NAMES, TYPES}
    new_names = (Symbol(normalize_name(string(i))) for i in NAMES)
    return :(NamedTuple{$(tuple(new_names...))}(values(x)))
end
The normalize_names function should be complete as I've written it down, so it should be enough for you to turn the normalize_name function into something that does a useful thing, namely take a string with a name and return a string with the normalized name (instead of just uppercasing everything).
If CSV does a lot more in terms of normalizing the actual name than just replacing spaces, maybe we can just copy the code from there?
I can also handle the macro; that is more boilerplate code that is tricky to get right if you are unfamiliar with it, but should be really easy for me. But if you can write the actual normalization code (and tests and docs etc.) that is already a huge help!
Thanks, David! The below seems to be working fine for me (I copied the code from CSV.jl and merged it with your example above):
using Unicode
# Julia keywords that cannot be used as plain identifiers
const RESERVED = Set(["local", "global", "export", "let",
    "for", "struct", "while", "const", "continue", "import",
    "function", "if", "else", "try", "begin", "break", "catch",
    "return", "using", "baremodule", "macro", "finally",
    "module", "elseif", "end", "quote", "do"])
function normalize_name(name::String)::Symbol
    uname = strip(Unicode.normalize(name))
    # replace every character that is not allowed in an identifier with '_'
    id = Base.isidentifier(uname) ? uname : map(c->Base.is_id_char(c) ? c : '_', uname)
    # prefix '_' if the name is empty, starts with an invalid character, or is a keyword
    cleansed = string((isempty(id) || !Base.is_id_start_char(id[1]) || id in RESERVED) ? "_" : "", id)
    # collapse runs of underscores into a single one
    return Symbol(replace(cleansed, r"(_)\1+"=>"_"))
end
@generated function normalize_names(x::NamedTuple{NAMES,TYPES}) where {NAMES, TYPES}
    new_names = (Symbol(normalize_name(string(i))) for i in NAMES)
    return :(NamedTuple{$(tuple(new_names...))}(values(x)))
end
With this, I can do the following:
mynames = (Symbol(" queryverse rocks"), :b, :c) # note that " queryverse rocks" is not normalized
myvalues = [1, 2.0, "hello world"]
my_namedtuple = NamedTuple{mynames}(myvalues)
normalize_names(my_namedtuple)
Output:
(queryverse_rocks = 1, b = 2.0, c = "hello world") # yay! it's normalized!
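A few edge cases are worth checking beyond spaces. The sketch below repeats the normalize_name definition from above so that it runs on its own, and the sample inputs are made up for illustration. It relies on Base.isidentifier, Base.is_id_char and Base.is_id_start_char, which are unexported internals of Base and could change between Julia versions.

```julia
using Unicode

const RESERVED = Set(["local", "global", "export", "let",
    "for", "struct", "while", "const", "continue", "import",
    "function", "if", "else", "try", "begin", "break", "catch",
    "return", "using", "baremodule", "macro", "finally",
    "module", "elseif", "end", "quote", "do"])

function normalize_name(name::String)::Symbol
    uname = strip(Unicode.normalize(name))
    id = Base.isidentifier(uname) ? uname : map(c -> Base.is_id_char(c) ? c : '_', uname)
    cleansed = string((isempty(id) || !Base.is_id_start_char(id[1]) || id in RESERVED) ? "_" : "", id)
    return Symbol(replace(cleansed, r"(_)\1+" => "_"))
end

keyword_name  = normalize_name("end")      # reserved word: gets a leading underscore
digit_name    = normalize_name("2nd col")  # cannot start with a digit: gets a leading underscore
squeezed_name = normalize_name("a   b")    # the run of underscores collapses to one
empty_name    = normalize_name("")         # empty input becomes a lone underscore
```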
I will attempt to write a test for the above and then get back to you in a couple of days. I can definitely handle writing the documentation so I will also take care of that and get it to you in the next couple of days.
Cool! I think we’ll need tests in QueryOperators, but no docs there. But then we’ll need docs in Query for the macro (and tests there as well).
Here's an attempt at the documentation:
The @normalize_names command
The @normalize_names command has the form source |> @normalize_names(). source can be any source that can be queried. The command will normalize column names by replacing invalid identifier characters with underscores to ensure each column name is a valid Julia identifier.
Example
using Query
names = (Symbol(" queryverse rocks"), Symbol("¡column #2!"), :c)
values = [1, 2.0, "hello world"]
source = NamedTuple{names}(values)
q = source |> @normalize_names() |> collect
println(q)
# output
(queryverse_rocks = 1, _column_2! = 2.0, c = "hello world")
I'm not sure exactly how to go about the unit tests, but here's something that works:
names = (Symbol(" queryverse rocks"), Symbol("¡column #2!"), :c)
values = [1, 2.0, "hello world"]
source = NamedTuple{names}(values)
@test QueryOperators.NamedTupleUtilities.normalize_names(source) == (queryverse_rocks = 1, _column_2! = 2.0, c = "hello world")
@inferred QueryOperators.NamedTupleUtilities.normalize_names(source)