Normalize (or rename) headers
This is specifically with regards to csvsql, where loading a CSV file with Some manually entered - header (TM) will give you a data structure that is really hard to query. But I think having the ability to essentially slugify and transliterate a headers would be useful for other tools like csvclean aswell.
I tried looking into how this could be done, and it looks like parse_column_identifiers (here) could be an appropriate place, but that would require pulling the option through many intermediate functions. Is there a better place?
A non-csvkit solution would be to first pipe the CSV file through a little script I wrote called header (https://github.com/jeroenjanssens/data-science-at-the-command-line/blob/master/tools/header). For example:
cat manual.csv | header -r "some,sane,column,names" | csvsql --query "..."
Hope this helps in the meantime.
Other options if you don't have access to additional utilities:
$ { head -1 input.csv | sed -e 's/bad/good/g' ; tail +2 input.csv ; } | csvsql
or
$ { echo 'sane,column,names' ; tail +2 input.csv ; } | csvsql
I would also like this feature -- specifically for making CSV header rows lowercase when appropriate.
@pudo Want to take a stab at a PR?
I'd be in favor of finding a way to make this part of agate, either as a flag to the table constructor or as a slugify method that forks a table. That would make it a trivial addition to any tool that uses the table. Will open a ticket on agate.
@smari , in case you want a solution now [edit: I should have payed attention to when you actually commented here], here's a way to lowercase the column names using header (which, I now see, I also mentioned above):
$ cat iris.csv | head -n 5
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,1.3,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
$ cat iris.csv | header -e "tr '.[:upper:]' '_[:lower:]'" | head -n 5
"sepal_length","sepal_width","petal_length","petal_width","species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,1.3,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
PS. I'm not saying anything about whether or not this should be a feature of either csvkit or agate, I'm just trying to be helpful :)
Copying over this comment from #525...
In case it helps anyone else, you can lowercase all column names in a Postgres database after you import using this magical incantation from Stack Overflow:
\t on
select 'ALTER TABLE '||'"'||table_name||'"'||' RENAME COLUMN '||'"'||column_name||'"'||' TO ' || lower(column_name)||';'
from information_schema.columns
where table_schema = 'public' and lower(column_name) != column_name
\g /tmp/go_to_lower
\i /tmp/go_to_lower
I'm very much in favor of this feature, but I'm not sure where it should live since it affects so many things. Maybe it's part of csvformat. Implementation should be pretty straightforward, leveraging agate.utils.slugify and the column_names argument to Table.from_csv.
Waiting on https://github.com/wireservice/agate/issues/660 in case agate API changes.
May be able to be inspired from https://github.com/wireservice/csvkit/compare/master...culebron:master
With https://github.com/wireservice/csvkit/compare/master...culebron:master as a guide, this is my implementation:
https://github.com/smnorris/bin/blob/master/shampoo
Another way to lowercase header:
awk 'NR==1{$0=tolower($0)} 1' input.csv > output.csv
Related: wireservice/agate#668
A non-csvkit solution would be to first pipe the CSV file through a little script I wrote called header
@jeroenjanssens, I found it in your dsutils repo: https://github.com/jeroenjanssens/dsutils/header
I have list of account IDs in a bash variable that I want to write to CSV:
111111111111
222222222222
333333333333
Now I can use your header command to set the header like this.
printf "$accounts" \
| in2csv --format csv --no-header-row --no-inference \
| header -r account_id
Note that I'm actually using in2csv first to be reasonably sure that the input is a valid, single-column CSV file without a header. in2csv even adds the header "a" here, but I can't find a way to change that value!
So I use header -r to replace the header with the one I want.
Here's what in2csv would output:
a
111111111111
222222222222
333333333333
And here's what header would output:
account_id
111111111111
222222222222
333333333333
It seems like the right place for csvkit to support would be a new option in the in2csv command that works like the header -a and header -r options. Something like --add-header-row and --replace-header-row? These options along with --no-header-row would be mutually exclusive.
A way to add a header on a pipe output in a single command without external tooling:
csvcut -c '3,10,7' input.csv \
| ( echo "col1,col2,col3" ; cat -; ) > output.csv
csvkit 2.0.0 adds csvclean --header-normalize-space as one simple normalization.
Some other potential normalizations implemented here: https://github.com/dannguyen/csvmedkit/blob/main/csvmedkit/utils/csvnorm.py
For inspiration, csvmedkit has a csvheader command: https://github.com/dannguyen/csvmedkit/blob/main/csvmedkit/utils/csvheader.py