csvkit icon indicating copy to clipboard operation
csvkit copied to clipboard

Normalize (or rename) headers

Open pudo opened this issue 11 years ago • 16 comments

This is specifically with regards to csvsql, where loading a CSV file with Some manually entered - header (TM) will give you a data structure that is really hard to query. But I think having the ability to essentially slugify and transliterate a headers would be useful for other tools like csvclean aswell.

I tried looking into how this could be done, and it looks like parse_column_identifiers (here) could be an appropriate place, but that would require pulling the option through many intermediate functions. Is there a better place?

pudo avatar Apr 07 '15 18:04 pudo

A non-csvkit solution would be to first pipe the CSV file through a little script I wrote called header (https://github.com/jeroenjanssens/data-science-at-the-command-line/blob/master/tools/header). For example:

cat manual.csv | header -r "some,sane,column,names" | csvsql --query "..."

Hope this helps in the meantime.

jeroenjanssens avatar Apr 07 '15 19:04 jeroenjanssens

Other options if you don't have access to additional utilities:

$ { head -1 input.csv | sed -e 's/bad/good/g' ; tail +2 input.csv ; } | csvsql

or

$ { echo 'sane,column,names' ; tail +2 input.csv ; } | csvsql

fitnr avatar Aug 11 '15 19:08 fitnr

I would also like this feature -- specifically for making CSV header rows lowercase when appropriate.

smari avatar Sep 28 '15 15:09 smari

@pudo Want to take a stab at a PR?

jpmckinney avatar Jan 23 '16 06:01 jpmckinney

I'd be in favor of finding a way to make this part of agate, either as a flag to the table constructor or as a slugify method that forks a table. That would make it a trivial addition to any tool that uses the table. Will open a ticket on agate.

onyxfish avatar Feb 05 '16 18:02 onyxfish

@smari , in case you want a solution now [edit: I should have payed attention to when you actually commented here], here's a way to lowercase the column names using header (which, I now see, I also mentioned above):

$ cat iris.csv | head -n 5
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,1.3,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
$ cat iris.csv | header -e "tr '.[:upper:]' '_[:lower:]'" | head -n 5
"sepal_length","sepal_width","petal_length","petal_width","species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,1.3,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"

PS. I'm not saying anything about whether or not this should be a feature of either csvkit or agate, I'm just trying to be helpful :)

jeroenjanssens avatar Feb 05 '16 19:02 jeroenjanssens

Copying over this comment from #525...

In case it helps anyone else, you can lowercase all column names in a Postgres database after you import using this magical incantation from Stack Overflow:

\t on
select 'ALTER TABLE '||'"'||table_name||'"'||' RENAME COLUMN '||'"'||column_name||'"'||' TO ' || lower(column_name)||';' 
from information_schema.columns 
where table_schema = 'public' and lower(column_name) != column_name
\g /tmp/go_to_lower
\i /tmp/go_to_lower

danvk avatar Oct 17 '16 16:10 danvk

I'm very much in favor of this feature, but I'm not sure where it should live since it affects so many things. Maybe it's part of csvformat. Implementation should be pretty straightforward, leveraging agate.utils.slugify and the column_names argument to Table.from_csv.

onyxfish avatar Dec 29 '16 15:12 onyxfish

Waiting on https://github.com/wireservice/agate/issues/660 in case agate API changes.

jpmckinney avatar Jan 28 '17 18:01 jpmckinney

May be able to be inspired from https://github.com/wireservice/csvkit/compare/master...culebron:master

jpmckinney avatar May 21 '18 15:05 jpmckinney

With https://github.com/wireservice/csvkit/compare/master...culebron:master as a guide, this is my implementation:

https://github.com/smnorris/bin/blob/master/shampoo

smnorris avatar Aug 23 '18 21:08 smnorris

Another way to lowercase header:

awk 'NR==1{$0=tolower($0)} 1' input.csv > output.csv

majestique avatar Oct 11 '18 17:10 majestique

Related: wireservice/agate#668

jpmckinney avatar Jul 14 '21 17:07 jpmckinney

A non-csvkit solution would be to first pipe the CSV file through a little script I wrote called header

@jeroenjanssens, I found it in your dsutils repo: https://github.com/jeroenjanssens/dsutils/header

I have list of account IDs in a bash variable that I want to write to CSV:

111111111111
222222222222
333333333333

Now I can use your header command to set the header like this.

printf "$accounts" \
| in2csv --format csv --no-header-row --no-inference \
| header -r account_id

Note that I'm actually using in2csv first to be reasonably sure that the input is a valid, single-column CSV file without a header. in2csv even adds the header "a" here, but I can't find a way to change that value!

So I use header -r to replace the header with the one I want.

Here's what in2csv would output:

a
111111111111
222222222222
333333333333

And here's what header would output:

account_id
111111111111
222222222222
333333333333

It seems like the right place for csvkit to support would be a new option in the in2csv command that works like the header -a and header -r options. Something like --add-header-row and --replace-header-row? These options along with --no-header-row would be mutually exclusive.

iainelder avatar Jan 21 '22 19:01 iainelder

A way to add a header on a pipe output in a single command without external tooling:

csvcut -c '3,10,7' input.csv \
    | ( echo "col1,col2,col3" ; cat -; ) > output.csv

Morikko avatar Nov 16 '23 15:11 Morikko

csvkit 2.0.0 adds csvclean --header-normalize-space as one simple normalization.

Some other potential normalizations implemented here: https://github.com/dannguyen/csvmedkit/blob/main/csvmedkit/utils/csvnorm.py

For inspiration, csvmedkit has a csvheader command: https://github.com/dannguyen/csvmedkit/blob/main/csvmedkit/utils/csvheader.py

jpmckinney avatar May 02 '24 17:05 jpmckinney