grass icon indicating copy to clipboard operation
grass copied to clipboard

[Feat] Import CSV files with column headers and correct column data types

Open cmbarton opened this issue 1 year ago • 8 comments

CSV files are the most common and widely readable file format for tabular data. But GRASS has no easy way to import them correctly.

db.in.ogr now imports CSV files with correct column headers (see #943). But it imports all columns as strings, regardless of the data in the columns. Because tables imported this way do not have any numeric columns they cannot be used to generate points or to connect with an existing spatial dataset via a common cat field

v.in.ascii can import CSV files and create vector points. It does correctly recognize data types in CSV columns--including a CAT column. But for unknown reasons it does not import the column headers. Of course, you can create new column headers, but this becomes a PITA with many columns--especially as it requires an arcane format to do so.

A grass extension v.in.csv does some of this. However, it requires a separate Python package, pyproj, to do so. It also does not automatically recognize data types in columns like v.in.ascii does, although you can specify which columns are numeric (integer or real). It also seems to assume that the CSV table only has point coordinates in latlon.

So db.in.ogr needs to recognize column data types, or allow a user to specify them, and v.in.ascii needs to recognize column headers. Ideally, perhaps, would be a combination of these--something that could import a CSV table correctly and optionally create vector points from that table, optionally identify a CAT column, and optionally allow a user to define column data types to override the automatic data types recognition.

Related issue: #943

cmbarton avatar Oct 04 '24 18:10 cmbarton

Did you see https://grass.osgeo.org/grass-stable/manuals/db.in.ogr.html#import-csv-file (using .cvst file for column type recognition)?

neteler avatar Oct 04 '24 21:10 neteler

Yes. I've done that. It is equally a pain, especially with multi-column tables and sometimes does not work correctly. Most other data analytical programs manage to open and read these ASCII text files pretty well. v.in.ascii comes very close and is only missing recognizing column names for creating points. If it had the ability to do this and the option to import the table without creating points, it would cover >95% of the cases.

cmbarton avatar Oct 04 '24 21:10 cmbarton

The OGR CSV driver has a AUTODETECT_TYPE=YES open option

$ printf "id,float_val,my_str\n1,2.5,foo\n" > test.csv
$ ogrinfo test.csv -oo AUTODETECT_TYPE=YES  -al -q

Layer name: test
OGRFeature(test):1
  id (Integer) = 1
  float_val (Real) = 2.5
  my_str (String) = foo

rouault avatar Oct 05 '24 16:10 rouault

Interesting. Could this be a flag in db.in.ogr? To test, is this something that can be entered in the ogr_doo argument?

cmbarton avatar Oct 05 '24 17:10 cmbarton

To test, is this something that can be entered in the ogr_doo argument?

don't know the GRASS side of things, but from https://grass.osgeo.org/grass84/manuals/v.in.ogr.html, I assume this should be gdal_doo=AUTODETECT_TYPE=YES

rouault avatar Oct 05 '24 18:10 rouault

Yes. This works!. It should be the default in db.in.ogr, perhaps with a flag to disable it and use a *.cvst instead.

Too bad that we can't do this with v.in.ogr, since it won't accept cvs files with defined x and y coord. columns.

But if v.in.ascii can just keep the header field, then that would be a good alternative too.

cmbarton avatar Oct 06 '24 02:10 cmbarton

@cmbarton I fully agree with your points. One thing is that the column names are stored in the history of the vector layer. I remember I used that in a Python script to rename the names of the attribute table back to the original names. It was something similar to the code below. An admittedly quick and dirty solution :-/

# Get the history, filter out the column names, split them into a list
a = list(gs.parse_command("v.info", map="AAA", flags="h").keys())[-1].split("|")

# Get a list with the column names of the vector layer imported using v.in.ascii
# and remove 'cat' from the list.
b = list(gs.parse_command("db.columns", table="AAA").keys())[1:]

# Replace the column names with the original names.
for id, name in enumerate(b):
    gs.parse_command("v.db.renamecolumn", map="AAA", column=f"{name},{a[id]}")

ecodiv avatar Oct 06 '24 20:10 ecodiv

Thanks for the workaround .

cmbarton avatar Oct 09 '24 16:10 cmbarton

Splitting this issue into 2 separate feature requests #4593 and #4594 because it deals with 2 different GRASS commands.

cmbarton avatar Oct 25 '24 21:10 cmbarton