gis-tools-for-hadoop

earthquakes.csv has different schema than sample expects

Open ddkaiser opened this issue 9 years ago • 2 comments

The “create table earthquakes” instructions given at: https://github.com/Esri/gis-tools-for-hadoop/tree/master/samples/point-in-polygon-aggregation-hive no longer align with the schema of the data located at: https://github.com/Esri/gis-tools-for-hadoop/tree/master/samples/data/earthquake-data

(I’m guessing that the earthquake-data is occasionally pulled from a USGS or similar source, and they changed their column definitions?)

I had to insert an additional column “unknown” of type double in front of the Magnitude column.

For example, the instructions provide the following schema:

(earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, magnitude DOUBLE)

and a random sample line from the file (the unknown column is 80.0 and the magnitude is 6.5):

1930/12/06 07:03:28.00,53.0,-172.0,80.0,6.5,ML,0,,,,AK,

The schema that I used:

(earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, unknown DOUBLE, magnitude DOUBLE)

With the corrected schema, this works:

hive> select min(magnitude), max(magnitude) from earthquakes;
OK
5.0 9.1

If magnitude still points to the wrong column, you will see:

hive> select min(magnitude), max(magnitude) from earthquakes;
OK
-5.0    700.0
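
A narrower spot check is to look up the sample line quoted above and see which value lands in magnitude. This is only a sketch: it assumes the table is named earthquakes as in the sample, and that the quoted line is present in the copy of the file that was loaded.

```sql
-- Look up the sample row quoted earlier in this issue.
-- With the five-column schema, magnitude should read 6.5.
-- With the original four-column schema, magnitude reads 80.0,
-- because the fourth CSV field is mapped onto it.
SELECT earthquake_date, latitude, longitude, magnitude
FROM earthquakes
WHERE earthquake_date = '1930/12/06 07:03:28.00';
```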

ddkaiser, Mar 05 '15 03:03

The version of earthquakes.csv that includes a header row contains the following header:

datetime,latitude,longitude,depth,magnitude,magtype,nbstations,gap,distance,rms,source,eventid

randallwhitman, Mar 05 '15 16:03

The DDL (in the README and in run-sample.sql) matches a column-subset variant of the data that we also had. The mismatch can be resolved either by updating the DDL in both files or by uploading the column-subset version of the earthquake data.
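
For the first option, an updated DDL might look roughly like the sketch below. The column order follows the header row quoted above. The column types after magnitude, the ROW FORMAT and STORED AS clauses, and the choice to keep the name earthquake_date for the datetime column are assumptions made for illustration, not the text currently in the README or run-sample.sql.

```sql
-- Sketch only: twelve columns, in the order given by the CSV header.
-- The trailing columns are declared STRING as a conservative assumption,
-- since several of them are empty in the sample line.
CREATE TABLE earthquakes (
  earthquake_date STRING,   -- header name: datetime
  latitude        DOUBLE,
  longitude       DOUBLE,
  depth           DOUBLE,   -- the "unknown" column from the report above
  magnitude       DOUBLE,
  magtype         STRING,
  nbstations      STRING,
  gap             STRING,
  distance        STRING,
  rms             STRING,
  source          STRING,
  eventid         STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```

If the copy of the file with the header row is loaded, the header would be read as a data row unless it is skipped. In Hive 0.13 and later, adding TBLPROPERTIES ("skip.header.line.count"="1") to the statement is one way to do that.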

randallwhitman, Mar 05 '15 17:03