gis-tools-for-hadoop
earthquakes.csv has different schema than sample expects
The “create table earthquakes” instructions given at https://github.com/Esri/gis-tools-for-hadoop/tree/master/samples/point-in-polygon-aggregation-hive no longer match the schema of the data located at https://github.com/Esri/gis-tools-for-hadoop/tree/master/samples/data/earthquake-data
(I’m guessing that the earthquake-data is occasionally pulled from a USGS or similar source, and they changed their column definitions?)
I had to insert an additional column “unknown” of type double in front of the Magnitude column.
For example, the instructions provide the following schema:
(earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, magnitude DOUBLE)
and a random sample line from the file (the unknown column is 80.0 and the magnitude is 6.5):
1930/12/06 07:03:28.00,53.0,-172.0,80.0,6.5,ML,0,,,,AK,
The schema that I used:
(earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, unknown DOUBLE, magnitude DOUBLE)
With the corrected schema, the query returns a plausible magnitude range:
hive> select min(magnitude), max(magnitude) from earthquakes;
OK
5.0 9.1
If magnitude still points to the wrong column, you will see:
hive> select min(magnitude), max(magnitude) from earthquakes;
OK
-5.0 700.0
The version of earthquakes.csv with a header row contains the following header:
datetime,latitude,longitude,depth,magnitude,magtype,nbstations,gap,distance,rms,source,eventid
The DDL (in the README and in run-sample.sql) matches a column-subset variant of the data that we also had. The mismatch can be resolved either by updating the DDL in both files or by uploading the column-subset version of the earthquake data.
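For the first option, an updated DDL might look roughly like the sketch below. It is not the text of the README or of run-sample.sql. The column names after longitude are taken from the header row quoted above, earthquake_date is kept from the existing sample in place of datetime, and the STRING types for the trailing columns and the ROW FORMAT clause are assumptions (several of those fields are empty in the sample line).

```sql
-- Sketch only: column names follow the CSV header quoted above.
-- STRING for the trailing columns is an assumption, chosen because
-- several of those fields are empty in the sample data line.
CREATE TABLE earthquakes (
  earthquake_date STRING,   -- "datetime" in the CSV header
  latitude        DOUBLE,
  longitude       DOUBLE,
  depth           DOUBLE,   -- the column called "unknown" earlier in this issue
  magnitude       DOUBLE,
  magtype         STRING,
  nbstations      STRING,
  gap             STRING,
  distance        STRING,
  rms             STRING,
  source          STRING,
  eventid         STRING
)
-- Assumed storage format: plain comma-delimited text.
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```

If the variant of earthquakes.csv with the header row is loaded, the header line would also need to be skipped, for example with the skip.header.line.count table property on Hive versions that support it.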