gis-tools-for-hadoop
Tutorial not working
Hi guys,
I am trying to run your tutorials, but I don't know what's wrong. First of all, I am new to this, a real beginner with things like Hadoop.
I use this, exactly as it is written there:
Now the samples ... I tried them, but when I type the last query (select, join, group, order ...) I get this:
Warning: Map Join MAPJOIN[18][bigTable=earthquakes] in task 'Stage-2:MAPRED' is a cross product
15/05/03 00:48:43 WARN conf.Configuration: file:/tmp/root/hive_2015-05-03_00-48-28_169_4942355857717156539-1/-local-10008/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
15/05/03 00:48:43 WARN conf.Configuration: file:/tmp/root/hive_2015-05-03_00-48-28_169_4942355857717156539-1/-local-10008/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
so I don't have any results ...
Another sample is Feature Class to HDFS ( ... I am at steps 6-7 ... I should type that DROP TABLE in Cygwin, right? I did, and got the result OK, then I typed describe formatted input_ex and nothing happened ... What's wrong with that? I am going step by step like a child ... The problem could be between the keyboard and the chair (me), but I am doing everything according to your tutorial ...
EDIT: Feature Class to HDFS is working now ... I just forgot to type ; at the end of describe formatted input_ex ... I didn't know that it should be there, and in the tutorial this ; is missing
Thanks for advice :)
Hi @tikos,
You will always get the warning:
Warning: Map Join MAPJOIN[18][bigTable=earthquakes] in task 'Stage-2:MAPRED' is a cross product
15/05/03 00:48:43 WARN conf.Configuration: file:/tmp/root/hive_2015-05-03_00-48-28_169_4942355857717156539-1/-local-10008/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
15/05/03 00:48:43 WARN conf.Configuration: file:/tmp/root/hive_2015-05-03_00-48-28_169_4942355857717156539-1/-local-10008/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
when you run:
SELECT counties.name, count(*) cnt FROM counties
JOIN earthquakes
WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude))
GROUP BY counties.name
ORDER BY cnt desc;
It's OK that the warning occurs; you just need to wait a bit longer to see the results. You'll know the process is done when you see the hive> prompt again, like the following:
It may be a bit easier to tell that the process is still running if, for the first command in the sample, you use hive instead of hive -S.
Thank you for the correction to the Feature Class to HDFS - the tutorial has been updated.
Hi @smambrose,
I didn't know that, and I didn't see hive> ... I will try it again without using the hive -S command.
Anyway, I was trying the Aggregation sample (taxi demo), and when I ran the aggregation (step 9) it ran for a looooooooooooooong time and the progress (map and %) looked like this:
0%, after 30 minutes 0%, after another 30 minutes 89%, and the same 3 times over (Map only); Reduce stayed at 0% ... Is this OK? I had to turn off the laptop I was using for it, so I don't know ...
But then I tried again, skipped step 9, ran steps 10 and 11, and got this:
Could not find job application_1430678691120_0002. The job might not be running yet.
Job job_1430678691120_0002 could not be found: {"RemoteException":{"exception":"NotFoundException","message":"java.lang.Exception: job, job_1430678691120_0002, is not found","javaClassName":"org.apache.hadoop.yarn.webapp.NotFoundException"}} (error 404)
Thanks (sorry if this is so easy for you guys, I am new ... BTW: can I use netCDF, HDF, or GRIB formats in these tools, or are they just for CSV, JSON ...?)
I haven't tried running the taxi aggregation sample on a Sandbox yet - but will try to in the next day or so. On a 16 node cluster I was able to run step 9 in ~17 seconds.
You might want to try changing the value 0.001 to .1, which might run a bit faster, so that step 9 would be:
FROM (SELECT ST_Bin(.1, ST_Point(dropoff_longitude,dropoff_latitude)) bin_id, * FROM taxi_demo) bins
SELECT ST_BinEnvelope(.1, bin_id) shape,
COUNT(*) count
GROUP BY bin_id
limit 1;
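For anyone wondering why the first argument to ST_Bin matters: a larger bin size groups the points into fewer, bigger cells, so GROUP BY bin_id has fewer groups to build. Below is a minimal Python sketch of that idea only. It is not how ST_Bin computes its IDs (the hypothetical bin_id here is just a (column, row) pair), and the four sample coordinates are made up for illustration.

```python
import math
from collections import Counter


def bin_id(lon, lat, size):
    """Hypothetical cell key: the (column, row) of the square cell of width `size`."""
    return (math.floor(lon / size), math.floor(lat / size))


def count_by_bin(points, size):
    """Count how many points fall into each cell, like COUNT(*) ... GROUP BY bin_id."""
    return Counter(bin_id(lon, lat, size) for lon, lat in points)


# Made-up drop-off coordinates (longitude, latitude), for illustration only.
points = [
    (-73.9851, 40.7589),
    (-73.9857, 40.7484),
    (-73.9712, 40.7831),
    (-74.0445, 40.6892),
]

fine = count_by_bin(points, 0.001)   # small cells: many groups
coarse = count_by_bin(points, 0.1)   # large cells: few groups

print(len(fine), "cells at 0.001;", len(coarse), "cells at 0.1")
```

With these made-up points, a size of 0.001 gives four cells and a size of 0.1 gives two. Every row still has to be read either way, so a larger bin size may only help the grouping step.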
I would make sure the sample runs before doing the aggregation though. I wasn't able to reproduce the error you received after step 10/11. What was returned after step 10?
Currently, gis-tools-for-hadoop is for vector data (points, lines and polygons), not raster.
A 16-node cluster, 17 seconds, nice ...
I changed those values and got an error; nothing was returned ...
About that vector/raster thing ... Are you considering making it available for rasters too? If I am not wrong, Hadoop (or some extension) can work with netCDF, HDF, or plain raster data?
Were you able to confirm that the previous steps 1-8 worked (the jars were added without errors, you were able to describe the taxi_demo table, the taxi data loaded)? What error did you receive when running step 9?
You are correct that those formats can be read by Hadoop (and many are splittable). We aren't currently working on a raster solution for gis-tools-for-hadoop, although we have not ruled it out. By searching for something like 'HDF format hadoop input format' you should be able to find what you are looking for.
@smambrose I'm seeing the same issue as @tikos. When I run the final 'Select' I see the same warnings that you mentioned earlier, but then I see the hive> prompt and no results.
Steps 1-8 worked well ;) I don't know yet whether step 9 gave an error message, because I had to turn it off ... but I will try it again ... (it's a really long process; it was running for more than 1 hour and was still at 89% Map, then back at 30%) ... I will try it and post the results ...
Hadoop, HDF and stuff ... It would be great to have that implemented in these tools ...
@jmirmelstein and @tikos are you both using the Hortonworks sandbox when you see the original error without results for the sample? What version are you on?
Warning: Map Join MAPJOIN[18][bigTable=earthquakes] in task 'Stage-2:MAPRED' is a cross product
15/05/03 00:48:43 WARN conf.Configuration: file:/tmp/root/hive_2015-05-03_00-48-28_169_4942355857717156539-1/-local-10008/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
15/05/03 00:48:43 WARN conf.Configuration: file:/tmp/root/hive_2015-05-03_00-48-28_169_4942355857717156539-1/-local-10008/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
I'm not currently able to reproduce it. Are there any other details if you start Hive with hive, without including the -S?
I am using version 2.1, since the tutorial says that you had some issues with version 2.2. I am using everything exactly as in the tutorials, just in case ... I will try it again as soon as my VM starts the Hortonworks Sandbox ... It's really slow (I have a 5-year-old laptop) and I am using it right now, so ...
So ... I tried it (the earthquake aggregation sample, not the taxi one) and it finished. I get hive> at the end, but not the results (name and count of earthquakes). I am attaching the whole console output from beginning to end:
[root@sandbox ~]# hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
hive> add jar
> ${env:HOME}/esri-git/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar
> ${env:HOME}/esri-git/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar;
Added /root/esri-git/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar to class path
Added resource: /root/esri-git/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar
Added /root/esri-git/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar to class path
Added resource: /root/esri-git/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar
hive> create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
Time taken: 4.993 seconds
hive> create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
Time taken: 0.892 seconds
hive> CREATE EXTERNAL TABLE IF NOT EXISTS earthquakes (earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, depth DOUBLE, magnitude DOUBLE,
> magtype string, mbstations string, gap string, distance string, rms string, source string, eventid string)
> LOCATION '${env:HOME}/esri-git/gis-tools-for-hadoop/samples/data/earthquake-data';
Time taken: 2.926 seconds
hive> CREATE EXTERNAL TABLE IF NOT EXISTS counties (Area string, Perimeter string, State string, County string, Name string, BoundaryShape binary)
> ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
> STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
> LOCATION '${env:HOME}/esri-git/gis-tools-for-hadoop/samples/data/counties-data';
Time taken: 0.865 seconds
hive> SELECT counties.name, count(*) cnt FROM counties
> JOIN earthquakes
> WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude))
> GROUP BY counties.name
> ORDER BY cnt desc;
Warning: Map Join MAPJOIN[18][bigTable=earthquakes] in task 'Stage-2:MAPRED' is a cross product
Query ID = root_20150504141212_c9bdd531-5033-4b02-8447-69e26ed76b3a
Total jobs = 2
15/05/04 14:12:56 WARN conf.Configuration: file:/tmp/root/hive_2015-05-04_14-12-17_001_4646085809202048677-1/-local-10008/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
15/05/04 14:12:56 WARN conf.Configuration: file:/tmp/root/hive_2015-05-04_14-12-17_001_4646085809202048677-1/-local-10008/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
Execution log at: /tmp/root/root_20150504141212_c9bdd531-5033-4b02-8447-69e26ed76b3a.log
2015-05-04 02:13:01 Starting to launch local task to process map join; maximum memory = 260177920
2015-05-04 02:13:05 Dump the side-table into file: file:/tmp/root/hive_2015-05-04_14-12-17_001_4646085809202048677-1/-local-10005/HashTable-Stage-2/MapJoin-mapfile00--.hashtable
2015-05-04 02:13:05 Uploaded 1 File to: file:/tmp/root/hive_2015-05-04_14-12-17_001_4646085809202048677-1/-local-10005/HashTable-Stage-2/MapJoin-mapfile00--.hashtable (260 bytes)
2015-05-04 02:13:05 End of local task; Time Taken: 4.91 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1430772586818_0001, Tracking URL =
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1430772586818_0001
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2015-05-04 14:14:06,510 Stage-2 map = 0%, reduce = 0%
2015-05-04 14:15:06,904 Stage-2 map = 0%, reduce = 0%, Cumulative CPU 2.69 sec
2015-05-04 14:15:10,564 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 3.64 sec
2015-05-04 14:16:10,793 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 3.64 sec
2015-05-04 14:16:52,976 Stage-2 map = 100%, reduce = 67%, Cumulative CPU 4.9 sec
2015-05-04 14:16:58,387 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 7.74 sec
MapReduce Total cumulative CPU time: 7 seconds 740 msec
Ended Job = job_1430772586818_0001
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1430772586818_0002, Tracking URL =
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1430772586818_0002
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2015-05-04 14:18:22,943 Stage-3 map = 0%, reduce = 0%
2015-05-04 14:19:20,051 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.37 sec
2015-05-04 14:20:20,055 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.37 sec
2015-05-04 14:21:13,236 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 4.62 sec
MapReduce Total cumulative CPU time: 4 seconds 620 msec
Ended Job = job_1430772586818_0002
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 7.74 sec HDFS Read: 276 HDFS Write: 96 SUCCESS
Job 1: Map: 1 Reduce: 1 Cumulative CPU: 4.62 sec HDFS Read: 473 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 360 msec
Time taken: 538.985 seconds
Thanks @tikos.
What do you get when you type:
select * from counties limit 1;
select * from earthquakes limit 1;
I get this
hive> select * from counties limit 1;
Time taken: 0.826 seconds
hive> select * from earthquakes limit1;
Time taken: 0.818 seconds
It's nice to know that it is OK ... but I want to see some results :D
Hi @tikos,
It looks like the tables are empty. You can try running drop table earthquakes; drop table counties; exit;
When you exit Hive, make sure you are in the esri-git directory.
You can then re-run the sample (without using -S on hive). We'll continue to troubleshoot, but will probably not have anything today. Please keep us updated on your progress and on any informative error messages.
Hi @smambrose ,
I tried it, and I get the same error ... I only get the OK result
@smambrose - this is what I see:
hive> drop table counties;
FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class com.esri.hadoop.hive.serde.JsonSerde not found)
hive> [root@sandbox esri-git]#
@tikos @jmirmelstein Thanks for all your help, we think we have it figured out. We will update the sample - but for now this should get it working.
In Cygwin, you'll want to be in your esri-git directory ([root@sandbox esri-git]#) and complete the following:
# make an earthquake-demo directory in HDFS
hadoop fs -mkdir earthquake-demo
# usage: hadoop fs -put /path/on/localsystem /path/to/hdfs
hadoop fs -put gis-tools-for-hadoop/samples/data/counties-data earthquake-demo
hadoop fs -put gis-tools-for-hadoop/samples/data/earthquake-data earthquake-demo
# check that it worked:
hadoop fs -ls earthquake-demo
Start up Hive and add the jars and functions:
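add jar
${env:HOME}/esri-git/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar
${env:HOME}/esri-git/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar;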
create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
Drop existing tables and create new empty ones:
drop table earthquakes;
CREATE TABLE earthquakes (earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, depth DOUBLE, magnitude DOUBLE,
magtype string, mbstations string, gap string, distance string, rms string, source string, eventid string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
drop table counties;
CREATE TABLE counties (Area string, Perimeter string, State string, County string, Name string, BoundaryShape binary)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
Load data into the tables:
LOAD DATA INPATH 'earthquake-demo/earthquake-data/earthquakes.csv' OVERWRITE INTO TABLE earthquakes;
LOAD DATA INPATH 'earthquake-demo/counties-data/california-counties.json' OVERWRITE INTO TABLE counties;
You should now be able to complete the analysis in the sample.
Hi ... I have a problem again ... in the last step, DATA INPATH, I got this:
hive> DATA INPATH 'earthquake-demo/earthquake-data/earthquakes.csv' OVERWRITE INTO TABLE earthquakes;
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(
at org.apache.hadoop.hive.ql.Driver.compile(
at org.apache.hadoop.hive.ql.Driver.compile(
at org.apache.hadoop.hive.ql.Driver.compileInternal(
at org.apache.hadoop.hive.ql.Driver.runInternal(
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(
at org.apache.hadoop.hive.cli.CliDriver.processCmd(
at org.apache.hadoop.hive.cli.CliDriver.processLine(
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(
at org.apache.hadoop.hive.cli.CliDriver.main(
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.hadoop.util.RunJar.main(
FAILED: ParseException line 1:0 cannot recognize input near 'DATA' 'INPATH' ''earthquake-demo/earthquake-data/earthquakes.csv''
> DATA INPATH 'earthquake-demo/california-counties.json' OVERWRITE INTO TABLE counties;
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(
at org.apache.hadoop.hive.ql.Driver.compile(
at org.apache.hadoop.hive.ql.Driver.compile(
at org.apache.hadoop.hive.ql.Driver.compileInternal(
at org.apache.hadoop.hive.ql.Driver.runInternal(
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(
at org.apache.hadoop.hive.cli.CliDriver.processCmd(
at org.apache.hadoop.hive.cli.CliDriver.processLine(
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(
at org.apache.hadoop.hive.cli.CliDriver.main(
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.hadoop.util.RunJar.main(
FAILED: ParseException line 1:0 cannot recognize input near 'DATA' 'INPATH' ''earthquake-demo/california-counties.json''
I had to change the paths to the .csv and .json files because they were different. The steps before this are OK.
Can you type this in (after exiting Hive)?
hadoop fs -ls earthquake-demo
What is the output?
Did you mean to get rid of /counties-data/ in the command LOAD DATA INPATH 'earthquake-demo/counties-data/california-counties.json' OVERWRITE INTO TABLE counties; ? Is that what you meant by changing the path?
Yes, I can ... I did it before, so here is the result:
[root@sandbox esri-git]# hadoop fs -ls earthquake-demo
Found 2 items
-rw-r--r-- 1 root root 1028330 2015-05-07 10:19 earthquake-demo/california-counties.json
drwxr-xr-x - root root 0 2015-05-07 10:20 earthquake-demo/earthquake-data
And yes, I made that change.
BTW: the taxi demo sample ... I am at step 9 again ... I changed the value from 0.01 to 1 to make it faster, BUT ... it is still slow ... or is that OK? I am just asking because I don't know ... Isn't it weird?
hive> FROM (SELECT ST_Bin(1, ST_Point(dropoff_longitude,dropoff_latitude)) bin_id, *FROM taxi_demo) bins
> SELECT ST_BinEnvelope(1, bin_id) shape,
> COUNT(*) count
> GROUP BY bin_id;
Query ID = root_20150507120909_e000001c-4259-48dd-8e98-3684d0e94566
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1431013082714_0001, Tracking URL =
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1431013082714_0001
Hadoop job information for Stage-1: number of mappers: 9; number of reducers: 3
2015-05-07 12:11:05,397 Stage-1 map = 0%, reduce = 0%
2015-05-07 12:12:09,782 Stage-1 map = 0%, reduce = 0%
2015-05-07 12:13:25,554 Stage-1 map = 0%, reduce = 0%
2015-05-07 12:14:37,769 Stage-1 map = 0%, reduce = 0%
2015-05-07 12:15:38,096 Stage-1 map = 0%, reduce = 0%
2015-05-07 12:16:42,504 Stage-1 map = 0%, reduce = 0%
2015-05-07 12:17:25,371 Stage-1 map = 89%, reduce = 0%
2015-05-07 12:18:11,890 Stage-1 map = 0%, reduce = 0%
2015-05-07 12:19:12,073 Stage-1 map = 0%, reduce = 0%
2015-05-07 12:20:12,697 Stage-1 map = 0%, reduce = 0%
2015-05-07 12:21:26,323 Stage-1 map = 0%, reduce = 0%
2015-05-07 12:22:26,650 Stage-1 map = 0%, reduce = 0%
2015-05-07 12:23:33,421 Stage-1 map = 11%, reduce = 0%
2015-05-07 12:23:35,272 Stage-1 map = 56%, reduce = 0%
2015-05-07 12:24:09,535 Stage-1 map = 89%, reduce = 0%
2015-05-07 12:24:41,054 Stage-1 map = 67%, reduce = 0%
2015-05-07 12:25:28,244 Stage-1 map = 44%, reduce = 0%
2015-05-07 12:26:22,278 Stage-1 map = 0%, reduce = 0%
2015-05-07 12:28:36,400 Stage-1 map = 0%, reduce = 0%
2015-05-07 12:29:46,988 Stage-1 map = 0%, reduce = 0%
2015-05-07 12:30:47,851 Stage-1 map = 0%, reduce = 0%
2015-05-07 12:31:48,892 Stage-1 map = 0%, reduce = 0%
2015-05-07 12:33:13,617 Stage-1 map = 0%, reduce = 0%
2015-05-07 12:34:14,299 Stage-1 map = 0%, reduce = 0%
EDIT: I guess my last code doesn't work for me, because it has now ended with this:
2015-05-07 12:56:44,267 Stage-1 map = 0%, reduce = 0%
2015-05-07 13:07:46,126 Stage-1 map = 89%, reduce = 0% Job status not available
at org.apache.hadoop.mapreduce.Job.updateStatus(
at org.apache.hadoop.mapreduce.Job.getStatus(
at org.apache.hadoop.mapred.JobClient.getJob(
at org.apache.hadoop.hive.ql.exec.Task.executeTask(
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(
at org.apache.hadoop.hive.ql.Driver.launchTask(
at org.apache.hadoop.hive.ql.Driver.execute(
at org.apache.hadoop.hive.ql.Driver.runInternal(
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(
at org.apache.hadoop.hive.cli.CliDriver.processCmd(
at org.apache.hadoop.hive.cli.CliDriver.processLine(
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(
at org.apache.hadoop.hive.cli.CliDriver.main(
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.hadoop.util.RunJar.main(
Ended Job = job_1431013082714_0001 with exception ' status not available )'
FAILED: Execution Error, return code 1 from
@smambrose Please, could you help me a little bit with this?
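A side note on reading progress output like the log above: map progress dropping from 89% back to 0% usually means work is being redone (for example, task attempts that failed and were retried), not just that the job is slow. That interpretation is an assumption, not something confirmed in this thread. Here is a minimal Python sketch for spotting such drops in a pasted log; the function name map_regressions is made up for this example.

```python
import re

# Matches the "map = N%, reduce = M%" part of a Hive progress line.
PROGRESS = re.compile(r"map = (\d+)%,\s+reduce = (\d+)%")


def map_regressions(log_lines):
    """Return (previous, current) pairs where the map percentage went down."""
    drops = []
    previous = None
    for line in log_lines:
        match = PROGRESS.search(line)
        if match is None:
            continue  # not a progress line
        current = int(match.group(1))
        if previous is not None and current < previous:
            drops.append((previous, current))
        previous = current
    return drops


# A few lines copied from the console output in this thread.
log = [
    "2015-05-07 12:16:42,504 Stage-1 map = 0%, reduce = 0%",
    "2015-05-07 12:17:25,371 Stage-1 map = 89%, reduce = 0%",
    "2015-05-07 12:18:11,890 Stage-1 map = 0%, reduce = 0%",
    "2015-05-07 12:24:09,535 Stage-1 map = 89%, reduce = 0%",
    "2015-05-07 12:24:41,054 Stage-1 map = 67%, reduce = 0%",
]

for before, after in map_regressions(log):
    print("map progress fell from", before, "to", after)
```

If drops like these show up, the execution log and the per-task logs are a better place to look for the cause than the progress lines themselves.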
I'm going to create a new issue for the aggregation problems.
As for the data inpath error - I have been able to reproduce it, and am still looking into the cause.
I was able to run the above code again; instead of using "earthquake-demo" I tried a different directory name, and it worked. So you might want to try it again if you haven't. Here is what I had:
Loading them into "try1"
[root@sandbox esri-git]# hadoop fs -mkdir try1
[root@sandbox esri-git]# hadoop fs -put gis-tools-for-hadoop/samples/data/counties-data try1
[root@sandbox esri-git]# hadoop fs -put gis-tools-for-hadoop/samples/data/earthquake-data try1
[root@sandbox esri-git]# hadoop fs -ls try1
Found 2 items
drwxr-xr-x - root root 0 2015-05-12 08:47 try1/counties-data
drwxr-xr-x - root root 0 2015-05-12 08:47 try1/earthquake-data
[root@sandbox esri-git]# hadoop fs -ls try1/counties-data
Found 1 items
-rw-r--r-- 1 root root 1028330 2015-05-12 08:47 try1/counties-data/california-counties.json
[root@sandbox esri-git]# hadoop fs -ls try1/earthquake-data
Found 1 items
-rw-r--r-- 1 root root 5742716 2015-05-12 08:47 try1/earthquake-data/earthquakes.csv
Use Hive and load jars/functions
[root@sandbox esri-git]# hive
add jar
Logging initialized using configuration in file:/etc/hive/conf.dist/
> add jar
> ${env:HOME}/esri-git/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar
> ${env:HOME}/esri-git/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar;
Added /root/esri-git/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar to class path
Added resource: /root/esri-git/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar
Added /root/esri-git/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar to class path
Added resource: /root/esri-git/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar
hive> create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
Time taken: 1.814 seconds
hive> create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
Time taken: 0.545 seconds
View tables/drop old ones/create new ones
hive> show tables;
Time taken: 0.687 seconds, Fetched: 6 row(s)
hive> drop table counties;
Time taken: 0.999 seconds
hive> drop table earthquakes;
Time taken: 0.6 seconds
hive> CREATE TABLE earthquakes (earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, depth DOUBLE, magnitude DOUBLE,
> magtype string, mbstations string, gap string, distance string, rms string, source string, eventid string)
Time taken: 0.53 seconds
hive> show tables
> ;
Time taken: 0.337 seconds, Fetched: 5 row(s)
Tried loading from "earthquake-demo"
hive> DATA INPATH 'earthquake-demo/earthquake-data/earthquakes.csv' OVERWRITE INTO TABLE earthquakes;
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(
at org.apache.hadoop.hive.ql.Driver.compile(
at org.apache.hadoop.hive.ql.Driver.compile(
at org.apache.hadoop.hive.ql.Driver.compileInternal(
at org.apache.hadoop.hive.ql.Driver.runInternal(
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(
at org.apache.hadoop.hive.cli.CliDriver.processCmd(
at org.apache.hadoop.hive.cli.CliDriver.processLine(
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(
at org.apache.hadoop.hive.cli.CliDriver.main(
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.hadoop.util.RunJar.main(
FAILED: ParseException line 1:0 cannot recognize input near 'DATA' 'INPATH' ''earthquake-demo/earthquake-data/earthquakes.csv''
Loaded from new location (try1)
hive> LOAD DATA INPATH 'try1/earthquake-data/earthquakes.csv' OVERWRITE INTO TABLE earthquakes;
Loading data to table default.earthquakes
rmr: DEPRECATED: Please use 'rm -r' instead.
Moved: 'hdfs://' to trash at: hdfs://sandbox.
Table default.earthquakes stats: [numFiles=1, numRows=0, totalSize=5742716, rawDataSize=0]
Time taken: 1.327 seconds
Created counties table and loaded data
hive> CREATE TABLE counties (Area string, Perimeter string, State string, County string, Name string, BoundaryShape binary)
> ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
> STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
Time taken: 1.875 seconds
hive> LOAD DATA INPATH 'earthquake-demo/counties-data/california-counties.json' OVERWRITE INTO TABLE counties;
FAILED: SemanticException Line 1:17 Invalid path ''earthquake-demo/counties-data/california-counties.json'': No files matching path hdfs:// lifornia-counties.json
hive> LOAD DATA INPATH 'try1/counties-data/california-counties.json' OVERWRITE INTO TABLE counties;
Loading data to table default.counties
rmr: DEPRECATED: Please use 'rm -r' instead.
Moved: 'hdfs://' to trash at: hdfs://sandbox.hor
Table default.counties stats: [numFiles=1, numRows=0, totalSize=1028330, rawDataSize=0]
Time taken: 0.609 seconds
An important thing to note is that once you load the data, you have to run the put command again (like hadoop fs -put gis-tools-for-hadoop/samples/data/counties-data try1) before loading it a second time, because the file is removed from there in the process of loading (Moved: 'hdfs://' to trash at: hdfs://sandbox.hor).
Please let me know if a second run-through works.
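To make the "put it again" point concrete: LOAD DATA INPATH moves the source file rather than copying it, so the same path is gone on a second attempt. The Python sketch below is only a local-filesystem analogy of that behaviour, not Hive or HDFS code; load_inpath and the warehouse directory are hypothetical names for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def load_inpath(source: Path, table_dir: Path) -> Path:
    """Move (not copy) `source` into `table_dir`, as a stand-in for a load step."""
    table_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(source), str(table_dir / source.name)))


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the staging directory created with "hadoop fs -put".
    staging = Path(tmp) / "try1" / "earthquake-data"
    staging.mkdir(parents=True)
    csv = staging / "earthquakes.csv"
    csv.write_text("placeholder row\n")

    table_dir = Path(tmp) / "warehouse" / "earthquakes"

    # First load: the file ends up in the table directory.
    load_inpath(csv, table_dir)
    source_still_there = csv.exists()

    # Second load from the same path fails, because the source has moved.
    try:
        load_inpath(csv, table_dir)
        second_load_ok = True
    except FileNotFoundError:
        second_load_ok = False

print("source still in staging:", source_still_there)
print("second load succeeded:", second_load_ok)
```

In the analogy, re-creating the staging file before the second call plays the role of running hadoop fs -put again.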