gis-tools-for-hadoop
Copy from HDFS not working
Here are the error details.
I have Windows 7 as the host machine and a Cloudera VM (Linux) as the guest machine. ArcMap is installed on the host machine. Please find the screenshot below, taken while accessing http://quickstart.cloudera:50070/dfshealth.html#tab-overview (192.168.126.128:50070/) through a browser on the same machine as ArcMap.
I also tried solving the problem with the help of "Copy to HDFS not working #16", but I am still facing the same problem.
Please point me in the right direction.
Can you make the tool work by using the IP address of the namenode rather than the hostname?
@climbage: No, I tried using the tool with the IP address but couldn't make it work. Any other solutions?
The third image in the issue description of 2015/01/10 does not appear to be the "same problem" as the first image; it is very different. The third image shows a web browser [Firefox] dialog asking which application should be used to open the CSV file. In fact, such a dialog indicates a successful download of the requested file. This may not provide entirely relevant information, as the screenshot appears to be of a web browser running in the guest OS in the VM, and no screenshot is provided of a browser on the host OS trying to access the file through that URL.
@mayur7789 If you have not figured this out yet: it looks like you are using "quickstart" as an input to the GP tool in your first screenshot. Your host machine most likely has no idea where "quickstart" is located unless you actually put quickstart in your Windows HOSTS file. As Randall mentions, you should try accessing the URL from your 3rd screenshot in a web browser on your host machine instead of from within the Cloudera VM. This test is to make sure webhdfs is accessible from the host machine (the same one running the GP tool from within ArcMap) to the Cloudera VM.
If the IP URL shown in your third screenshot works in your host machine's web browser, then try switching the input parameter of your GP tool in ArcMap from "quickstart" to the IPv4 address you used in the webhdfs URL test.
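For reference, that hosts-file change is a single line on the Windows side, mapping the VM's hostname to its IP address. A sketch, using the 192.168.126.128 address from the screenshots above (yours may differ):

```
# C:\Windows\System32\drivers\etc\hosts  (edit as Administrator)
192.168.126.128   quickstart.cloudera   quickstart
```

After saving, the hostname form of the URL should load in the host browser just like the IP form.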
I am having the same issue with the out-of-the-box VM provided via https://github.com/Esri/gis-tools-for-hadoop/wiki/GIS-Tools-for-Hadoop-for-Beginners . I have followed all the instructions, however I hit the same error whilst trying to run the Copy to HDFS script. I have a feeling that, as per @GISDev01, the host machine has no way of deciphering what sandbox or sandbox.hortonworks.com actually is, but I'm not sure how to make that happen. If @mayur7789, @randallwhitman or @climbage have any ideas, that would be greatly appreciated, as I'm supposed to be demoing it soon.
@JamesMilnerUK Can you change `sandbox` to `localhost` and try again?
Wrong James...
@climbage thanks! That worked (whoops, what a simple mistake). However, now I'm faced with a new error: "Unexpected error : No JSON object could be decoded". Perhaps something funky is going on with the actual created file?
@JamesMilnerUK Could be. Can you browse to the file in HDFS? http://localhost:50070
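If the browser UI is awkward to navigate, the same check can be made against the WebHDFS REST API directly. A minimal Python 2 sketch, assuming the default namenode port 50070 and the HDP sandbox warehouse path:

```python
# List the Hive warehouse directory over WebHDFS; adjust host/port/path as needed.
import urllib2

url = 'http://localhost:50070/webhdfs/v1/apps/hive/warehouse?op=LISTSTATUS'
print urllib2.urlopen(url).read()   # expect a JSON {"FileStatuses": ...} document
```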
@climbage for sure; this is what I see.
@JamesMilnerUK is there a link on that page to browse the file system?
@climbage yeah, so I can browse to the file, however when I try to open the agg_samp file I get "file not found" in my web browser.
I managed to download the file using HDP 2.2, which is strange:
No obvious malformation in the JSON, but there's so much of it that it's kind of hard to tell.
Hmm, and you created the dataset through the Hive sample?
@climbage Yeah, I'm following the workflow as described on https://github.com/Esri/gis-tools-for-hadoop/wiki/Getting-the-results-of-a-Hive-query-into-ArcGIS . Everything appears to run correctly:
But I get the JSON error now whilst running the Copy from HDFS tool
Edit: Hang on, I do get this error: "FAILED: SemanticException [Error 10004]: Line 2:7 Invalid table alias or column reference 'event_date': (possible column names are: earthquake_date, latitude, longitude, depth, mag", which I have a feeling relates to issue #24? If I change it to earthquake_date it runs error-free in Hive, but I feel the problem may run deeper.
How about `select count(*) from agg_samp`?
@climbage if I replace event_date with earthquake_date, it comes back with "OK 12948". It also seems it's not giving the finished file as JSON.
When are you replacing it?
Replacing event_date (as described in the tutorial: https://github.com/Esri/gis-tools-for-hadoop/wiki/Getting-the-results-of-a-Hive-query-into-ArcGIS) with earthquake_date, because otherwise it fails.
```sql
CREATE TABLE IF NOT EXISTS earthquakes (earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, magnitude DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH './earthquakes.csv' OVERWRITE INTO TABLE earthquakes;

CREATE TABLE earthquakes_new (earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, magnitude DOUBLE);
INSERT OVERWRITE TABLE earthquakes_new
SELECT earthquake_date, latitude, longitude, magnitude FROM earthquakes WHERE latitude IS NOT NULL;
```
Thank you for your help by the way.
Interesting. So what if you re-run the GP tool now?
The Hive commands pass, per the screenshot provided above. The GP tool gives: `Unexpected error : No JSON object could be decoded`.
Interestingly, if I copy the JSON to my desktop and run the second GP task (JSON to Feature Layer) I get this:
I'm not sure if this is how it's supposed to look as no example is provided, but I'm assuming that it is correct (feature table looks OK; visualises OK).
Well, that is how it's supposed to look so you're at least getting the right results.
Actually, I think I might know why the Copy from HDFS tool is not working. Instead of
`apps/hive/warehouse/...`
use
`/apps/hive/warehouse/...`
If you don't root the path, it will be assumed to be relative to your home directory, `/user/[username]`.
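To make the distinction concrete, here is a tiny hypothetical helper (not part of the toolbox) that roots a path before it is handed to CopyFromHDFS:

```python
# Hypothetical helper: ensure an HDFS path is rooted. Without the leading
# slash, "apps/hive/warehouse/agg_samp" is resolved against the HDFS home
# directory of the requesting user (/user/root here), i.e. it points at
# /user/root/apps/hive/warehouse/agg_samp, which does not exist.
def root_hdfs_path(path):
    return path if path.startswith('/') else '/' + path

print root_hdfs_path('apps/hive/warehouse/agg_samp')  # /apps/hive/warehouse/agg_samp
```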
Thanks for the suggestion. I have attempted it and unfortunately I am still getting the same error (Unexpected error : [Errno 11004] getaddrinfo failed). Changing it back to no slash gives me the JSON error, which is more promising (so to speak!).
Also, just out of interest, what is this query doing? Aggregating the number of earthquakes in a year into square polygons?
Can you post the traceback for the error this time?
```
Executing: CopyFromHDFS localhost 50070 root apps/hive/warehouse/agg_samp C:\Users\jmilner\Desktop\earthquakes3.json
Start Time: Wed Apr 08 20:35:45 2015
Running script CopyFromHDFS...
Unexpected error : No JSON object could be decoded
Traceback (most recent call last):
  File "<string>", line 265, in execute
  File "C:\Users\jmilner\Documents\Developer Evangelist\GeoDev Meetups\GeoDev2\geoprocessing-tools-for-hadoop\webhdfs\webhdfs.py", line 181, in getFileStatus
    data_dict = json.loads(response.read())
  File "C:\Python27\Lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\Lib\json\decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python27\Lib\json\decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
```
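For what it's worth, that ValueError just means json.loads was handed a body that isn't JSON (an HTML error page, an empty body, or a response from the wrong service). A diagnostic sketch (Python 2; host, port, path and user taken from the tool invocation above) to see what the server actually returns:

```python
# Fetch the GETFILESTATUS response by hand and inspect the raw body before
# parsing -- if it doesn't start with a JSON object, json.loads will fail
# with exactly the error shown in the traceback above.
import urllib2

url = ('http://localhost:50070/webhdfs/v1/apps/hive/warehouse/agg_samp'
       '?op=GETFILESTATUS&user.name=root')
response = urllib2.urlopen(url)
print response.getcode()   # expect 200
print response.read()      # expect {"FileStatus": {...}}
```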
I also tried changing

```sql
DROP TABLE agg_samp;
CREATE TABLE agg_samp(area binary, count double)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```

to

```sql
DROP TABLE agg_samp;
CREATE TABLE agg_samp(area binary, count double)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.UnenclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```

and still got the same error (Hive didn't fail, however).
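As an aside, the two input formats expect differently shaped files; a sketch (the field names are illustrative only, not taken from the sample):

```
EnclosedJsonInputFormat   - one complete JSON document wrapping a features array:
  {"features": [{"attributes": {"count": 1}}, {"attributes": {"count": 2}}]}

UnenclosedJsonInputFormat - bare feature objects, one after another:
  {"attributes": {"count": 1}}
  {"attributes": {"count": 2}}
```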
It still looks like you don't have a `/` at the beginning of your path.
@climbage if I put a `/` I get the old error:

```
Executing: CopyFromHDFS localhost 50070 root /apps/hive/warehouse/agg_samp
Unexpected error : [Errno 11004] getaddrinfo failed
Traceback (most recent call last):
  File "<string>", line 277, in execute
  File "C:\Users\jmilner\Documents\Developer Evangelist\GeoDev Meetups\GeoDev2\geoprocessing-tools-for-hadoop\webhdfs\webhdfs.py", line 152, in copyFromHDFS
    fileDownloadClient.request('GET', redirect_path, headers={})
  File "C:\Python27\Lib\httplib.py", line 995, in request
    self._send_request(method, url, body, headers)
  File "C:\Python27\Lib\httplib.py", line 1029, in _send_request
    self.endheaders(body)
  File "C:\Python27\Lib\httplib.py", line 991, in endheaders
    self._send_output(message_body)
  File "C:\Python27\Lib\httplib.py", line 844, in _send_output
    self.send(msg)
  File "C:\Python27\Lib\httplib.py", line 806, in send
    self.connect()
  File "C:\Python27\Lib\httplib.py", line 787, in connect
    self.timeout, self.source_address)
  File "C:\Python27\Lib\socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno 11004] getaddrinfo failed
```
That's what I'm looking for. This is slightly different in that now you're getting an error during redirection to a datanode.
Try this in the browser and see what you get.
http://localhost:50070/webhdfs/v1/apps/hive/warehouse/agg_samp/000000_0?op=OPEN
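That URL exercises the standard two-step WebHDFS read: the namenode answers OPEN with a 307 redirect whose Location header names a datanode, and the client must then resolve that datanode's hostname. A short Python 2 sketch of the same handshake (default ports assumed), which shows where the hostname problem bites:

```python
# Step 1: ask the namenode to OPEN the file. It does not serve the bytes
# itself; it answers 307 with a Location header pointing at a datanode.
# Step 2 (where this thread's gaierror occurs): the client must resolve and
# connect to that datanode hostname, e.g. sandbox.hortonworks.com:50075.
import httplib

nn = httplib.HTTPConnection('localhost', 50070)
nn.request('GET', '/webhdfs/v1/apps/hive/warehouse/agg_samp/000000_0?op=OPEN')
resp = nn.getresponse()
print resp.status                  # expect 307
print resp.getheader('Location')   # the datanode URL the host machine must resolve
```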
Thanks again. Unfortunately I get "This web page is not available". I'm redirected to:
http://sandbox.hortonworks.com:50075/webhdfs/v1/apps/hive/warehouse/agg_samp/000000_0?op=OPEN&namenoderpcaddress=sandbox.hortonworks.com:8020&offset=0
@JamesMilnerUK Is sandbox.hortonworks.com in your Windows hosts file?
Now we're getting somewhere. @GISDev01 has the solution. @smambrose can we put this in the tutorials?
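For anyone landing here later: the fix is the same kind of hosts-file mapping as for the Cloudera VM earlier in this thread, this time for the Hortonworks sandbox. A sketch; the IP below is a placeholder, so use the address your VM actually reports at boot:

```
# C:\Windows\System32\drivers\etc\hosts  (edit as Administrator)
192.168.56.101   sandbox.hortonworks.com   sandbox
```

With that entry in place, the datanode redirect to sandbox.hortonworks.com:50075 resolves from the host machine and the Copy from HDFS tool can complete the download.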