datapusher
Authorization header problem - push to datastore fails - HTTP status code: 400
When trying to use datapusher, it would consistently fail.
Datapusher log:
ckan | 2017-02-23 21:58:40,052 INFO [ckan.lib.base] /dataset/test-4/resource_data/1aec102a-035e-4f7b-856b-e9eaf49510d8 render time 0.144 seconds
ckan | 2017-02-23 21:58:40,451 INFO [ckan.lib.base] /dataset/49de5ea7-29c2-431f-ab27-eaaf74d9469b/resource/1aec102a-035e-4f7b-856b-e9eaf49510d8/download/techcrunchcontinentalusa-2.csv render time 0.461 seconds
ckan | 2017-02-23 21:58:40,520 INFO [ckan.lib.base] /api/i18n/en render time 0.002 seconds
ckan | 2017-02-23 21:58:40,645 INFO [ckan.lib.base] /api/3/action/datapusher_hook render time 0.015 seconds
ckan_datapusher | --------------------------------------------------------------------------------
ckan_datapusher | ERROR in scheduler [/usr/lib/python2.7/site-packages/apscheduler/scheduler.py:520]:
ckan_datapusher | Job "push_to_datastore (trigger: RunTriggerNow, run = True, next run at: None)" raised an exception
ckan_datapusher | --------------------------------------------------------------------------------
ckan_datapusher | Traceback (most recent call last):
ckan_datapusher | File "/usr/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job
ckan_datapusher | retval = job.func(*job.args, **job.kwargs)
ckan_datapusher | File "build/bdist.linux-x86_64/egg/datapusher/jobs.py", line 321, in push_to_datastore
ckan_datapusher | request_url=resource.get('url'), response=e.read())
ckan_datapusher | HTTPError
CKAN HTML log:
Error: DataPusher received a bad HTTP response when trying to download the data file
HTTP status code: 400
Response: <?xml version="1.0" encoding="UTF-8"?> <Error><Code>InvalidArgument</Code><Message>Authorization header is invalid -- one and only one ' ' (space) required</Message><ArgumentName>Authorization</Argume...
Requested URL: http://data2.vta.org/dataset/49de5ea7-29c2-431f-ab27-eaaf74d9469b/resource/1aec102a-035e-4f7b-856b-e9eaf49510d8/download/techcrunchcontinentalusa-2.csv
This is a docker deploy with CKAN 2.6.0 and datapusher 0.10.0.
The problem lies with this area of code: https://github.com/ckan/datapusher/commit/2394e81c47205abe7df4be7215aa66fa63bedc1e I commented out the line about adding an authorization header and then rebuilt the datapusher - it worked for all my public files and failed for all my private files.
I'm not sure if this problem is in my code and config or updates to ckan.
Can you please provide more information about the setup? Are you using the local filestore, or are the files stored on some external service?
The files were pushed to S3 and the data, through the api, is pushed to Postgres on RDS.
The full config is here: https://github.com/vta/Open-Data-Portal
We had a similar problem when datapusher was getting the file from S3; in that case we started our server with more workers. Can you please try the same?
I don't totally understand the concept of workers - I don't think I have workers on my setup. There is celery that does the uploads. I found a command to spin up more workers but it didn't do anything. It isn't an issue of it only working some of the time and not all of the time. I think the relevant issue is around authentication.
Was there a resolution for this? I am seeing this same issue.
What I think is happening here is that when https://github.com/ckan/datapusher/commit/2394e81c47205abe7df4be7215aa66fa63bedc1e was introduced, the only available backend for the FileStore was local storage. What this commit did was add an Authorization header to DataPusher requests for uploaded files so CKAN itself could check the authorization.
But when using another backend (eg S3), the Authorization header gets passed along as well, and AWS is not happy about it:
<?xml version="1.0" encoding="UTF-8"?> <Error><Code>InvalidArgument</Code><Message>Authorization header is invalid -- one and only one ' ' (space) required</Message><ArgumentName>Authorization</Argum
@3vivekb what extension are you using for S3 storage? @Rjones8901 are you using a different storage backend?
We are using S3 for the file source and RDS for the datastore. I agree with your assessment of the problem. I found that if I edited jobs.py to api_key = api_key.strip(api_key), it threw a new error about an invalid datastore user. I made this change to api_key at line 340 of datapusher's jobs.py: https://github.com/ckan/datapusher/blob/master/datapusher/jobs.py
I should note that a workaround is commenting out lines 342/345; however, the S3 file must then be public for datapusher to be able to grab it.
@Rjones8901 what extension are you using for the S3 storage?
ckanext_cloudstorage
@Rjones8901 @3vivekb I believe this is an issue with ckanext-cloudstorage, see https://github.com/TkTech/ckanext-cloudstorage/issues/13
@amercader I think this is an issue with datapusher rather than with ckanext-cloudstorage.
When you use CloudStorage, the attempt to get the resource in jobs.push_to_datastore returns a redirect that points to the correct remote location. CloudStorage only returns the new location; it isn't setting any headers.
The version of datapusher that I am using uses urllib2.urlopen to get the resource, and that follows the redirect but keeps all the same headers.
I think that datapusher should remove the Authorization header that it added if the redirect is to a different domain. My installation is working correctly with private resources on AWS S3 using ckanext-cloudstorage using the following code:
    import urllib2
    import urlparse


    class HTTPRedirectHandler(urllib2.HTTPRedirectHandler):
        """Remove authorization header if redirecting to a new domain."""

        def redirect_request(self, req, fp, code, msg, headers, newurl):
            # Compare the host of the original request with the redirect target
            oldloc = urlparse.urlparse(req.get_full_url()).netloc
            newloc = urlparse.urlparse(newurl).netloc
            if 'Authorization' in req.headers and newloc != oldloc:
                del req.headers['Authorization']
            return urllib2.HTTPRedirectHandler.redirect_request(
                self, req, fp, code, msg, headers, newurl)
Then in push_to_datastore:
    if resource.get('url_type') == 'upload':
        # If this is an uploaded file to CKAN, authenticate the request,
        # otherwise we won't get file from private resources
        request.add_header('Authorization', api_key)

    opener = urllib2.build_opener(HTTPRedirectHandler())
    response = opener.open(request, timeout=DOWNLOAD_TIMEOUT)
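For anyone reading this on Python 3, the same idea can be sketched with urllib.request. This is only an illustration of the handler above, not code taken from datapusher; the class name, the example hosts, and the API key value are made up for the example.

```python
import urllib.request
from urllib.parse import urlparse


class CrossDomainAuthStrippingHandler(urllib.request.HTTPRedirectHandler):
    """Drop the Authorization header when a redirect leaves the original host."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        old_netloc = urlparse(req.get_full_url()).netloc
        new_netloc = urlparse(newurl).netloc
        if new_netloc != old_netloc:
            # Request.add_header() stores names via str.capitalize(),
            # so 'Authorization' is the key to look for.
            req.headers.pop('Authorization', None)
        # The parent class copies the remaining headers onto the new request.
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def build_request(url, api_key):
    """Build a request carrying a (hypothetical) CKAN API key."""
    request = urllib.request.Request(url)
    request.add_header('Authorization', api_key)
    return request


if __name__ == '__main__':
    handler = CrossDomainAuthStrippingHandler()
    original = build_request('http://ckan.example/download/file.csv',
                             'hypothetical-api-key')
    # Simulate the 302 that the storage extension would send back.
    redirected = handler.redirect_request(
        original, None, 302, 'Found', {},
        'https://bucket.s3.example/file.csv')
    print(redirected.has_header('Authorization'))
```

To use it for real downloads, the handler would be passed to urllib.request.build_opener(), the same way the urllib2 version is wired into push_to_datastore above. A redirect that stays on the same host keeps the header, so private resources served by CKAN itself should still authenticate.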
I will try a version of datapusher from trunk and see if we have the same problem with the current version, which uses requests instead of urllib2.urlopen.
I can't recreate this issue using the current version of datapusher that uses requests.get instead of urllib2.urlopen to get the resource. It can be closed as far as I can tell, although obviously I didn't open it.
@3vivekb @Rjones8901 if either of you are still active, have you had a chance to retry this with ckanext-datapusher master?
Negative. I haven't had a chance to retry this yet with a new master, however I was seeing this about 2 weeks ago with CKAN 2.7
Do we have any resolution for this issue, where DataPusher errors with access denied for S3?