pywb
Broken URI archive_paths support
This change breaks our archive_paths: "webhdfs://server/" because
`os.path.join` just discards the prefix when the suffix is an absolute path.
https://github.com/webrecorder/pywb/blob/92e459bda52a2b03f33a4b0b8094ed424248d2a5/pywb/warcserver/resource/pathresolvers.py#L40
Hm, not sure I understand. This seems to work as expected.
The filename should generally be a relative path:
>>> os.path.join('webhdfs://server/', 'filename.warc')
'webhdfs://server/filename.warc'
Though, if it needs to be absolute, then archive_paths: ''
should work:
>>> os.path.join('', 'webhdfs://filename.warc')
'webhdfs://filename.warc'
Or do you have a mix of absolute and relative? Then this would be problematic:
>>> os.path.join('webhdfs://server/', 'webhdfs://server/filename.warc')
'webhdfs://server/webhdfs://server/filename.warc'
The problem is, ours look like this:
>>> os.path.join('webhdfs://server', '/file/path/on/hdfs.warc.gz')
'/file/path/on/hdfs.warc.gz'
but the old code gave 'webhdfs://server/file/path/on/hdfs.warc.gz'.
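For reference, here is a minimal sketch of the difference. `join_prefix` is a hypothetical helper written for this comment, not the actual pywb code (old or new); it assumes the intent is "prefix plus filename with exactly one slash between them". `posixpath` is used instead of `os.path` so the result is the same on any platform (on Linux and macOS, `os.path` is `posixpath`).

```python
import posixpath


def join_prefix(prefix, filename):
    """Hypothetical helper: concatenate prefix and filename with one slash.

    Unlike posixpath.join, a leading slash on filename does not discard
    the prefix. An empty prefix returns the filename unchanged.
    """
    if not prefix:
        return filename
    return prefix.rstrip('/') + '/' + filename.lstrip('/')


# posixpath.join restarts from the second argument when it is absolute,
# so the 'webhdfs://server' prefix is lost.
joined = posixpath.join('webhdfs://server', '/file/path/on/hdfs.warc.gz')

# Plain concatenation keeps the prefix.
prefixed = join_prefix('webhdfs://server', '/file/path/on/hdfs.warc.gz')

print(joined)
print(prefixed)
```

The helper also covers the missing-slash and empty-prefix cases mentioned in this thread, but it would not handle a mix of prefixed and already-absolute URIs.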
Ah, I see. Hm, perhaps we should just keep the old behavior for now. It was designed to deal with edge cases where the slash is missing.