Upload and download UTF-8-named files with their original names
CKAN's uploader uses ckan.lib.munge.munge_filename(), which forcibly converts every character of a filename to ASCII. That may be acceptable for Latin scripts, but it is terrible for multi-byte scripts such as CJK. My idea is to keep the UTF-8 filename on upload and use URL encoding (percent-encoding) wherever the name appears in a URL.
[current] "日本語.csv" -> ".csv"
[my idea] "日本語.csv" -> "%E6%97%A5%E6%9C%AC%E8%AA%9E.csv"
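To make the two mappings above concrete, here is a minimal sketch. It assumes Python 3's urllib.parse (the patch below targets Python 2's urllib), and it reduces the current behaviour to the character filter visible in munge_filename, leaving out substitute_ascii_equivalents and the length constraints.

```python
import re
from urllib.parse import quote, unquote

original = "日本語.csv"

# Current behaviour, reduced to the character filter in munge_filename:
# every character outside [a-zA-Z0-9_. -] is dropped.
stripped = re.sub(r"[^a-zA-Z0-9_. -]", "", original.lower().strip())

# Proposed behaviour: keep the UTF-8 name and percent-encode it for the URL.
encoded = quote(original)    # quote() encodes str as UTF-8 by default
decoded = unquote(encoded)   # decoding restores the original name

print(stripped)              # .csv
print(encoded)               # %E6%97%A5%E6%9C%AC%E8%AA%9E.csv
print(decoded == original)   # True
```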
If I skip ckan.lib.munge.munge_filename(), the resource.url field in the database fortunately appears to be stored percent-encoded already. My patch for this idea is below; it worked well for me.
When I tried this in several browsers, the UTF-8 filename was restored correctly on download even without a Content-Disposition header.
I can send a pull request if you like, but please share your opinions first (especially if you use multi-byte characters).
diff --git a/ckan/lib/dictization/model_dictize.py b/ckan/lib/dictization/model_dictize.py
index 9acb15095..e05f73ca7 100644
--- a/ckan/lib/dictization/model_dictize.py
+++ b/ckan/lib/dictization/model_dictize.py
@@ -109,6 +109,8 @@ def resource_dictize(res, context):
     ## in the frontend. Without for_edit the whole qualified url is returned.
     if resource.get('url_type') == 'upload' and not context.get('for_edit'):
         url = url.rsplit('/')[-1]
+        import urllib
+        url = urllib.unquote(url.encode('utf_8'))
         cleaned_name = munge.munge_filename(url)
         resource['url'] = h.url_for(controller='package',
                                     action='resource_download',
diff --git a/ckan/lib/munge.py b/ckan/lib/munge.py
index 3ef97ba1c..d869fa430 100644
--- a/ckan/lib/munge.py
+++ b/ckan/lib/munge.py
@@ -155,12 +155,6 @@ def munge_filename(filename):
     # Ignore path
     filename = os.path.split(filename)[1]
 
-    # Clean up
-    filename = filename.lower().strip()
-    filename = substitute_ascii_equivalents(filename)
-    filename = re.sub(u'[^a-zA-Z0-9_. -]', '', filename).replace(u' ', u'-')
-    filename = re.sub(u'-+', u'-', filename)
-
     # Enforce length constraints
     name, ext = os.path.splitext(filename)
     ext = ext[:MAX_FILENAME_EXTENSION_LENGTH]
I understand this issue, but I'd rather address it differently. Changing resource_dictize and munge_filename this way would be a breaking change for many existing CKAN instances.
Maybe we could add a new config option (disabled by default) that controls whether URL encoding is used instead of substituting with ASCII. WDYT?
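One way the opt-in behaviour suggested here might look is sketched below. It is not CKAN code: munge_filename_sketch, download_url_segment, and the keep_unicode flag are hypothetical names standing in for a config-driven switch. The sketch uses Python 3 and omits substitute_ascii_equivalents and the length constraints for brevity.

```python
import os
import re
from urllib.parse import quote


def munge_filename_sketch(filename: str, keep_unicode: bool = False) -> str:
    """Hypothetical stand-in for munge_filename with an opt-in Unicode mode."""
    # Ignore any path component, as the existing function does.
    filename = os.path.split(filename)[1]
    if keep_unicode:
        # Opt-in: keep the characters; encoding happens when the URL is built.
        return filename.strip()
    # Default (unchanged behaviour): ASCII-only clean-up, mirroring the
    # lines that the patch removes.
    filename = filename.lower().strip()
    filename = re.sub(r"[^a-zA-Z0-9_. -]", "", filename).replace(" ", "-")
    return re.sub(r"-+", "-", filename)


def download_url_segment(filename: str, keep_unicode: bool = False) -> str:
    """Percent-encode the munged name for use as the last URL segment."""
    return quote(munge_filename_sketch(filename, keep_unicode))


print(munge_filename_sketch("My File.CSV"))                 # my-file.csv
print(munge_filename_sketch("日本語.csv"))                   # .csv
print(download_url_segment("日本語.csv", keep_unicode=True))  # %E6%97%A5%E6%9C%AC%E8%AA%9E.csv
```

Because the flag defaults to False, existing instances would keep producing the same ASCII names unless they opt in.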
Maybe we could add a new config option (disabled by default) that controls whether URL encoding is used instead of substituting with ASCII.
Agreed!
I think the unquote call in resource_dictize has no effect on ASCII strings, so only the conversion to ASCII needs to be made configurable. Right?
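The claim about unquote can be checked with a short sketch, assuming Python 3's urllib.parse.unquote rather than the Python 2 urllib.unquote used in the patch. Plain ASCII names pass through unchanged; the one case worth keeping in mind is an ASCII name that itself contains a percent escape.

```python
from urllib.parse import unquote

plain = "my-file.csv"
escaped_ascii = "report%202020.csv"
escaped_utf8 = "%E6%97%A5%E6%9C%AC%E8%AA%9E.csv"

plain_result = unquote(plain)            # no '%' present, so nothing changes
ascii_result = unquote(escaped_ascii)    # '%20' is decoded to a space
utf8_result = unquote(escaped_utf8)      # bytes are decoded as UTF-8

print(plain_result == plain)   # True
print(ascii_result)            # report 2020.csv
print(utf8_result)             # 日本語.csv
```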