ideas Upload and download UTF-8 named files as they are

The CKAN uploader uses ckan.lib.munge.munge_filename(), which forcibly converts every character of a filename to plain ASCII. That may be acceptable for Latin characters, but it is terrible for multi-byte characters such as CJK. So my idea is to keep the UTF-8 filename on upload and use URL encoding (percent-encoding) wherever it becomes part of a URL.

[current] "日本語.csv" -> ".csv"

[my idea] "日本語.csv" -> "%E6%97%A5%E6%9C%AC%E8%AA%9E.csv"
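To make the two results above concrete, here is a minimal Python 3 sketch. strip_non_ascii is a simplified stand-in I wrote for the clean-up step of munge_filename (it leaves out substitute_ascii_equivalents and the length constraints), and percent_encode is a hypothetical helper name, so treat it as an illustration and not as CKAN's actual code.

```python
import re
from urllib.parse import quote, unquote


def strip_non_ascii(filename):
    # Simplified stand-in for the clean-up step: lowercase, drop every
    # character outside the ASCII allow-list, turn spaces into hyphens.
    cleaned = re.sub(r"[^a-zA-Z0-9_. -]", "", filename.lower().strip())
    return re.sub(r"-+", "-", cleaned.replace(" ", "-"))


def percent_encode(filename):
    # Hypothetical helper: encode the name as UTF-8 and percent-encode
    # every byte that is not an unreserved URL character.
    return quote(filename, safe="")


name = "日本語.csv"
print(strip_non_ascii(name))                  # .csv
print(percent_encode(name))                   # %E6%97%A5%E6%9C%AC%E8%AA%9E.csv
print(unquote(percent_encode(name)) == name)  # True
```

The last line shows why the percent-encoded form is attractive: the original name can be recovered exactly, while the stripped form has lost it.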

If I skip ckan.lib.munge.munge_filename(), the resource.url field in the database fortunately seems to be URL-encoded already. My patch for this idea follows, and it works well for me.

When I tried it with several browsers, the UTF-8 filename was recovered correctly on download even without a Content-Disposition header.

I can send a pull request if you want, but please tell me your opinions first (especially those of you who use multi-byte characters).

diff --git a/ckan/lib/dictization/model_dictize.py b/ckan/lib/dictization/model_dictize.py
index 9acb15095..e05f73ca7 100644
--- a/ckan/lib/dictization/model_dictize.py
+++ b/ckan/lib/dictization/model_dictize.py
@@ -109,6 +109,8 @@ def resource_dictize(res, context):
     ## in the frontend. Without for_edit the whole qualified url is returned.
     if resource.get('url_type') == 'upload' and not context.get('for_edit'):
         url = url.rsplit('/')[-1]
+        import urllib
+        url = urllib.unquote(url.encode('utf_8'))
         cleaned_name = munge.munge_filename(url)
         resource['url'] = h.url_for(controller='package',
                                     action='resource_download',
diff --git a/ckan/lib/munge.py b/ckan/lib/munge.py
index 3ef97ba1c..d869fa430 100644
--- a/ckan/lib/munge.py
+++ b/ckan/lib/munge.py
@@ -155,12 +155,6 @@ def munge_filename(filename):
     # Ignore path
     filename = os.path.split(filename)[1]
 
-    # Clean up
-    filename = filename.lower().strip()
-    filename = substitute_ascii_equivalents(filename)
-    filename = re.sub(u'[^a-zA-Z0-9_. -]', '', filename).replace(u' ', u'-')
-    filename = re.sub(u'-+', u'-', filename)
-
     # Enforce length constraints
     name, ext = os.path.splitext(filename)
     ext = ext[:MAX_FILENAME_EXTENSION_LENGTH]

Nov 13 '19 08:11 MasaGon

I understand this issue, but I'd rather address it differently. If you change resource_dictize and munge_filename, this will be a breaking change for many existing CKAN instances.

Maybe we could add a new config option (disabled by default) if urlencode should be used instead of substituting with ASCII. WDYT?
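Roughly what I have in mind, as a Python 3 sketch only: the keep_utf8 flag stands in for a config option that does not exist yet, both function names are made up for illustration, and the ASCII branch is a simplified copy of the clean-up step (no substitute_ascii_equivalents, no length constraints).

```python
import os
import re
from urllib.parse import quote


def munge_filename_sketch(filename, keep_utf8=False):
    # keep_utf8 is a hypothetical switch; False keeps today's ASCII-only
    # behaviour so existing instances see no change.
    filename = os.path.split(filename)[1]  # ignore any path component
    if keep_utf8:
        return filename.strip()
    filename = filename.lower().strip()
    filename = re.sub(r"[^a-zA-Z0-9_. -]", "", filename).replace(" ", "-")
    return re.sub(r"-+", "-", filename)


def download_url_segment(filename, keep_utf8=False):
    # Percent-encode whatever the munge step returned before it goes in a URL.
    return quote(munge_filename_sketch(filename, keep_utf8), safe="")


print(download_url_segment("日本語.csv"))                  # .csv
print(download_url_segment("日本語.csv", keep_utf8=True))  # %E6%97%A5%E6%9C%AC%E8%AA%9E.csv
```

With the flag off the output is identical to what the ASCII clean-up produces today, which is the point of making it opt-in.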

Nov 13 '19 08:11 metaodi

Maybe we could add a new config option (disabled by default) if urlencode should be used instead of substituting with ASCII.

Agreed!

Nov 13 '19 08:11 MasaGon

I think the unquote call in resource_dictize has no effect on ASCII strings. So only the conversion to ASCII needs to be made configurable. Right?
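A quick check of that claim, written against Python 3's urllib.parse.unquote (my patch uses the Python 2 urllib.unquote). The sample names are made up, and I am assuming munged names never contain '%' because the clean-up regex removes it.

```python
from urllib.parse import unquote

# Names of the kind the ASCII clean-up produces: no '%' can survive it.
ascii_names = ["report-2019.csv", "my_data.v2.tar.gz", ".csv"]

# unquote only rewrites %XX escapes, so these come back unchanged.
unchanged = all(unquote(name) == name for name in ascii_names)
print(unchanged)  # True

# A percent-encoded UTF-8 name is decoded back to the original text.
decoded = unquote("%E6%97%A5%E6%9C%AC%E8%AA%9E.csv")
print(decoded)  # 日本語.csv
```

An ASCII name that did contain a literal %XX sequence would be changed by unquote, so the claim depends on the munged names being free of '%'.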

Nov 13 '19 09:11 MasaGon
