Decode error for Unicode characters

Open epifanio opened this issue 5 years ago • 8 comments

maybe related to https://github.com/pydap/pydap/pull/152 and https://github.com/pydap/pydap/issues/164

I am trying to serve a NetCDF file via PyDap. The file opens fine in a standard Python console with direct access through python-netcdf4, but in pydap the DAS is not available on the web interface and the Apache log shows this error:

[Wed May 22 13:55:25.685392 2019] [wsgi:error] [pid 20625:tid 140168119965440] [client 157.249.114.74:44934]   File "/usr/local/lib/python3.6/dist-packages/pydap/responses/das.py", line 44, in __iter__, referer: http://dap.metsis.met.no/
[Wed May 22 13:55:25.685402 2019] [wsgi:error] [pid 20625:tid 140168119965440] [client 157.249.114.74:44934]     #yield line.encode('ascii'), referer: http://dap.metsis.met.no/
[Wed May 22 13:55:25.685429 2019] [wsgi:error] [pid 20625:tid 140168119965440] [client 157.249.114.74:44934] UnicodeEncodeError: 'ascii' codec can't encode character '\\xd8' in position 33: ordinal not in range(128), referer: http://dap.metsis.met.no/
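
The character '\xd8' in the log is U+00D8 (Ø), which has no ASCII representation, so `line.encode('ascii')` raises. A minimal sketch of a fallback, assuming the DAS response yields text lines one at a time; `encode_das_line` and the sample attribute line are hypothetical and are not taken from pydap:

```python
def encode_das_line(line):
    """Encode one line of DAS text, preferring ASCII and falling back to UTF-8."""
    try:
        return line.encode("ascii")
    except UnicodeEncodeError:
        # Non-ASCII metadata (for example U+00D8) ends up here.
        return line.encode("utf-8")


# Hypothetical attribute line containing U+00D8; not taken from the real file.
sample = 'String title "\u00d8";'
encoded = encode_das_line(sample)
print(encoded)
```

Note that U+00D8 becomes the two bytes 0xc3 0x98 in UTF-8, which matches the 0xc3 byte the client later fails to decode as ASCII.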

A bad hack to fix the DAS is to catch the exception and retry with UTF-8, which gave me a working DAS page. But this doesn't fix pydap.client; the error when trying to load such a dataset is:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-1-2bb713f8a88f> in <module>
      1 from pydap.client import open_url
----> 2 dataset = open_url('http://dap.metsis.met.no/SN99938.nc')

/usr/local/lib/python3.7/dist-packages/pydap/client.py in open_url(url, application, session, output_grid, timeout, verify)
     65     """
     66     dataset = DAPHandler(url, application, session, output_grid,
---> 67                          timeout=timeout, verify=verify).dataset
     68 
     69     # attach server-side functions

/usr/local/lib/python3.7/dist-packages/pydap/handlers/dap.py in __init__(self, url, application, session, output_grid, timeout, verify)
     61                 verify=verify)
     62         raise_for_status(r)
---> 63         das = safe_charset_text(r)
     64 
     65         # build the dataset from the DDS and add attributes from the DAS

/usr/local/lib/python3.7/dist-packages/pydap/handlers/dap.py in safe_charset_text(r)
    115     else:
    116         r.charset = get_charset(r)
--> 117         return r.text
    118 
    119 

/usr/local/lib/python3.7/dist-packages/webob/response.py in _text__get(self)
    620         decoding = self.charset or self.default_body_encoding
    621         body = self.body
--> 622         return body.decode(decoding, self.unicode_errors)
    623 
    624     def _text__set(self, value):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 94: ordinal not in range(128)

If I put a print statement at line 622 of /usr/local/lib/python3.7/dist-packages/webob/response.py, it tells me the decoding is set to ascii, while in my case it should be utf-8. The decoding is defined a few lines above by: decoding = self.charset or self.default_body_encoding

Adding another try/except to switch to utf-8 would work, but this is a hack and, most importantly, it happens on the client side, where I have no control over the pydap version used by a potential user. Do you have any suggestions?
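
The try/except fallback can be sketched on plain bytes, independent of webob. `decode_body` is a hypothetical helper, not part of pydap or webob; it tries the declared charset first, then UTF-8, and only then substitutes replacement characters:

```python
def decode_body(body, charset=None, fallback="utf-8"):
    """Decode bytes with the declared charset, falling back to UTF-8."""
    for encoding in (charset, fallback):
        if not encoding:
            continue  # no charset declared, go straight to the fallback
        try:
            return body.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return body.decode(fallback, errors="replace")


print(decode_body(b"\xc3\x98", charset="ascii"))
```

This only illustrates the decoding order; where such a fallback should live (client, server, or both) is a separate design question.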

epifanio avatar May 22 '19 14:05 epifanio

I tried to manually set the r.charset value to UTF-8 in src/pydap/handlers/dap.py, both in DAPHandler() and in get_charset(), with no luck.

To debug, I added a print statement in webob/response.py to see which value is used to decode the response:

        print('#######################')
        print(self.charset)

Even after my hardcoded UTF-8 changes, it still prints 'ascii':

In [1]: from pydap.client import open_url                                                                                                                        

In [2]: url = 'http://internal.link.to/SN99938.nc'                                                                                                              

In [3]: dataset = open_url(url)                                                                                                                                  
#######################
ascii
#######################
ascii

It is my understanding that self.charset is set in PyDap, so it looks like either it is not set properly or my manual changes are ignored.

The only way to bypass the error is to manually force the decoding to UTF-8 by replacing:

decoding = self.charset or self.default_body_encoding

with:

decoding='UTF-8'

A test file is available for debugging this issue at: https://epinux.com/index.php/s/3cixFyp7yktaaWL

epifanio avatar May 28 '19 08:05 epifanio

Can you help with this? It is a crucial setback for countries that use Unicode characters in their NetCDF metadata :(

epifanio avatar Jun 13 '19 16:06 epifanio

Experiencing the same thing, for this URL: https://thredds.met.no/thredds/dodsC/meps25epsarchive/2017/10/29/meps_mbr0_pp_2_5km_20171029T00Z.nc

Left a comment at #162 with some more details.

tahaum avatar Oct 05 '19 15:10 tahaum

This seemed to work for me, in pydap/handlers/dap.py

 def get_charset(r):
     charset = r.charset
     if not charset:
-        charset = 'ascii'
+        charset = 'utf-8'
     return charset
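
A runnable version of the same idea, as a sketch: the response object is stood in for by a `SimpleNamespace` with a `charset` attribute, and the `default` parameter is an addition that is not in pydap's `get_charset`:

```python
from types import SimpleNamespace


def get_charset(r, default="utf-8"):
    """Return the charset declared on the response, or a default if none is set."""
    charset = getattr(r, "charset", None)
    return charset if charset else default


# Stand-ins for responses with and without a declared charset.
no_charset = SimpleNamespace(charset=None)
latin1 = SimpleNamespace(charset="iso-8859-1")

print(get_charset(no_charset))
print(get_charset(latin1))
```

Since ASCII is a subset of UTF-8, a UTF-8 default decodes every body the ASCII default could decode.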

petejan avatar Mar 30 '20 23:03 petejan

@petejan did you also try to serve the same file via pydap-server, or did you just read it using pydap-client?

epifanio avatar Apr 08 '20 19:04 epifanio

I was using the pydap-client to open http://thredds.aodn.org.au/thredds/catalog/IMOS/ABOS/SOTS/2018/catalog.html?dataset=IMOS/ABOS/SOTS/2018/IMOS_ABOS-SOTS_COSTZ_20180801_SOFS_FV00_SOFS-7.5-2018-SBE37SMP-ODO-RS232-03715969-30m_END-20190327_C-20190606.nc

Example:

Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:52:21) 
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pydap.client import open_url
>>> dataset = open_url('http://thredds.aodn.org.au/thredds/dodsC/IMOS/ABOS/SOTS/2018/IMOS_ABOS-SOTS_COSTZ_20180801_SOFS_FV00_SOFS-7.5-2018-SBE37SMP-ODO-RS232-03715969-30m_END-20190327_C-20190606.nc.html')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pydap/client.py", line 67, in open_url
    timeout).dataset
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pydap/handlers/dap.py", line 54, in __init__
    raise_for_status(r)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pydap/net.py", line 34, in raise_for_status
    detail=response.status+'\n'+response.text,
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/webob/response.py", line 622, in _text__get
    return body.decode(decoding, self.unicode_errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
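
Byte 0x8b at position 1 is the second byte of the gzip magic number (0x1f 0x8b), so one possibility is that the body is still gzip-compressed when it reaches body.decode(). That is an inference from the error message, not something confirmed in this thread. A minimal sketch of checking for it, using only the standard library; `maybe_gunzip` is a hypothetical helper:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(body):
    """Decompress the body if it starts with the gzip magic number."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body


raw = "Attributes { title \u00d8 }".encode("utf-8")
compressed = gzip.compress(raw)

print(hex(compressed[1]))
print(maybe_gunzip(compressed).decode("utf-8"))
```

If that is what happens here, changing the charset alone would not help, because no text codec can decode a compressed stream.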

petejan avatar Apr 08 '20 22:04 petejan

FYI, this also occurs with this dataset:

import xarray as xr
ds = xr.open_dataset("https://thredds.ucar.edu/thredds/dodsC/grib/NCEP/GFS/Global_0p25deg/Best", engine="pydap")

UnicodeDecodeError                        Traceback (most recent call last)
/var/folders/rf/26llfhwd68x7cftb1z3h000w0000gp/T/ipykernel_837/4088958308.py in <module>
----> 1 ds = xr.open_dataset(url, engine="pydap")
      2 ds

~/miniconda3/envs/main/lib/python3.9/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    495 
    496     overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 497     backend_ds = backend.open_dataset(
    498         filename_or_obj,
    499         drop_variables=drop_variables,

~/miniconda3/envs/main/lib/python3.9/site-packages/xarray/backends/pydap_.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, session, lock)
    137             )
    138 
--> 139         store = PydapDataStore.open(
    140             filename_or_obj,
    141             session=session,

~/miniconda3/envs/main/lib/python3.9/site-packages/xarray/backends/pydap_.py in open(cls, url, session)
     91     def open(cls, url, session=None):
     92 
---> 93         ds = pydap.client.open_url(url, session=session)
     94         return cls(ds)
     95 

~/miniconda3/envs/main/lib/python3.9/site-packages/pydap/client.py in open_url(url, application, session, output_grid, timeout)
     64     never retrieve coordinate axes.
     65     """
---> 66     dataset = DAPHandler(url, application, session, output_grid,
     67                          timeout).dataset
     68 

~/miniconda3/envs/main/lib/python3.9/site-packages/pydap/handlers/dap.py in __init__(self, url, application, session, output_grid, timeout)
     62         if not r.charset:
     63             r.charset = 'ascii'
---> 64         das = r.text
     65 
     66         # build the dataset from the DDS and add attributes from the DAS

~/miniconda3/envs/main/lib/python3.9/site-packages/webob/response.py in _text__get(self)
    620         decoding = self.charset or self.default_body_encoding
    621         body = self.body
--> 622         return body.decode(decoding, self.unicode_errors)
    623 
    624     def _text__set(self, value):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 69191: ordinal not in range(128)

raybellwaves avatar Sep 14 '21 04:09 raybellwaves

Have you solved this problem? I am getting the same error message:

++++++++ code +++++++++++

from pydap.client import open_url
import xarray as xr
import time

# ORNL DAAC TDS OPeNDAP URL for Daymet V4 Daily Files
thredds_url = 'https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1840/'

# granule_names, lccbounds and var are assumed to be defined earlier (not shown here)
before = time.time()
cnt = 0
for g_name in granule_names:
    print(' GRANULE_NAME ---->', g_name)
    granule_dap = thredds_url + g_name.replace('Daymet_Daily_V4.', '')
    print(granule_dap)

    # Using pydap's open_url
    thredds_ds = open_url(granule_dap)

    # Xarray DataSet - opening dataset via remote OPeNDAP
    ds = xr.open_dataset(xr.backends.PydapDataStore(thredds_ds), decode_coords="all")

    temp = ds['prcp'].sel(x=slice(lccbounds.minx[0], lccbounds.maxx[0]), y=slice(lccbounds.maxy[0], lccbounds.miny[0]))

    if cnt == 0:
        prcp = temp
    else:
        prcp = xr.concat([prcp, temp], dim="time")

    cnt += 1

# save to netcdf
prcp.to_netcdf(var + '_tdssubset.nc')
print("Processing Time: ", time.time() - before, 'seconds')
# Processing Time: 50.4509379863739 seconds

++++++++++ error message +++++++++++

 GRANULE_NAME ----> Daymet_Daily_V4.daymet_v4_daily_na_prcp_2010.nc
https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1840/daymet_v4_daily_na_prcp_2010.nc

UnicodeDecodeError                        Traceback (most recent call last)
/tmp/ipykernel_4146/70333123.py in <module>
     14 
     15     # Using pydap's open_url
---> 16     thredds_ds = open_url(granule_dap)
     17 
     18     # Xarray DataSet - opening dataset via remote OPeNDAP

~/bc_gov/lib/python3.8/site-packages/pydap/client.py in open_url(url, application, session, output_grid, timeout)
     64     never retrieve coordinate axes.
     65     """
---> 66     dataset = DAPHandler(url, application, session, output_grid,
     67                          timeout).dataset
     68 

~/bc_gov/lib/python3.8/site-packages/pydap/handlers/dap.py in __init__(self, url, application, session, output_grid, timeout)
     55         if not r.charset:
     56             r.charset = 'ascii'
---> 57         dds = r.text
     58 
     59         dasurl = urlunsplit((scheme, netloc, path + '.das', query, fragment))

~/bc_gov/lib/python3.8/site-packages/webob/response.py in _text__get(self)
    620         decoding = self.charset or self.default_body_encoding
    621         body = self.body
--> 622         return body.decode(decoding, self.unicode_errors)
    623 
    624     def _text__set(self, value):

UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128)

mbexhrs3 avatar Oct 22 '21 02:10 mbexhrs3