Azure streaming binary data error

Open Joe-Heffer-Shef opened this issue 1 year ago • 7 comments

Problem description

I am trying to stream a binary file from Azure Blob Storage.

I expect to be able to iterate over chunks of the data set, but I see an error to do with the Azure readinto function.

I'm using the npTDMS library to read a LabVIEW data file in TDMS format (binary quantitative data files).

Steps/code to reproduce the problem

The code is something like this:

import azure.storage.blob
import smart_open
import nptdms

CONN_STR = '******************'
BLOB_URI = 'azure://test/my_data_file.tdms'

transport_params = dict(
    client=azure.storage.blob.BlobServiceClient.from_connection_string(conn_str=CONN_STR),
)

with smart_open.open(BLOB_URI, mode='rb', transport_params=transport_params) as file:

    with nptdms.TdmsFile.open(file) as tdms_file:
        for group in tdms_file.groups():
            for channel in group.channels():
                for chunk in channel.data_chunks():
                    pass

and the error I get is:

Traceback (most recent call last):
  File "C:\Users\my_username\my_project\scripts\blob-tdms\smart.py", line 35, in <module>
    main()
  File "C:\Users\my_username\my_project\scripts\blob-tdms\smart.py", line 28, in main
    for chunk in channel.data_chunks():
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\tdms.py", line 564, in data_chunks
    for raw_data_chunk in self._read_channel_data_chunks():
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\tdms.py", line 758, in _read_channel_data_chunks
    for chunk in self._reader.read_raw_data_for_channel(self.path):
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\reader.py", line 191, in read_raw_data_for_channel
    for i, chunk in enumerate(
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\tdms_segment.py", line 269, in read_raw_data_for_channel
    for chunk in self._read_channel_data_chunks(f, data_objects, channel_path, chunk_offset, stop_chunk):
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\tdms_segment.py", line 367, in _read_channel_data_chunks
    for chunk in reader.read_channel_data_chunks(file, data_objects, channel_path, chunk_offset, stop_chunk):
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\base_segment.py", line 64, in read_channel_data_chunks
    yield self._read_channel_data_chunk(file, data_objects, chunk_index, channel_path)
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\base_segment.py", line 72, in _read_channel_data_chunk
    data_chunk = self._read_data_chunk(file, data_objects, chunk_index)
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\daqmx.py", line 39, in _read_data_chunk
    combined_data = read_interleaved_segment_bytes(file, raw_data_width, chunk_size)
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\base_segment.py", line 159, in read_interleaved_segment_bytes
    combined_data = fromfile(f, dtype=np.uint8, count=number_bytes)
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\nptdms\base_segment.py", line 147, in fromfile
    bytes_read = file.readinto(buffer[offset:])
  File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\smart_open\azure.py", line 322, in readinto
    b[:len(data)] = data
ValueError: invalid literal for int() with base 10: b'\x93\xad\x03\x00k\xf0\xff\xff\xfe\xee\xff\xffm\xfd\xff\xffd\xc1E\x00<\xad\x03\x00O\xf0\xff\xffI\xee\xff\xff\xd1\xfd\xff\xff\xbe\xc2E\x00\xe8\xac\x03\x00\xa6\xef\xff\xff\xe5\xed\xff\xff\x92\xfd\xff\x

It seems like it's expecting a text file? Or it's not calculating the data index correctly to page through the data set?

Versions

>>> import platform, sys, smart_open
>>> print(platform.platform())
Windows-10-10.0.19042-SP0
>>> print("Python", sys.version)
Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:15:42) [MSC v.1916 64 bit (AMD64)]
>>> print("smart_open", smart_open.__version__)
smart_open 6.1.0

From pip list:

azure-core          1.23.0
azure-storage-blob  12.10.0
npTDMS              1.4.0
smart-open          6.1.0

Joe-Heffer-Shef avatar Aug 23 '22 14:08 Joe-Heffer-Shef

Can you poke around with a debugger?

ValueError: invalid literal for int() with base 10

I don't see int() in the stack trace anywhere... I wonder what's actually raising that exception.

mpenkov avatar Aug 23 '22 23:08 mpenkov

It looks like it's something to do with how len works.

For example:

>>> import os
>>> data = os.urandom(32)
>>> data
b"\xaf\xc6\x89\xc4xt2s'_\xc5\xd3\xb1\xe9\x86\xa5&\x80\xf2!\x96q\xff\xbc\x81?\xc4\x8e\x14q\xe9E"
>>> len(data)
32

Joe-Heffer-Shef avatar Aug 24 '22 08:08 Joe-Heffer-Shef

I don't see any int() in your example either – how do you mean?

piskvorky avatar Aug 24 '22 08:08 piskvorky

I think the Python built-in function len is implemented in C (as part of CPython), so its Python source code isn't available. https://docs.python.org/3/library/functions.html#len

def len(*args, **kwargs): # real signature unknown
    """ Return the number of items in a container. """
    pass

This means we won't see int in the stack trace.

I guess that when calling len(s), it tries to cast the size of the argument s to an integer. For some reason this part of the code gives a binary data value for the size of the data variable?

File "C:\Users\my_username\Miniconda3\envs\my_project\lib\site-packages\smart_open\azure.py", line 322, in readinto
    b[:len(data)] = data

Joe-Heffer-Shef avatar Aug 24 '22 09:08 Joe-Heffer-Shef

For what possible values of b and data will b[:len(data)] = data (or parts of it) raise that exception?

If you're able to dig in with a debugger, it would be good to know what those values are.

mpenkov avatar Aug 24 '22 19:08 mpenkov

I believe this is an issue under the hood with the readinto implementation. I run into this same error when using S3 and Linux. The problem seems to be assigning a binary string into a numpy array. Perhaps the exception that the next line catches should be ValueError instead of AttributeError?
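
The wording of the error can be reproduced without NumPy or Azure. A minimal sketch, assuming (the thread does not confirm this) that NumPy ends up attempting an int()-style conversion of the whole bytes object when it is assigned into an integer array slice: calling int() on bytes that are not ASCII decimal digits produces the same message.

```python
# First four bytes of the payload shown in the original traceback.
raw = b"\x93\xad\x03\x00"

try:
    # int() accepts bytes only if they spell out a decimal number, e.g. b"42".
    int(raw)
except ValueError as exc:
    message = str(exc)

print(message)
```

This would explain why no int() call appears in the Python-level stack trace: the conversion would happen inside compiled code rather than in a Python frame.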

nharada1 avatar Aug 26 '22 01:08 nharada1

For what possible values of b and data will b[:len(data)] = data (or parts of it) raise that exception?

If you're able to dig in with a debugger, it would be good to know what those values are.

I ran the script using the PyCharm debugger.

Here are the values of the variables when the exception occurs:

# type: numpy.ndarray
b = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0, 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0, 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
# type: bytes
data = b'\x00\x00\x00\x00\x00\xf05\xbf\x00\x00\x00\x00.... (lots of binary data)

This is the traceback:

Traceback (most recent call last):
  File "C:/Users/my_username/my_project/scripts/blob-tdms/smart.py", line 45, in main
    for chunk in channel.data_chunks():
  File "C:\Anaconda\envs\my_project\lib\site-packages\nptdms\tdms.py", line 586, in data_chunks
    for raw_data_chunk in self._read_channel_data_chunks():
  File "C:\Anaconda\envs\my_project\lib\site-packages\nptdms\tdms.py", line 780, in _read_channel_data_chunks
    for chunk in self._reader.read_raw_data_for_channel(self.path):
  File "C:\Anaconda\envs\my_project\lib\site-packages\nptdms\reader.py", line 218, in read_raw_data_for_channel
    for i, chunk in enumerate(
  File "C:\Anaconda\envs\my_project\lib\site-packages\nptdms\tdms_segment.py", line 269, in read_raw_data_for_channel
    for chunk in self._read_channel_data_chunks(f, data_objects, channel_path, chunk_offset, stop_chunk):
  File "C:\Anaconda\envs\my_project\lib\site-packages\nptdms\tdms_segment.py", line 367, in _read_channel_data_chunks
    for chunk in reader.read_channel_data_chunks(file, data_objects, channel_path, chunk_offset, stop_chunk):
  File "C:\Anaconda\envs\my_project\lib\site-packages\nptdms\base_segment.py", line 64, in read_channel_data_chunks
    yield self._read_channel_data_chunk(file, data_objects, chunk_index, channel_path)
  File "C:\Anaconda\envs\my_project\lib\site-packages\nptdms\tdms_segment.py", line 492, in _read_channel_data_chunk
    channel_data = RawChannelDataChunk.channel_data(obj.read_values(file, number_values, self.endianness))
  File "C:\Anaconda\envs\my_project\lib\site-packages\nptdms\tdms_segment.py", line 557, in read_values
    return fromfile(file, dtype=dtype, count=number_values)
  File "C:\Anaconda\envs\my_project\lib\site-packages\nptdms\base_segment.py", line 147, in fromfile
    bytes_read = file.readinto(buffer[offset:])
  File "C:\Anaconda\envs\my_project\lib\site-packages\smart_open\azure.py", line 322, in readinto
    b[:len(data)] = data
ValueError: invalid literal for int() with base 10: b"\x00\x00\x00\x00\x00\xf05\xbf\x00\x00\x00\x00\x00\xa0=\xbf\x00\x00\x00\x00\x00P<\xbf\x00\x00\x00\x00\x00\xd0G\xbf\x00\x00\x00\x00\x00\xd0M\xbf\x00\x00\x00\x00\x00PL\xbf\x00\x00\x00\x00\x00\x98F\xbf\

This is the code in azure.py where the crash happens:

    def readinto(self, b):
        """Read up to len(b) bytes into b, and return the number of bytes read."""
        data = self.read(len(b))
        if not data:
            return 0
        b[:len(data)] = data
        return len(data)
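
One possible way to make a readinto like this accept any writable buffer, sketched under assumptions: wrap the target in a memoryview cast to unsigned bytes before the slice assignment. BufferFriendlyReader and its io.BytesIO backing are hypothetical stand-ins for illustration, not smart_open's classes, and array.array stands in here for a non-bytearray buffer such as a NumPy array.

```python
import array
import io


class BufferFriendlyReader:
    """Hypothetical stand-in for a remote blob reader (not smart_open's class)."""

    def __init__(self, payload):
        self._raw = io.BytesIO(payload)

    def read(self, size=-1):
        return self._raw.read(size)

    def readinto(self, b):
        """Read up to the byte size of b into b; return the number of bytes read."""
        # View the target as a flat run of unsigned bytes, whatever its item type.
        view = memoryview(b).cast("B")
        data = self.read(len(view))
        if not data:
            return 0
        # Byte-to-byte copy through the buffer protocol; no per-item conversion.
        view[:len(data)] = data
        return len(data)


# array.array also rejects `target[:n] = data` when data is a bytes object,
# so it exercises the same kind of non-bytearray target.
reader = BufferFriendlyReader(bytes(range(8)))
target = array.array("B", [0] * 8)
bytes_read = reader.readinto(target)
print(bytes_read, target.tolist())
```

Sizing the read from the byte view rather than from len(b) also matters when the buffer's items are wider than one byte, because len(b) counts items, not bytes.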

Please note that I've updated the package versions as follows (Conda environment.yaml file):

name: my_env
channels:
  - conda-forge
  - defaults
dependencies:
  - ca-certificates=2022.9.14=h5b45459_0
  - certifi=2022.9.14=pyhd8ed1ab_0
  - nptdms=1.6.0=pyhd8ed1ab_0
  - smart-open=6.2.0=pyh1a96a4e_0
  - smart_open=6.2.0=pyha770c72_0

Joe-Heffer-Shef avatar Sep 15 '22 09:09 Joe-Heffer-Shef