
[BUG] Running SQL queries on JSON-based tables restarts kernel

Open drabastomek opened this issue 4 years ago • 7 comments

Describe the bug

I cannot execute any SQL query against a table that is based on a JSON file. I have tried all the possible orient options.

Steps/Code to reproduce bug

Here's a sample of the queries that I tried. The underlying data is stored at s3://bsql/data/samples/test.json (100 records only) and has the following layout (line-delimited, records orientation):

{"VendorID":1,"passenger_count":2}
{"VendorID":2,"passenger_count":6}
{"VendorID":2,"passenger_count":1}
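For reference, a file in the layout shown above (one JSON object per line) can be produced and read back with the standard library alone. This is only a sketch: the temporary path and the helper names write_json_lines and read_json_lines are illustrative and not part of the original report.

```python
import json
import os
import tempfile

# Sample rows mirroring the three records shown above.
rows = [
    {"VendorID": 1, "passenger_count": 2},
    {"VendorID": 2, "passenger_count": 6},
    {"VendorID": 2, "passenger_count": 1},
]


def write_json_lines(path, records):
    """Write one compact JSON object per line (records orientation, line-delimited)."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, separators=(",", ":")) + "\n")


def read_json_lines(path):
    """Parse a line-delimited JSON file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical local path, used only for this illustration.
sample_path = os.path.join(tempfile.mkdtemp(), "test.json")
write_json_lines(sample_path, rows)
round_tripped = read_json_lines(sample_path)
```

A file written this way has no enclosing array and no commas between records, which is why the lines option matters when the table is created.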

Code I tried.

  1. No parameters to create_table
from blazingsql import BlazingContext
bc = BlazingContext()
bc.s3('bsql', bucket_name = 'bsql')
bc.create_table('taxi', 's3://bsql/data/samples/test.json')

df = bc.sql('''
    SELECT *
    FROM taxi 
    LIMIT 10
''')

Kernel hangs and restarts.

  2. Passing lines parameter to create_table
from blazingsql import BlazingContext
bc = BlazingContext()
bc.s3('bsql', bucket_name = 'bsql')
bc.create_table('taxi', 's3://bsql/data/samples/test.json', lines=True)

df = bc.sql('''
    SELECT *
    FROM taxi 
    LIMIT 10
''')

Kernel hangs and restarts.

  3. Passing lines and orient parameters to create_table
from blazingsql import BlazingContext
bc = BlazingContext()
bc.s3('bsql', bucket_name = 'bsql')
bc.create_table('taxi', 's3://bsql/data/samples/test.json', lines=True, orient='records')

df = bc.sql('''
    SELECT *
    FROM taxi 
    LIMIT 10
''')

bc complains that the orient parameter is not recognized.

  4. I also tested all of these variants against files saved with different orient and date_format options: https://docs.rapids.ai/api/cudf/stable/api.html#cudf.io.json.to_json

Expected behavior

I should be able to create and query a table built on top of a JSON file.

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of BlazingSQL install: Docker
    • cudf version: 0.17.0
    • blazingsql version: 230c9c6fac4909baeaee84b253bdd5d7cae1652d

drabastomek, Jan 21 '21 00:01

cc: @williamBlazing, related to #349

drabastomek, Jan 21 '21 00:01

@drabastomek could you try a nightly version of both blazingsql and cudf? Does the issue still persist? If so, does it happen only with a remote filesystem like s3, or with a local filesystem as well? I am not able to reproduce this, either locally or with the data/samples/test.json file in s3, using an updated env.

Christian8491, Jan 28 '21 22:01

@Christian8491 I can try nightly and circle back. The kernel simply halts and restarts without any communication to the user.
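One way to get more information out of a silent kernel restart is to run the repro in a separate interpreter, so that a native crash surfaces as an exit code and stderr output instead of taking the notebook down. A minimal sketch using only the standard library follows; run_isolated is a hypothetical helper name, and the child snippet is a stand-in rather than the actual BlazingSQL repro.

```python
import subprocess
import sys


def run_isolated(code, timeout=300):
    """Run `code` in a fresh interpreter and report how it ended.

    The child is started with -X faulthandler so that a fatal signal
    (for example a segfault in a native extension) dumps a Python
    traceback to stderr before the process dies.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


# Stand-in for the failing query: a child process that exits abnormally.
returncode, out, err = run_isolated("import sys; print('before'); sys.exit(3)")
```

On POSIX systems a negative return code means the child was killed by a signal, which would point at a crash in native code rather than a Python exception.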

drabastomek, Jan 28 '21 23:01

@Christian8491 I ran this code on rapids-nightly in beta:

from blazingsql import BlazingContext
bc = BlazingContext()
bc.s3('bsql', bucket_name = 'bsql')
bc.create_table('taxi', 's3://bsql/data/samples/test.json', lines=True)

df = bc.sql('''
    SELECT *
    FROM taxi 
    LIMIT 10
''')

Got this error this time:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-89be86c1e1c8> in <module>
----> 1 from blazingsql import BlazingContext
      2 bc = BlazingContext()
      3 bc.s3('bsql', bucket_name = 'bsql')
      4 bc.create_table('taxi', 's3://bsql/data/samples/test.json', lines=True)
      5 

~/.conda/envs/rapids-nightly/lib/python3.8/site-packages/blazingsql/__init__.py in <module>
      1 from pyblazing.apiv2 import S3EncryptionType
      2 from pyblazing.apiv2 import DataType
----> 3 from pyblazing.apiv2.context import BlazingContext
      4 
      5 from cio import getProductDetailsCaller

~/.conda/envs/rapids-nightly/lib/python3.8/site-packages/pyblazing/apiv2/context.py in <module>
      8 from threading import Lock
      9 from weakref import ref
---> 10 from pyblazing.apiv2.filesystem import FileSystem
     11 from pyblazing.apiv2 import DataType
     12 

~/.conda/envs/rapids-nightly/lib/python3.8/site-packages/pyblazing/apiv2/filesystem.py in <module>
      1 from collections import OrderedDict
      2 
----> 3 import cio
      4 
      5 from pyblazing.apiv2 import S3EncryptionType

ImportError: libcudart.so.11.0: cannot open shared object file: No such file or directory

drabastomek, Jan 29 '21 00:01

I just reproduced this on rapids-nightly in beta: ImportError: libcudart.so.11.0: cannot open shared object file: No such file or directory. Note that it fails on the from blazingsql import BlazingContext line, before any query runs.

Christian8491, Jan 29 '21 23:01

Correct. But the same code works in rapids-stable. I think our nightly env is broken. cc: @mario21ic
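A quick check of whether the dynamic loader in a given environment can find the CUDA runtime named in the error, as a sketch only: can_load_shared_library is a hypothetical helper name, and the check says nothing about why the library is missing.

```python
import ctypes


def can_load_shared_library(name):
    """Return (True, None) if the dynamic loader can open `name`, else (False, reason)."""
    try:
        ctypes.CDLL(name)
    except OSError as exc:
        return False, str(exc)
    return True, None


# Library name taken from the ImportError reported in this thread.
ok, reason = can_load_shared_library("libcudart.so.11.0")
print("libcudart.so.11.0 loadable:", ok, reason or "")
```

Running this in both rapids-stable and rapids-nightly would show whether the two environments differ in which CUDA runtime they expose.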

drabastomek, Jan 29 '21 23:01

@Christian8491 Update: on rapids-nightly the code from my previous comment using a single GPU runs correctly now.

However, if I use Dask then I get errors.

Repro:

from blazingsql import BlazingContext
from dask_gateway import Gateway
from distributed import Client

gateway = Gateway(address="http://a3469a832b3f249ff85f67ea80ce5efe-1647012415.us-west-2.elb.amazonaws.com",
                  auth='jupyterhub')
cluster = gateway.connect('dask-gateway.4efb6c10659c4873adcbcbce2d71fa8f')
client = Client(cluster)

bc = BlazingContext(dask_client=client)
bc.s3('bsql', bucket_name = 'bsql')
bc.create_table('taxi', 's3://bsql/data/samples/test.json', lines=True)

df = bc.sql('''
    SELECT *
    FROM taxi 
    LIMIT 10
''')

produces the following

BlazingContext ready
S3 Storage Plugin Registered Successfully
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-0e0065f2cf24> in <module>
      4 bc.create_table('taxi', 's3://bsql/data/samples/test.json', lines=True)
      5 
----> 6 df = bc.sql('''
      7     SELECT *
      8     FROM taxi

/opt/conda/envs/rapids-nightly/lib/python3.8/site-packages/pyblazing/apiv2/context.py in sql(self, query, algebra, return_futures, single_gpu, config_options)
   3029                     )
   3030                     i = i + 1
-> 3031                 graph_futures = self.dask_client.gather(graph_futures)
   3032 
   3033                 dask_futures = []

/opt/conda/envs/rapids-nightly/lib/python3.8/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
   1991             else:
   1992                 local_worker = None
-> 1993             return self.sync(
   1994                 self._gather,
   1995                 futures,

/opt/conda/envs/rapids-nightly/lib/python3.8/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    837             return future
    838         else:
--> 839             return sync(
    840                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    841             )

/opt/conda/envs/rapids-nightly/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    338     if error[0]:
    339         typ, exc, tb = error[0]
--> 340         raise exc.with_traceback(tb)
    341     else:
    342         return result[0]

/opt/conda/envs/rapids-nightly/lib/python3.8/site-packages/distributed/utils.py in f()
    322             if callback_timeout is not None:
    323                 future = asyncio.wait_for(future, callback_timeout)
--> 324             result[0] = yield future
    325         except Exception as exc:
    326             error[0] = sys.exc_info()

/opt/conda/envs/rapids-nightly/lib/python3.8/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

/opt/conda/envs/rapids-nightly/lib/python3.8/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1856                             exc = CancelledError(key)
   1857                         else:
-> 1858                             raise exception.with_traceback(traceback)
   1859                         raise exc
   1860                     if errors == "skip":

TypeError: generateGraphs() missing 1 required positional argument: 'sql'
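The error is raised on the worker side and re-raised by gather. One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that the client and the Dask workers run different blazingsql builds, so the client calls generateGraphs with fewer arguments than the workers' copy expects. The sketch below only illustrates that mechanism in plain Python; generate_graphs_on_worker and call_matches_signature are hypothetical stand-ins and not BlazingSQL code.

```python
import inspect


# Hypothetical stand-in for the function as installed on the workers:
# this copy requires a `sql` argument.
def generate_graphs_on_worker(ctx_token, algebra, sql):
    return (ctx_token, algebra, sql)


def call_matches_signature(func, *args, **kwargs):
    """Return True if the given arguments can be bound to func's signature."""
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True


# A caller written before the `sql` parameter existed passes only two arguments.
old_style_args = (0, "placeholder algebra")
try:
    generate_graphs_on_worker(*old_style_args)
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```

If that is what happens here, comparing the blazingsql version in the notebook environment with the one installed on the workers would be a reasonable first check.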

drabastomek, Feb 09 '21 19:02