blazingsql
[BUG] Running SQL queries on JSON-based tables restarts kernel
Describe the bug
I cannot execute any SQL queries against a table that is based on a JSON file. I have tried all the possible orient options.
Steps/Code to reproduce bug
Here's a sample of the queries I tried. The underlying data is stored at s3://bsql/data/samples/test.json (100 records only) and has the following schema (records orient, line-delimited):
{"VendorID":1,"passenger_count":2}
{"VendorID":2,"passenger_count":6}
{"VendorID":2,"passenger_count":1}
Code I tried.
- No parameters to create_table
from blazingsql import BlazingContext
bc = BlazingContext()
bc.s3('bsql', bucket_name = 'bsql')
bc.create_table('taxi', 's3://bsql/data/samples/test.json')
df = bc.sql('''
SELECT *
FROM taxi
LIMIT 10
''')
Kernel hangs and restarts.
- Passing the lines parameter to create_table
from blazingsql import BlazingContext
bc = BlazingContext()
bc.s3('bsql', bucket_name = 'bsql')
bc.create_table('taxi', 's3://bsql/data/samples/test.json', lines=True)
df = bc.sql('''
SELECT *
FROM taxi
LIMIT 10
''')
Kernel hangs and restarts.
- Passing the lines and orient parameters to create_table
from blazingsql import BlazingContext
bc = BlazingContext()
bc.s3('bsql', bucket_name = 'bsql')
bc.create_table('taxi', 's3://bsql/data/samples/test.json', lines=True, orient='records')
df = bc.sql('''
SELECT *
FROM taxi
LIMIT 10
''')
bc complains that the orient parameter is not recognized.
- I also tested all of these variants with the different orient parameters I used to save the data, as well as the date_format: https://docs.rapids.ai/api/cudf/stable/api.html#cudf.io.json.to_json
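Since several orient variants were saved and tested, it may help to confirm which layout a given file actually has before handing it to create_table. The following is a rough sketch using only the standard library; `detect_json_layout` and its return labels are hypothetical names chosen for illustration, not part of blazingsql or cudf.

```python
import json


def detect_json_layout(text):
    """Classify JSON text as 'lines', 'array', 'object', 'empty', or 'invalid'.

    Note: a line-delimited file containing a single record is
    indistinguishable from a single JSON object and is reported as 'object'.
    """
    stripped = text.strip()
    if not stripped:
        return "empty"

    # First, try to parse the whole text as one JSON document.
    try:
        doc = json.loads(stripped)
    except json.JSONDecodeError:
        doc = None
    else:
        if isinstance(doc, list):
            return "array"
        if isinstance(doc, dict):
            return "object"
        return "invalid"

    # Otherwise, check whether every non-blank line is its own JSON object.
    try:
        rows = [json.loads(line) for line in stripped.splitlines() if line.strip()]
    except json.JSONDecodeError:
        return "invalid"
    return "lines" if all(isinstance(row, dict) for row in rows) else "invalid"
```

A file shaped like the sample above should be reported as `lines`; anything else suggests the data was saved with a different orient than intended.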
Expected behavior
Able to create and query the table built on top of a JSON file.
Environment overview (please complete the following information)
- Environment location: Docker
- Method of BlazingSQL install: Docker
- cudf version: 0.17.0
- blazingsql version: 230c9c6fac4909baeaee84b253bdd5d7cae1652d
cc: @williamBlazing, related to #349
@drabastomek could you try using a nightly version (of both blazingsql and cudf)? Does the issue still persist? If so, does it happen only when using a filesystem like s3, or also with a local filesystem? I am not able to reproduce this, either locally or using the data/samples/test.json file in s3, with an updated env.
@Christian8491 I can try nightly and circle back. The kernel simply halts and restarts without any communication to the user.
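Because the kernel dies without any message, one way to capture at least an exit status is to run the repro in a separate Python process instead of inside the notebook kernel. This is a minimal sketch under that assumption; `run_isolated` is a hypothetical helper, not something blazingsql provides.

```python
import subprocess
import sys


def run_isolated(code, timeout=120):
    """Run `code` in a fresh Python process and report how it ended.

    Returns (returncode, stdout, stderr). On POSIX, a negative return code
    means the child process was terminated by a signal (for example a
    segmentation fault), which would otherwise take the kernel down.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr
```

Passing the repro script as the `code` string would show whether the process exits with an error or is killed by a signal, and anything written to stderr before the crash is preserved.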
@Christian8491 I ran this code on rapids-nightly in beta:
from blazingsql import BlazingContext
bc = BlazingContext()
bc.s3('bsql', bucket_name = 'bsql')
bc.create_table('taxi', 's3://bsql/data/samples/test.json', lines=True)
df = bc.sql('''
SELECT *
FROM taxi
LIMIT 10
''')
Got this error this time:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-89be86c1e1c8> in <module>
----> 1 from blazingsql import BlazingContext
2 bc = BlazingContext()
3 bc.s3('bsql', bucket_name = 'bsql')
4 bc.create_table('taxi', 's3://bsql/data/samples/test.json', lines=True)
5
~/.conda/envs/rapids-nightly/lib/python3.8/site-packages/blazingsql/__init__.py in <module>
1 from pyblazing.apiv2 import S3EncryptionType
2 from pyblazing.apiv2 import DataType
----> 3 from pyblazing.apiv2.context import BlazingContext
4
5 from cio import getProductDetailsCaller
~/.conda/envs/rapids-nightly/lib/python3.8/site-packages/pyblazing/apiv2/context.py in <module>
8 from threading import Lock
9 from weakref import ref
---> 10 from pyblazing.apiv2.filesystem import FileSystem
11 from pyblazing.apiv2 import DataType
12
~/.conda/envs/rapids-nightly/lib/python3.8/site-packages/pyblazing/apiv2/filesystem.py in <module>
1 from collections import OrderedDict
2
----> 3 import cio
4
5 from pyblazing.apiv2 import S3EncryptionType
ImportError: libcudart.so.11.0: cannot open shared object file: No such file or directory
I just reproduced this issue (rapids-nightly in beta): ImportError: libcudart.so.11.0: cannot open shared object file: No such file or directory. Note, however, that it fails at the from blazingsql import BlazingContext line.
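A quick way to check whether the dynamic loader can find the library named in the ImportError, independently of blazingsql, is to try loading it with ctypes. This is only a sketch; `can_load_shared_library` is a hypothetical helper, and a successful load does not by itself prove the rest of the environment is consistent.

```python
import ctypes


def can_load_shared_library(name):
    """Return (True, "") if the loader can open `name`, else (False, reason)."""
    try:
        ctypes.CDLL(name)
    except OSError as exc:
        # The loader's message names the file it could not open.
        return False, str(exc)
    return True, ""


# The library name is taken from the ImportError above.
ok, reason = can_load_shared_library("libcudart.so.11.0")
```

If `ok` is false in the nightly environment but true in the stable one, that would point at the environment rather than at the query itself.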
Correct. But the same code works in rapids-stable. I think our nightly env is broken. cc: @mario21ic
@Christian8491 Update: on rapids-nightly, the code from my previous comment now runs correctly using a single GPU.
However, if I use Dask then I get errors.
Repro:
from blazingsql import BlazingContext
from dask_gateway import Gateway
from distributed import Client
gateway = Gateway(address="http://a3469a832b3f249ff85f67ea80ce5efe-1647012415.us-west-2.elb.amazonaws.com",
auth='jupyterhub')
cluster = gateway.connect('dask-gateway.4efb6c10659c4873adcbcbce2d71fa8f')
client = Client(cluster)
bc = BlazingContext(dask_client=client)
bc.s3('bsql', bucket_name = 'bsql')
bc.create_table('taxi', 's3://bsql/data/samples/test.json', lines=True)
df = bc.sql('''
SELECT *
FROM taxi
LIMIT 10
''')
produces the following
BlazingContext ready
S3 Storage Plugin Registered Successfully
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-0e0065f2cf24> in <module>
4 bc.create_table('taxi', 's3://bsql/data/samples/test.json', lines=True)
5
----> 6 df = bc.sql('''
7 SELECT *
8 FROM taxi
/opt/conda/envs/rapids-nightly/lib/python3.8/site-packages/pyblazing/apiv2/context.py in sql(self, query, algebra, return_futures, single_gpu, config_options)
3029 )
3030 i = i + 1
-> 3031 graph_futures = self.dask_client.gather(graph_futures)
3032
3033 dask_futures = []
/opt/conda/envs/rapids-nightly/lib/python3.8/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
1991 else:
1992 local_worker = None
-> 1993 return self.sync(
1994 self._gather,
1995 futures,
/opt/conda/envs/rapids-nightly/lib/python3.8/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
837 return future
838 else:
--> 839 return sync(
840 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
841 )
/opt/conda/envs/rapids-nightly/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
338 if error[0]:
339 typ, exc, tb = error[0]
--> 340 raise exc.with_traceback(tb)
341 else:
342 return result[0]
/opt/conda/envs/rapids-nightly/lib/python3.8/site-packages/distributed/utils.py in f()
322 if callback_timeout is not None:
323 future = asyncio.wait_for(future, callback_timeout)
--> 324 result[0] = yield future
325 except Exception as exc:
326 error[0] = sys.exc_info()
/opt/conda/envs/rapids-nightly/lib/python3.8/site-packages/tornado/gen.py in run(self)
760
761 try:
--> 762 value = future.result()
763 except Exception:
764 exc_info = sys.exc_info()
/opt/conda/envs/rapids-nightly/lib/python3.8/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1856 exc = CancelledError(key)
1857 else:
-> 1858 raise exception.with_traceback(traceback)
1859 raise exc
1860 if errors == "skip":
TypeError: generateGraphs() missing 1 required positional argument: 'sql'
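The thread does not establish the cause of this TypeError. One possibility worth ruling out, stated here purely as an assumption, is that the client and the Dask workers run different package versions, so the two sides disagree on a function signature. The sketch below compares version dictionaries in plain Python; `find_version_mismatches` is a hypothetical helper, and the version strings in the example are invented placeholders, not values from this report.

```python
def find_version_mismatches(client_versions, worker_versions):
    """Return {worker: {package: (client_version, worker_version)}} for differences.

    `client_versions` maps package name -> version on the client.
    `worker_versions` maps worker address -> the same kind of mapping.
    """
    mismatches = {}
    for worker, versions in worker_versions.items():
        for package, expected in client_versions.items():
            found = versions.get(package)  # None if the package is missing
            if found != expected:
                mismatches.setdefault(worker, {})[package] = (expected, found)
    return mismatches


# Placeholder data for illustration only (not taken from the report).
example_client = {"blazingsql": "A", "cudf": "B"}
example_workers = {
    "worker-1": {"blazingsql": "A", "cudf": "B"},
    "worker-2": {"blazingsql": "OLD", "cudf": "B"},
}
example_result = find_version_mismatches(example_client, example_workers)
```

In a real cluster the per-worker dictionaries would have to be collected from the workers themselves; an empty result would rule this explanation out.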