
Add sparksql dialect

Open • gmcoringa opened this pull request on Oct 18, 2018 • 20 comments

This PR is based on https://github.com/dropbox/PyHive/pull/187 and only adds some fixes for PEP8 compliance.

No unit tests were added for this new dialect because many of the tests run by sqlalchemy_test_case will fail due to Spark's lack of support for some types (SPARK-21529).
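For context, here is a minimal sketch of how the new dialect would be used through SQLAlchemy. The sparksql:// URL scheme, host, and port are assumptions modeled on the existing hive:// dialect, not details confirmed by this PR:

```python
# Sketch only: the URL scheme, host, and port are assumptions modeled on
# PyHive's existing hive:// dialect; adjust to however the PR registers it.
from sqlalchemy import create_engine, text

engine = create_engine('sparksql://localhost:10000/default')
with engine.connect() as conn:
    # Trivial round trip through the Spark Thrift server.
    print(conn.execute(text('SELECT 1')).scalar())
```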

gmcoringa avatar Oct 18 '18 18:10 gmcoringa

Codecov Report

Merging #247 into master will decrease coverage by 2.81%. The diff coverage is 0%.


@@            Coverage Diff             @@
##           master     #247      +/-   ##
==========================================
- Coverage   93.94%   91.12%   -2.82%     
==========================================
  Files          14       15       +1     
  Lines        1487     1533      +46     
  Branches      159      169      +10     
==========================================
  Hits         1397     1397              
- Misses         64      108      +44     
- Partials       26       28       +2
Impacted Files Coverage Δ
pyhive/sqlalchemy_sparksql.py 0% <0%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update d19cb0c...a07d74d.

codecov-io avatar Oct 18 '18 18:10 codecov-io

What is holding this PR up?

@jingw - Is there something that we can do to move this PR along and make it part of the project, or does Spark SQL not fit the mission?

nchammas avatar Oct 06 '19 01:10 nchammas

Many projects relying on PyHive run into problem #150. Is there any way we can get this PR merged?

serialx avatar Oct 15 '19 12:10 serialx

+1. Any plans to get this merged?

prongs avatar May 08 '20 07:05 prongs

I hadn't seen this PR; it looks nice. If someone could add unit tests, I'd be happy to merge and do another PyHive release.

bkyryliuk avatar May 08 '20 17:05 bkyryliuk

@bkyryliuk it seems there have been efforts to get Spark SQL in for a long time, but many previous PRs have eventually gone stale. As any potential problems are limited to Spark SQL only, in the interest of getting this functionality out there, I wonder if it would make sense to let this in without rigorous tests and add tests later if/when problems surface?

villebro avatar May 12 '20 05:05 villebro

It would be quite challenging to maintain from our perspective, as we don't use Spark much. I'm not looking for 100% test coverage, but I would prefer to have at least a smoke test.

The Presto & Hive setup doesn't seem to be a very involved process (https://github.com/dropbox/PyHive/blob/master/scripts/travis-install.sh); I assume Spark would be somewhat similar.

bkyryliuk avatar May 12 '20 16:05 bkyryliuk

@bkyryliuk I've tried to add some unit tests, but many of the ones from sqlalchemy_test_case will fail due to the lack of support in Spark. It's possible to do some tests, but everything from sqlalchemy_test_case would have to be omitted.
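One rough sketch of that trade-off, assuming a dedicated test class (the class and method names below are illustrative, not PyHive's actual test layout): unsupported cases get skipped with a reference to SPARK-21529 instead of the whole suite being dropped.

```python
# Illustrative only: class and method names are assumptions, not the actual
# PyHive test layout; the point is skipping unsupported cases with a reason.
import unittest


class SparkSqlTypeTests(unittest.TestCase):
    @unittest.skip('Spark lacks support for these types (SPARK-21529)')
    def test_unsupported_type_reflection(self):
        pass  # would mirror the corresponding sqlalchemy_test_case checks
```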

gmcoringa avatar May 12 '20 17:05 gmcoringa

You can use the SQLAlchemy engine in those tests to do a pass for the unsupported functions. Superset has a good example: https://github.com/apache/incubator-superset/blob/903217f64d38b2083bb62a8a2b81686a607ba479/tests/sqllab_tests.py#L76
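A hedged sketch of that idea (not the Superset code itself; the import path and the overridden method are assumptions): subclass the shared suite, point it at a Spark engine, and stub the cases Spark can't support with a trivial statement so the rest still runs.

```python
# Illustrative only: the import path and the override below are assumptions
# about how unsupported cases could be stubbed, not PyHive's actual test code.
import unittest
import sqlalchemy
from pyhive.tests.sqlalchemy_test_case import SqlAlchemyTestCase  # assumed path


class TestSparkSqlDialect(unittest.TestCase, SqlAlchemyTestCase):
    @classmethod
    def create_engine(cls):
        # Connection URL is an assumption for illustration.
        return sqlalchemy.create_engine('sparksql://localhost:10000/default')

    def test_unsupported_reflection(self):
        # Hypothetical stand-in for a case Spark can't support: just check
        # that a trivial statement round-trips through the engine.
        with self.create_engine().connect() as connection:
            connection.execute(sqlalchemy.text('SELECT 1'))
```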

bkyryliuk avatar May 12 '20 19:05 bkyryliuk

So many people will be so happy if this is merged and released soon :)

ali-bahjati avatar May 13 '20 21:05 ali-bahjati

@bkyryliuk I can help write some tests if necessary, I think this would be a really nice feature to have.

villebro avatar May 13 '20 21:05 villebro

What's the status on this? Would be happy to help.

mickymiek avatar Nov 27 '20 15:11 mickymiek

I applied your changes to my project, but had a small issue with rows containing "# Partitioning", "Not partitioned", and one being empty. I saw there was a filter for a similar column in sqlalchemy_hive.py. I guess this differs from one Hive version to another (I'm using 2.3.7 here).

What I did to fix this was to change this line in sqlalchemy_sparksql.py so it filters those rows out.
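The resulting line, with `connection` and `full_table` coming from the surrounding method:

```python
rows = [
    column
    for column in connection.execute('DESCRIBE {}'.format(full_table)).fetchall()
    if column[0] not in {"# Partitioning", "Not partitioned", ""}
]
```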

mickymiek avatar Dec 03 '20 13:12 mickymiek

Any updates? Maybe we can help to get this through!

panoptikum avatar Sep 23 '21 16:09 panoptikum

I have been watching this one but haven't seen any action...

Data-drone avatar Sep 30 '21 12:09 Data-drone

Is there any progress with this? It's causing issues in related applications like Superset.

tomkos avatar Oct 18 '21 20:10 tomkos

Any updates regarding this PR?

OmarRehan avatar Jan 28 '22 21:01 OmarRehan

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ gmcoringa
❌ Hao Qin Tan


Hao Qin Tan seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

CLAassistant avatar Apr 16 '22 21:04 CLAassistant

Hello mates, what could we do to move ahead with this one? Ready to help!

jbguerraz avatar Jul 13 '22 14:07 jbguerraz

Same here. This is particularly important for catalog metadata fetching in tools like Superset: currently we can't use physical references to tables, only virtual SQL queries, and metadata exploration through the UI is blocked.

pedrosalgadowork avatar May 15 '23 10:05 pedrosalgadowork