[PECO-1803] Splitting the PySQL connector into core and non-core parts
Related Links
The databricks_sqlalchemy split is in this PR: https://github.com/databricks/databricks-sqlalchemy/pull/1
Description
The databricks-sql-python library is split so that the package size can be reduced for end users based on their requirements. In particular, pyarrow is the heavy component that is planned to be kept optional. The existing library is split into:
- databricks-sql-connector (This is kept so that the existing users' import flow does not change; see the example after this list)
- databricks-sql-connector-core (This is the lightweight library that separates out the core part)
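To illustrate why keeping the import flow intact matters, here is a minimal sketch of unchanged user code (the connection parameters are hypothetical placeholders):

```python
# User code is identical before and after the split: the sources now ship
# in databricks-sql-connector-core but unpack to the same site-packages
# path, so `databricks.sql` resolves exactly as before.
from databricks import sql

# Hypothetical connection parameters, for illustration only.
with sql.connect(
    server_hostname="example.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/example",
    access_token="dapi-example-token",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchall())
```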
Tasks Completed
- [x] Refactored the code into its respective folders based on the proposed design doc
- [x] Changed the pyproject.toml file to reflect the proper dependencies for the split
- [x] Made sure that all the existing e2e and unit tests pass pre- and post-split, ensuring parity
- [x] Added benchmarking queries to compare pre- and post-split performance, and created a dashboard for visualization
- [x] Added dependency tests to check how the library behaves when certain libraries are unavailable and the user requests their functionality (see the sketch after this list)
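For context, a dependency test along these lines can simulate a missing optional dependency and assert that the failure is loud and informative. This is a hedged sketch, not the actual test code from this PR; `load_arrow_table` is a hypothetical stand-in for any pyarrow-backed code path:

```python
import builtins
import sys

import pytest

_real_import = builtins.__import__


def _import_without_pyarrow(name, *args, **kwargs):
    # Pretend pyarrow is not installed in this environment.
    if name == "pyarrow" or name.startswith("pyarrow."):
        raise ImportError("No module named 'pyarrow'")
    return _real_import(name, *args, **kwargs)


def load_arrow_table():
    # Hypothetical pyarrow-backed code path, standing in for the
    # connector features that require the optional dependency.
    import pyarrow as pa
    return pa.table({"col": [1, 2, 3]})


def test_pyarrow_features_fail_loudly_when_missing(monkeypatch):
    monkeypatch.delitem(sys.modules, "pyarrow", raising=False)
    monkeypatch.setattr(builtins, "__import__", _import_without_pyarrow)
    with pytest.raises(ImportError, match="pyarrow"):
        load_arrow_table()
```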
How to Test
The testing pipeline remains the same as before the split.
pytest can be used to run both the integration and the unit tests directly, via `pytest [directory_name or file_name]`.
Addition of the dist folder in this repo
GitHub Actions has been set up in the databricks_sqlalchemy repo to run its tests using databricks_sql_connector_core. Running those tests currently requires the .whl file in the dist folder, so it has been added to this PR for temporary testing.
Once the library is published to a public repository such as PyPI, databricks_sqlalchemy will automatically download it from there and run the tests using GitHub Actions.
Performance Comparison - Benchmarking
A pre-split vs. post-split performance comparison has been made using both large and small queries to make sure there is no performance regression.
A dashboard has been created so that every time the benchmarking is run, the results are stored in benchfood and comparisons can be made easily.
As a first iteration, it is more or less okay. Still, there are a few major things to fix.
The core package should have most of the dependencies, as well as the tests, and so on. Dependencies like pyarrow should be optional for it, and thus not installed by default. The core package should keep the original project's structure: in other words, everything inside databricks_sql_connector_core should remain the same as it is now, just moved into its own folder.
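A minimal sketch of what the core package's pyproject.toml could look like under this scheme (names, versions, and layout below are assumptions, not final values):

```toml
# Sketch of databricks-sql-connector-core's pyproject.toml (assumed values).
[tool.poetry]
name = "databricks-sql-connector-core"
version = "1.0.0"
description = "Core part of the Databricks SQL connector (sketch)"
authors = ["Databricks"]
# Keep the original package layout so user imports stay the same.
packages = [{ include = "databricks", from = "src" }]

[tool.poetry.dependencies]
python = "^3.8"
# Heavy dependencies are declared optional, so they are not installed by default.
pyarrow = { version = ">=14.0.1", optional = true }

[tool.poetry.extras]
pyarrow = ["pyarrow"]
all = ["pyarrow"]
```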
The full package should have basically no source files, and only a dependency on the core package, but with all of its optional dependencies enabled. Please refer to the Poetry docs: https://python-poetry.org/docs/pyproject/#extras (look for the extras = ["all"] config of the dependency).
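Correspondingly, the full package's pyproject.toml could be reduced to something like this sketch (again, names and versions are assumptions; a stub package or Poetry's non-package mode may be needed so the project can still be built):

```toml
# Sketch of databricks-sql-connector's pyproject.toml (assumed values).
[tool.poetry]
name = "databricks-sql-connector"
version = "1.0.0"
description = "Full Databricks SQL connector (sketch)"
authors = ["Databricks"]

[tool.poetry.dependencies]
python = "^3.8"
# No sources of its own: depend on the core package with all extras enabled.
databricks-sql-connector-core = { version = "^1.0.0", extras = ["all"] }
```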
Why is this important, and how should it work? When you publish a package, Poetry collects the files from the folder where the pyproject.toml file is located, puts them into an archive, and uploads it to PyPI under the name specified in pyproject.toml. Later, when the package is installed, pip basically just unpacks that archive as-is into the folder where all packages are located (.../lib/python3/site-packages/). Let's imagine that we created two packages which both contain their sources in the folder xyz. After installing both of them, both would be corrupted, because their sources would be extracted to the same path on the target machine.
So with the configuration described above, all the source code will live in the core package. After installation, it will be extracted to the same folders as earlier, so users will not have to update their imports. Optional dependencies will not be installed by default, but if the user already has them installed (e.g. pyarrow), the corresponding features will immediately become (remain) available.
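The usual way to achieve this is a guarded import. A sketch, assuming the extras configuration above; `rows_to_arrow_table` and the exact error message are hypothetical, for illustration only:

```python
# Guarded optional import: features degrade gracefully when pyarrow is
# absent and light up when it is installed.
try:
    import pyarrow
except ImportError:
    pyarrow = None


def rows_to_arrow_table(rows):
    # Hypothetical helper standing in for any pyarrow-backed code path.
    if pyarrow is None:
        raise ImportError(
            "pyarrow is required for Arrow results; install it with "
            "`pip install databricks-sql-connector-core[pyarrow]`"
        )
    # rows is a list of dicts, e.g. [{"id": 1}, {"id": 2}]
    return pyarrow.Table.from_pylist(rows)
```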
Installing the full package will not add any new source files to the user's machine; everything is already installed with the core package. The full package should just tell pip to install the optional dependencies.