Fix #167: API performance via DB redesign
As described in issue #167, the table `conda_package` needs to be split into two tables:
- `conda_package`, containing the data about the packages themselves, regardless of the builds: name, license, version, summary, ...
- `conda_package_build`, containing the data about the builds of a conda package: build number, constrains, md5, size, timestamp, ...
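For illustration, a minimal SQLAlchemy sketch of the split; the column sets follow the lists above, and everything besides the two table names is an assumption, not the exact schema in this PR:

```python
# Illustrative sketch only: column names follow the lists above,
# not necessarily the exact schema introduced by this PR.
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class CondaPackage(Base):
    """Build-independent data: one row per package/version."""
    __tablename__ = "conda_package"

    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    version = Column(String, nullable=False)
    license = Column(String)
    summary = Column(String)


class CondaPackageBuild(Base):
    """Build-specific data: many builds per conda_package row."""
    __tablename__ = "conda_package_build"

    id = Column(Integer, primary_key=True)
    package_id = Column(Integer, ForeignKey("conda_package.id"), nullable=False)
    package = relationship(CondaPackage, backref="builds")
    build = Column(String, nullable=False)
    build_number = Column(Integer)
    constrains = Column(String)  # serialized list in practice
    md5 = Column(String)
    size = Column(Integer)
    timestamp = Column(Integer)
```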
Impacts of splitting this into two tables:
- Each environment build (table `build`) used to be linked to a `conda_package`. After this change, it needs to be associated with a `conda_package_build`.
- One build has N `conda_package_build`s, and one `conda_package_build` can be linked to N builds: this results in an N-N association. The table `build_conda_package_build` is used for this purpose.
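The N-N association could be a plain association table along these lines (again a sketch reusing `Base` from above, not the exact definition in this PR):

```python
from sqlalchemy import Column, ForeignKey, Integer, Table

# N-N association between environment builds and package builds.
build_conda_package_build = Table(
    "build_conda_package_build",
    Base.metadata,  # Base from the sketch above
    Column("build_id", Integer, ForeignKey("build.id"), primary_key=True),
    Column(
        "conda_package_build_id",
        Integer,
        ForeignKey("conda_package_build.id"),
        primary_key=True,
    ),
)
```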
Check the new DB schema:
I'm opening this PR as a draft because I need to see how the API tests behave on GH Actions; I struggle to make them work properly locally. I expect multiple changes to be needed.
Multiple questions for you @costrouc:
- Question 1: When I started this work, the tables `solve` and `solve_conda_package` were not there yet. I assume that `solve_conda_package` should point to a `conda_package_build` instead of a `conda_package`. Do you confirm? If yes, I'll make the change.
- Question 2: Inside the table `conda_package`, multiple data columns are still duplicated. For instance, the license, the description, and the summary are always the same for a given package, regardless of the channel. Moving these data into a `conda_package_metadata` table would make the DB lighter and improve performance, as we would request this metadata only when needed. This can be done later in a separate PR, or now.
- Question 3: Check the conflict below between my branch and main, in `orm.py`, which I'm not resolving on purpose. The `batch_size` parameter was used to reduce the load on the DB when updating conda channels, right? Now that we don't insert as much data as before, do you think it's still needed?
- Question 4: Do you have any performance metrics so I can compare? Can we imagine having tests that measure performance and help compare the improvements (see the sketch after this list)? It's hard to do so manually.
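As a starting point for Question 4, even a crude timing probe against the endpoint would let us compare before/after; the URL, port, and threshold below are assumptions, not existing project fixtures:

```python
# Hypothetical timing probe for the package listing endpoint.
import time

import requests


def test_list_packages_is_fast():
    start = time.perf_counter()
    response = requests.get("http://localhost:5000/api/v1/package/")
    elapsed = time.perf_counter() - start
    assert response.status_code == 200
    # With ~310,000 packages the old schema took 1 to 1.5 seconds here;
    # tighten the bound as the redesign allows.
    assert elapsed < 0.5
```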
TODO:
- [x] Finish API impacts
- [x] Make all tests work
- [x] Update documentation: DB schema, API, ...
@costrouc The relevant commits are: 7a485ca3c09955146dff4f2e8ff86b0253247d39, 7ecf00135bdf70e8694d4ed9ac28fc643e737146, e5d1495dba711a1ff914c96993e7eaf91901ae23, e955ef95b4fd3d2ae1d035c07bc43f86548a7245, dd290a0415e4a7c1b51834f7bff29e06524284db
Meeting with Chris 2022-05-26:
- Question 1: `solve_conda_package` should point to a `conda_package_build` instead of a `conda_package`.
- Question 2: We can split `conda_package` into `conda_package` + `conda_package_metadata`. To do in another PR, and test migration with it.
- Question 3: Multiple processes insert new rows into the conda package table. That leads to constraint errors when adding a package that already exists. SQLAlchemy doesn't seem to have a proper generic solution for this; we would need to do something specific for each DB engine (see the sketch after these notes). `batch_size` is a more convenient solution.
- Question 4: Test with `api/v1/package`; there were 310,000 packages and it was very slow, 1 to 1.5 seconds.
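For reference, the engine-specific alternative mentioned under Question 3 would look roughly like this on PostgreSQL (a sketch; the table, rows, and unique key are placeholders), which is exactly the per-engine code that keeping `batch_size` avoids:

```python
# PostgreSQL-only alternative to batch_size: skip rows that already
# exist instead of failing on the unique constraint. Each DB engine
# needs its own variant of this.
from sqlalchemy import Table
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.engine import Connection


def insert_packages_ignore_existing(connection: Connection, table: Table, rows: list):
    stmt = insert(table).values(rows).on_conflict_do_nothing(
        index_elements=["channel_id", "name", "version"]  # placeholder unique key
    )
    connection.execute(stmt)
```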
Next steps:
- [x] Resolve question 1 with conda_package_build and solve
- [x] Resolve question 3 with batch_size + resolve the conflict in this PR
- [x] Open a migration-oriented issue, using question 2 as an example of migration to do
- [x] Resolve question 4 as described above
- [x] Finish API impacts
- [ ] Make all tests work
- [ ] Update documentation: DB schema, API, ...
@pierrotsmnrd performance is great, love how you preserved the API. I don't see any conflicts except for requiring Redis. It is completely fine if it is optional. It is possible to check for Redis and, if it does not exist, fall back to retrying+backoff if bulk inserts fail.
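One possible shape for that fallback, assuming redis-py's `lock` and SQLAlchemy's `bulk_save_objects`; the lock name, retry count, and delays are all assumptions, not project code:

```python
# Sketch of the suggested fallback: use a Redis lock when available,
# otherwise retry failed bulk inserts with exponential backoff.
import time

import sqlalchemy.exc


def bulk_insert_packages(session, objects, redis_client=None,
                         max_retries=5, base_delay=0.5):
    if redis_client is not None:
        # Redis available: serialize writers with a distributed lock.
        with redis_client.lock("conda-store:package-insert"):
            session.bulk_save_objects(objects)
            session.commit()
        return
    # No Redis: retry on constraint violations with exponential backoff
    # (in practice one would also drop rows that now exist before retrying).
    for attempt in range(max_retries):
        try:
            session.bulk_save_objects(objects)
            session.commit()
            return
        except sqlalchemy.exc.IntegrityError:
            session.rollback()
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("bulk insert failed after retries")
```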