
Fix #167: API performance via DB redesign

pierrotsmnrd opened this pull request • 3 comments

As described in issue #167, the conda_package table needs to be split into two tables:

  • conda_package containing the data about the packages, regardless of the builds: name, license, version, summary, etc.
  • conda_package_build containing the data about the builds of a conda package: build number, constrains, md5, size, timestamp, etc.

Impacts of splitting this into two tables:

  • Each environment build (table build) used to be linked to a conda_package; it now needs to be associated with a conda_package_build.
  • One build has N conda_package_build rows, and one conda_package_build can be linked to N builds: this is an N-N (many-to-many) association. The table build_conda_package_build exists for this purpose.

Check the new DB schema:

[DB schema output not reproduced here]
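
For illustration, a minimal SQLAlchemy sketch of the split described above (simplified, assumed column names — not the exact models in this PR):

```python
# Illustrative sketch only — table/column names are simplified assumptions.
from sqlalchemy import BigInteger, Column, ForeignKey, Integer, String, Table
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

# N-N association between an environment build and the package builds it contains
build_conda_package_build = Table(
    "build_conda_package_build",
    Base.metadata,
    Column("build_id", ForeignKey("build.id"), primary_key=True),
    Column("conda_package_build_id", ForeignKey("conda_package_build.id"), primary_key=True),
)

class CondaPackage(Base):
    __tablename__ = "conda_package"
    id = Column(Integer, primary_key=True)
    # build-independent data
    name = Column(String, nullable=False)
    version = Column(String, nullable=False)
    license = Column(String)
    summary = Column(String)
    builds = relationship("CondaPackageBuild", back_populates="package")

class CondaPackageBuild(Base):
    __tablename__ = "conda_package_build"
    id = Column(Integer, primary_key=True)
    package_id = Column(ForeignKey("conda_package.id"), nullable=False)
    # build-specific data
    build = Column(String, nullable=False)
    build_number = Column(Integer)
    constrains = Column(String)
    md5 = Column(String)
    size = Column(BigInteger)
    timestamp = Column(BigInteger)
    package = relationship("CondaPackage", back_populates="builds")

class Build(Base):
    __tablename__ = "build"
    id = Column(Integer, primary_key=True)
    # each environment build now references package builds, not packages
    package_builds = relationship("CondaPackageBuild", secondary=build_conda_package_build)
```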

I'm opening this PR as a draft because I need to see how the API tests behave on GitHub Actions; I struggle to make them run properly locally. I expect multiple changes will be needed.

Multiple questions for you, @costrouc:

  • Question 1: when I started this work, the tables solve and solve_conda_package were not there yet. I assume that solve_conda_package should point to a conda_package_build instead of a conda_package. Can you confirm? If so, I'll make the change.

  • Question 2: inside the conda_package table, several data columns are still duplicated. For instance, the license, description, and summary are always the same for a given package, regardless of the channel. Moving this data into a conda_package_metadata table would make the DB lighter, save space, and improve performance, since we would query this metadata only when needed. This can be done now or later in a separate PR.

  • Question 3: check the conflict below between my branch and main, in orm.py, which I am deliberately not resolving. The batch_size parameter was used to reduce the load on the DB when updating conda channels, right? Now that we don't insert as much data as before, do you think it's still needed?

  • Question 4: do you have any performance metrics I can compare against? Could we add tests that measure performance and help quantify the improvements? It's hard to do so manually.
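
On Question 4, a rough manual measurement could look like the sketch below (the URL, port, and iteration count are assumptions about a local deployment, not part of this PR):

```python
# Rough timing of the package listing endpoint — not an automated benchmark,
# just a way to compare numbers before and after the schema change.
import statistics
import time

import requests

URL = "http://localhost:5000/api/v1/package"  # adjust to your deployment

samples = []
for _ in range(10):
    start = time.perf_counter()
    response = requests.get(URL)
    response.raise_for_status()
    samples.append(time.perf_counter() - start)

print(f"median: {statistics.median(samples):.3f}s  max: {max(samples):.3f}s")
```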

TODO:

  • [x] Finish API impacts
  • [x] Make all tests work
  • [x] Update documentation: DB schema, API, ...

pierrotsmnrd avatar May 13 '22 06:05 pierrotsmnrd

@costrouc The relevant commits are: 7a485ca3c09955146dff4f2e8ff86b0253247d39, 7ecf00135bdf70e8694d4ed9ac28fc643e737146, e5d1495dba711a1ff914c96993e7eaf91901ae23, e955ef95b4fd3d2ae1d035c07bc43f86548a7245, dd290a0415e4a7c1b51834f7bff29e06524284db

pierrotsmnrd avatar May 24 '22 05:05 pierrotsmnrd

Meeting with Chris, 2022-05-26:

  • Question 1: solve_conda_package should point to a conda_package_build instead of a conda_package.
  • Question 2: we can split conda_package into conda_package + conda_package_metadata. To be done in a separate PR, which will also be used to test migrations.
  • Question 3: multiple processes insert new rows into the conda_package table, which leads to constraint errors when adding a package that already exists. SQLAlchemy doesn't seem to have a proper generic solution for this; we would need to do something specific for each DB engine. Keeping batch_size is the more convenient solution (see the sketch after this list).
  • Question 4: tested with api/v1/package; there were 310,000 packages and it was very slow, taking 1 to 1.5 seconds.
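
A minimal sketch of the trade-off discussed for Question 3 (helper names and batch size are illustrative, not conda-store's actual code): a dialect-specific upsert avoids the constraint errors but has to be written per engine, whereas an engine-agnostic batched insert simply recovers from duplicates.

```python
# Illustrative only — not this PR's implementation.
from sqlalchemy.dialects.postgresql import insert as pg_insert
from sqlalchemy.exc import IntegrityError

def add_packages_postgres(session, table, rows):
    # PostgreSQL-only: ON CONFLICT DO NOTHING silently skips packages that already exist.
    session.execute(pg_insert(table).values(rows).on_conflict_do_nothing())
    session.commit()

def add_packages_generic(session, table, rows, batch_size=5000):
    # Engine-agnostic fallback: insert in batches, and retry a failing batch
    # row by row so a single duplicate does not abort the whole channel update.
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        try:
            session.execute(table.insert(), batch)
            session.commit()
        except IntegrityError:
            session.rollback()
            for row in batch:
                try:
                    session.execute(table.insert(), [row])
                    session.commit()
                except IntegrityError:
                    session.rollback()  # package already present, skip it
```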

Next steps :

  • [x] - Resolve question 1 with conda_package_build and solve
  • [x] - Resolve question 3 with batch_size + resolve the conflict in this PR
  • [x] - Open a migration-oriented issue, using question 2 as an example of a migration to do
  • [x] - Resolve question 4 as described above
  • [x] - Finish API impacts
  • [ ] - Make all tests work
  • [ ] - Update documentation: DB schema, API, ...

pierrotsmnrd avatar May 26 '22 14:05 pierrotsmnrd

@pierrotsmnrd Performance is great, and I love how you preserved the API. I don't see any conflicts except for requiring Redis, and it is completely fine if it is optional. It is possible to check for Redis and, if it is not available, fall back to retrying with backoff if bulk inserts fail.
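
A rough sketch of that fallback (hypothetical helper and retry parameters; assumes SQLAlchemy exceptions are what bulk-insert failures raise):

```python
# Illustrative fallback when Redis is not available: retry the bulk insert
# with exponential backoff instead of coordinating writers through a lock.
import random
import time

from sqlalchemy.exc import IntegrityError, OperationalError

def bulk_insert_with_backoff(session, table, rows, retries=5, base_delay=0.5):
    for attempt in range(retries):
        try:
            session.execute(table.insert(), rows)
            session.commit()
            return
        except (IntegrityError, OperationalError):
            session.rollback()
            # exponential backoff with jitter before trying again
            time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)
    raise RuntimeError("bulk insert failed after retries")
```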

costrouc avatar Oct 18 '22 16:10 costrouc