openfoodfacts-server
openfoodfacts-server copied to clipboard
Rebuild taxonomies as they change
When a taxonomy change is merged in main, create an action that rebuild taxonomies and propose it as a PR.
We do this because it will make validations on next pull requests faster (we have to rebuild changed taxonomies on every PR before launching tests).
We could also rebuild taxonomies directly on pull requests and commit but I find it messy:
- it will add a lot of commits for every commit made by author
- author as to constantly remind to pull again before commiting and some author will pull with a merge, making history on the PR quite difficult to read
This issue is stale because it has been open 90 days with no activity.
@stephanegigandet @teolemon IΒ want to tackle this since it makes me loose time as it takes a long time to rebuild taxonomy just to fix a single tests. On the other side, I fear that making rebuild more frequent will add quite an overhead to the git history (and thus size of the repository)
Ideally taxonomies are evolved on their own and merged to the project from times to times. This is what we already do for translations.
So here is my proposal:
- have a separate branch for taxonomies
- when someone modify a taxonomy, s/he does it on his branch, but, s/he must merge to the taxonomies branch (there is a CI that verifies this)
- test are not run on this branch, only checks for taxonomy
- at some point in time, a developer opens a PR for the taxonomy branch, rebuilds taxonomies, fixes the tests and merges it
- after merge, we open a new branch from main to continue work on taxonomies
The merge could happen every two weeks generally, sooner or later depending on activity.
A variant to this could be to have a fork of the repository for taxonomies (thus separating PRs for taxonomies and maybe issues in a separate repository). I'm not sure if it's a good idea, as it may bring confusion.
I would prefer to do it in a way similar to release please: whenever a taxonomy is merged to main, we have a taxonomy-build-please PR that gets created.
I would prefer to do it in a way similar to release please: whenever a taxonomy is merged to main, we have a taxonomy-build-please PR that gets created.
We have to discuss it, because for me it does not really solve the problem.
Could we use git submodules for this? So have the taxonomies in a separate repo and have the main PO repo link to this. Taxonomies can then progress independently and then from time to time (or when a code change specifically requires taxonomy changes) we'd refresh the submodule to point to the current head of the taxonomies repo and rebuild the taxonomy binary files (which could stay in the main PO repo).
To make it even easier we could update the build_taxonomies script to automatically update the submodule to the latest revision on the taxonomies repo and then run the update_tests_results immediately after. That way we can specifically see the test result changes caused by taxonomy updates vs. the changes caused by updates to the code.
Been working on this and I have a few concerns:
- When we look for taxonomies that have changed we don't take into account the dates on composite files, like ingredients. So if additives changes we don't automatically rebuild ingredients
- Very often the build process runs out of memory, so doesn't finish
- There is a circular dependency between languages and taxonomies
Just doing some tests now to see if rebuilding taxonomies from scratch gives same results as incremental
Another problem is that taxonomy builds depend on languages, so a taxonomy may have indirectly changed because of a language change. Hence any language change would require a full taxonomy rebuild
Refining 3 above it looks like the main circular dependency issue is nutrient_levels which is build automatically. Maybe should build this as part of build_lang?
Another dependency problem is when taxonomies depend on one another, e.g. if one taxonomy references a synonym in another then this gets converted to the canonical name during the build process