openfoodfacts-server icon indicating copy to clipboard operation
openfoodfacts-server copied to clipboard

Rebuild taxonomies as they change

Open alexgarel opened this issue 2 years ago β€’ 1 comments

When a taxonomy change is merged in main, create an action that rebuild taxonomies and propose it as a PR.

We do this because it will make validations on next pull requests faster (we have to rebuild changed taxonomies on every PR before launching tests).

We could also rebuild taxonomies directly on pull requests and commit but I find it messy:

  • it will add a lot of commits for every commit made by author
  • author as to constantly remind to pull again before commiting and some author will pull with a merge, making history on the PR quite difficult to read

alexgarel avatar Jun 13 '22 13:06 alexgarel

This issue is stale because it has been open 90 days with no activity.

github-actions[bot] avatar Sep 12 '22 00:09 github-actions[bot]

@stephanegigandet @teolemon IΒ want to tackle this since it makes me loose time as it takes a long time to rebuild taxonomy just to fix a single tests. On the other side, I fear that making rebuild more frequent will add quite an overhead to the git history (and thus size of the repository)

Ideally taxonomies are evolved on their own and merged to the project from times to times. This is what we already do for translations.

So here is my proposal:

  • have a separate branch for taxonomies
  • when someone modify a taxonomy, s/he does it on his branch, but, s/he must merge to the taxonomies branch (there is a CI that verifies this)
  • test are not run on this branch, only checks for taxonomy
  • at some point in time, a developer opens a PR for the taxonomy branch, rebuilds taxonomies, fixes the tests and merges it
  • after merge, we open a new branch from main to continue work on taxonomies

The merge could happen every two weeks generally, sooner or later depending on activity.

A variant to this could be to have a fork of the repository for taxonomies (thus separating PRs for taxonomies and maybe issues in a separate repository). I'm not sure if it's a good idea, as it may bring confusion.

alexgarel avatar Dec 19 '22 08:12 alexgarel

I would prefer to do it in a way similar to release please: whenever a taxonomy is merged to main, we have a taxonomy-build-please PR that gets created.

stephanegigandet avatar Dec 22 '22 11:12 stephanegigandet

I would prefer to do it in a way similar to release please: whenever a taxonomy is merged to main, we have a taxonomy-build-please PR that gets created.

We have to discuss it, because for me it does not really solve the problem.

alexgarel avatar Dec 23 '22 19:12 alexgarel

Could we use git submodules for this? So have the taxonomies in a separate repo and have the main PO repo link to this. Taxonomies can then progress independently and then from time to time (or when a code change specifically requires taxonomy changes) we'd refresh the submodule to point to the current head of the taxonomies repo and rebuild the taxonomy binary files (which could stay in the main PO repo).

john-gom avatar Jan 13 '23 17:01 john-gom

To make it even easier we could update the build_taxonomies script to automatically update the submodule to the latest revision on the taxonomies repo and then run the update_tests_results immediately after. That way we can specifically see the test result changes caused by taxonomy updates vs. the changes caused by updates to the code.

john-gom avatar Jan 13 '23 17:01 john-gom

Been working on this and I have a few concerns:

  1. When we look for taxonomies that have changed we don't take into account the dates on composite files, like ingredients. So if additives changes we don't automatically rebuild ingredients
  2. Very often the build process runs out of memory, so doesn't finish
  3. There is a circular dependency between languages and taxonomies

Just doing some tests now to see if rebuilding taxonomies from scratch gives same results as incremental

john-gom avatar Feb 06 '23 09:02 john-gom

Another problem is that taxonomy builds depend on languages, so a taxonomy may have indirectly changed because of a language change. Hence any language change would require a full taxonomy rebuild

john-gom avatar Feb 06 '23 09:02 john-gom

Refining 3 above it looks like the main circular dependency issue is nutrient_levels which is build automatically. Maybe should build this as part of build_lang?

john-gom avatar Feb 06 '23 10:02 john-gom

Another dependency problem is when taxonomies depend on one another, e.g. if one taxonomy references a synonym in another then this gets converted to the canonical name during the build process

john-gom avatar Feb 06 '23 11:02 john-gom