Pull dictionary snapshots from ambuda-org instead of the source

Open kvchitrapu opened this issue 2 years ago • 4 comments

Pulling dictionaries directly from their original sources during builds may put unnecessary load on those sources. Proposal:

  1. Snapshot the dictionaries.
  2. Check the snapshots into a new repo under ambuda-org.
  3. Pull dictionaries from ambuda-dictionaries during builds.
  4. In ambuda-dictionaries, run periodic (monthly) upgrade checks against the sources to pull the latest versions and save them as tarballs.
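
For illustration, the snapshot step might look roughly like the sketch below. The `snapshot` function, the file naming scheme, and the `.raw` member name are all hypothetical, and the download itself is left out: the function just takes bytes that were already fetched from a source.

```python
import hashlib
import io
import tarfile
from pathlib import Path
from typing import Optional


def snapshot(name: str, payload: bytes, out_dir: Path) -> Optional[Path]:
    """Save payload as a gzipped tarball named after its content hash.

    Returns the new tarball path, or None when a snapshot with the same
    content already exists (i.e. the upstream source has not changed).
    """
    # Hypothetical naming scheme: <name>-<first 12 hex chars of SHA-256>.tar.gz
    digest = hashlib.sha256(payload).hexdigest()[:12]
    target = out_dir / f"{name}-{digest}.tar.gz"
    if target.exists():
        return None

    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(target, "w:gz") as tar:
        info = tarfile.TarInfo(name=f"{name}.raw")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return target
```

A monthly job could call `snapshot` once per source and commit or publish only the tarballs it returns, so unchanged sources produce no new artifacts.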

kvchitrapu avatar Dec 25 '22 19:12 kvchitrapu

@kvchitrapu I think the time is right for this as well, if you want to pursue it.

akprasad avatar Feb 17 '23 06:02 akprasad

Sounds good. I'll work on it.

kvchitrapu avatar Feb 17 '23 21:02 kvchitrapu

Created the dictionaries repo and added a basic config, https://github.com/ambuda-org/dictionaries/blob/main/src/dictionaries.yaml, with a list of all dictionaries.

Planning to publish dictionary snapshots as packages on ghcr.io.

@akprasad, @suhasm, is there a standard format for dictionaries? Today we pull dictionaries from various sources and handle each one with a specialized function. Is there a standard format that would let us parse all of them with a common function?

In other words, I'm checking if this workflow is possible:

  1. A new worker process pulls these dictionaries as zip files, XML, or text.
  2. It saves the dictionaries in a standard format.
  3. It compresses and publishes the packages.
  4. Ambuda or any other application pulls the packages and parses them using a generic function.
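
To make the question concrete, here is a minimal sketch of steps 2 to 4, assuming a purely hypothetical "standard format": gzipped JSON Lines with one `{"key": ..., "definition": ...}` record per entry. The function names and the record fields are invented for illustration and are not an existing Ambuda API.

```python
import gzip
import json
from typing import Iterable, Iterator, Tuple

Entry = Tuple[str, str]  # (headword, definition); hypothetical record shape


def write_package(entries: Iterable[Entry], path: str) -> None:
    """Save entries in the assumed standard format and compress them."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for key, definition in entries:
            record = {"key": key, "definition": definition}
            # ensure_ascii=False keeps Devanagari and other scripts readable.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_package(path: str) -> Iterator[Entry]:
    """Generic parser: works for any package written by write_package."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["key"], record["definition"]
```

With something like this, the worker would call `write_package` once per dictionary, and the consuming application would only ever need `read_package`.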

kvchitrapu avatar Feb 21 '23 14:02 kvchitrapu

No, there isn't a standard dictionary format as far as I'm aware. Some dictionaries are in the Stardict format, which is easy to parse but lacks formatting, etc. The Cologne dictionaries use their own conventions, which are rich but a little harder to parse.

I am inclined to keep the data in this repo as "raw" as possible and do any transformation logic as part of the loading process on the Ambuda side. My thinking is:

  • We can easily diff against the upstream data
  • We have flexibility to change our transformation logic without causing huge diffs in the data
  • These jobs aren't expensive in the scheme of things -- a few minutes a month, at most
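
As a rough sketch of what "transform at load time" could look like, the loader side might keep a small registry of per-format parsers and leave the stored data untouched. Everything here is hypothetical: the `loader` decorator, the `LOADERS` registry, and the toy `tab-separated` format stand in for the real formats, which would each need their own parser.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Entry = Tuple[str, str]  # (headword, definition)
Loader = Callable[[str], Iterator[Entry]]

# Maps a format name to the function that parses raw text in that format.
LOADERS: Dict[str, Loader] = {}


def loader(fmt: str) -> Callable[[Loader], Loader]:
    """Decorator that registers a parser for one raw format."""
    def register(fn: Loader) -> Loader:
        LOADERS[fmt] = fn
        return fn
    return register


@loader("tab-separated")  # toy format, for illustration only
def load_tab_separated(raw: str) -> Iterator[Entry]:
    for line in raw.splitlines():
        if line.strip():  # skip blank lines
            key, _, definition = line.partition("\t")
            yield key, definition


def load(fmt: str, raw: str) -> List[Entry]:
    """Parse a raw snapshot at load time; the snapshot itself is never rewritten."""
    return list(LOADERS[fmt](raw))
```

Changing a parser then only changes what gets loaded into the application, not the snapshots, so the stored data stays easy to diff against upstream.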

akprasad avatar Feb 21 '23 16:02 akprasad