ambuda Pull dictionary snapshots from ambuda-org instead of the source

Open kvchitrapu opened this issue 2 years ago • 4 comments

Pulling dictionaries directly from their original sources may strain those sources.

Snapshot the dictionaries.
Check them into ambuda-org (a new repo).
Pull dictionaries from ambuda-dictionaries during builds.
In ambuda-dictionaries, run periodic (monthly) upgrade checks on the sources to pull the latest versions and save them as tarballs.
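
A minimal sketch of the snapshot and upgrade-check steps, assuming the raw bytes of a dictionary have already been downloaded. The function names, the `<name>-<version>.tar.gz` naming scheme, and the checksum comparison are illustrative assumptions, not something the repo defines:

```python
import hashlib
import io
import tarfile
from pathlib import Path
from typing import Optional


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of the raw dictionary bytes."""
    return hashlib.sha256(data).hexdigest()


def save_snapshot(name: str, data: bytes, version: str, out_dir: Path) -> Path:
    """Store one dictionary's raw bytes in <out_dir>/<name>-<version>.tar.gz.

    The archive layout (<name>/raw) is a hypothetical convention.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    tar_path = out_dir / f"{name}-{version}.tar.gz"
    with tarfile.open(tar_path, "w:gz") as tar:
        info = tarfile.TarInfo(name=f"{name}/raw")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return tar_path


def needs_upgrade(data: bytes, last_checksum: Optional[str]) -> bool:
    """Report whether freshly pulled data differs from the last snapshot."""
    return last_checksum is None or sha256_hex(data) != last_checksum
```

A monthly job could call `needs_upgrade` on each source and write a new tarball only when the checksum has changed.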

Dec 25 '22 19:12 kvchitrapu

@kvchitrapu I think the time is right for this as well, if you want to pursue it.

Feb 17 '23 06:02 akprasad

Sounds good. I'll work on them.

Feb 17 '23 21:02 kvchitrapu

Created the dictionaries repo and added a basic config, https://github.com/ambuda-org/dictionaries/blob/main/src/dictionaries.yaml, with a list of all dictionaries.

Planning to publish dictionary snapshots as packages on ghcr.io.

@akprasad, @suhasm, is there a standard format for dictionaries? Today we pull dictionaries from various sources and handle each one with a specialized function. Is there a standard dictionary format, so that all dictionaries can be parsed with a common function?

In other words, I'm checking if this workflow is possible:

A new worker process pulls these dictionaries as zip files, XML, or text.
It saves the dictionaries in a standard format.
It compresses and publishes the packages.
Ambuda or any other application pulls the packages and parses them using a generic function.
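
A minimal sketch of the "standard format plus generic parser" idea, assuming a hypothetical common format: gzip-compressed JSON Lines with `key` and `definition` fields. Neither the format nor the function names exist in Ambuda today. They only illustrate what a shared parser could look like:

```python
import gzip
import json
from typing import Iterable, Iterator, Tuple

# (headword, definition) pair; a deliberately simplified, hypothetical entry shape.
Entry = Tuple[str, str]


def pack_entries(entries: Iterable[Entry]) -> bytes:
    """Worker side: serialize entries as JSON Lines and gzip-compress them."""
    lines = [
        json.dumps({"key": key, "definition": definition}, ensure_ascii=False)
        for key, definition in entries
    ]
    return gzip.compress("\n".join(lines).encode("utf-8"))


def parse_package(blob: bytes) -> Iterator[Entry]:
    """Consumer side: one generic parser for every published package."""
    text = gzip.decompress(blob).decode("utf-8")
    # Split on "\n" only; json.dumps escapes newlines inside string values.
    for line in text.split("\n"):
        if line.strip():
            record = json.loads(line)
            yield record["key"], record["definition"]
```

With this shape, the specialized per-source logic would live in the worker, and Ambuda would only need `parse_package`.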

Feb 21 '23 14:02 kvchitrapu

No, there isn't a standard dictionary format as far as I'm aware. Some dictionaries are in the Stardict format which is easy to parse but lacks formatting, etc. The Cologne dictionaries use their own conventions that are rich but a little harder to parse.

I am inclined to keep the data in this repo as "raw" as possible and do any transformation logic as part of the loading process on the Ambuda side. My thinking is:

We can easily diff against the upstream data
We have flexibility to change our transformation logic without causing huge diffs in the data
These jobs aren't expensive in the scheme of things -- a few minutes a month, at most
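
A rough sketch of this split, assuming a hypothetical tab-separated raw format. The function names are illustrative and are not Ambuda's actual loaders. The snapshot stays byte-for-byte comparable with upstream, and parsing happens only at load time:

```python
import difflib
from typing import Callable, Dict, List


def diff_against_upstream(snapshot: str, upstream: str) -> List[str]:
    """Return a unified diff between the stored raw snapshot and upstream."""
    return list(
        difflib.unified_diff(
            snapshot.splitlines(),
            upstream.splitlines(),
            fromfile="snapshot",
            tofile="upstream",
            lineterm="",
        )
    )


def parse_tab_separated(raw: str) -> Dict[str, str]:
    """Example transform for a hypothetical 'key<TAB>definition' raw format."""
    entries: Dict[str, str] = {}
    for line in raw.splitlines():
        if "\t" in line:
            key, definition = line.split("\t", 1)
            entries[key] = definition
    return entries


def load_dictionary(
    raw: str, transform: Callable[[str], Dict[str, str]]
) -> Dict[str, str]:
    """Apply the transformation at load time, leaving the snapshot untouched."""
    return transform(raw)
```

Changing `parse_tab_separated` later would alter what Ambuda loads without producing any diff in the stored snapshots.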

Feb 21 '23 16:02 akprasad
