Pull dictionary snapshots from ambuda-org instead of the source
Pulling dictionaries directly from their original sources may put unnecessary load on those sources. Proposed steps:
- Snapshot the dictionaries.
- Check them into a new repo under ambuda-org.
- Pull dictionaries from ambuda-dictionaries during builds.
- In ambuda-dictionaries, run periodic (monthly) upgrade checks on the sources to pull the latest versions and save them as tarballs.
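The snapshot and upgrade-check steps above could look roughly like the following. This is only a minimal sketch: the function names, the `<name>-<stamp>.tar.gz` naming scheme, and the use of a SHA-256 digest to decide whether upstream changed are all assumptions for illustration, not anything the repo does today. Downloading is left out; the functions take the already-fetched bytes.

```python
import hashlib
import io
import tarfile
from pathlib import Path
from typing import Optional


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 digest of the raw upstream bytes."""
    return hashlib.sha256(payload).hexdigest()


def needs_update(payload: bytes, previous_digest: Optional[str]) -> bool:
    """True if there is no earlier snapshot or the upstream bytes changed."""
    return previous_digest != sha256_hex(payload)


def write_snapshot(name: str, payload: bytes, out_dir: Path, stamp: str) -> Path:
    """Store the raw bytes, untransformed, as <name>-<stamp>.tar.gz.

    The archive layout (<name>/source.raw) is a hypothetical convention.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    tar_path = out_dir / f"{name}-{stamp}.tar.gz"
    info = tarfile.TarInfo(name=f"{name}/source.raw")
    info.size = len(payload)
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))
    return tar_path
```

A monthly job would fetch each source, call `needs_update` with the digest recorded for the previous snapshot, and call `write_snapshot` only when it returns `True`.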
@kvchitrapu I think the time is right for this as well, if you want to pursue it.
Sounds good. I'll work on them.
Created the dictionaries repo. Added this basic config https://github.com/ambuda-org/dictionaries/blob/main/src/dictionaries.yaml with a list of all dictionaries.
Planning to publish dictionary snapshots as packages on ghcr.io.
@akprasad, @suhasm, is there a standard format for dictionaries? Today we pull dictionaries from various sources and handle each one with a specialized function. Is there a standard format that would let us parse all of them with a common function?
In other words, I'm checking if this workflow is possible:
- A new worker process pulls these dictionaries as zip files, XML, or text.
- It saves the dictionaries in a standard format.
- It compresses and publishes the packages.
- Ambuda or any other application pulls the packages and parses them using a generic function.
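To make the "generic function" idea concrete, here is a minimal sketch of one possible shape: a single `parse` entry point that dispatches to a per-format parser, with every parser yielding the same `Entry` type. Everything here is hypothetical (the `Entry` fields, the registry, and the toy `tsv` format); it does not describe Ambuda's current loading code or any real dictionary format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator


@dataclass(frozen=True)
class Entry:
    """Hypothetical common shape for a parsed dictionary entry."""
    key: str
    body: str


Parser = Callable[[str], Iterator[Entry]]
PARSERS: Dict[str, Parser] = {}


def register(fmt: str) -> Callable[[Parser], Parser]:
    """Decorator that registers a parser under a format name."""
    def wrap(fn: Parser) -> Parser:
        PARSERS[fmt] = fn
        return fn
    return wrap


@register("tsv")
def parse_tsv(raw: str) -> Iterator[Entry]:
    """Toy format for illustration: one 'key<TAB>body' pair per line."""
    for line in raw.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, body = line.partition("\t")
        yield Entry(key=key, body=body)


def parse(fmt: str, raw: str) -> Iterator[Entry]:
    """Generic entry point: look up the parser for `fmt` and run it."""
    try:
        parser = PARSERS[fmt]
    except KeyError:
        raise ValueError(f"no parser registered for format {fmt!r}") from None
    return parser(raw)
```

With this shape, callers only ever call `parse(fmt, raw)`, while the format-specific logic stays in small registered functions.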
No, there isn't a standard dictionary format as far as I'm aware. Some dictionaries are in the Stardict format which is easy to parse but lacks formatting, etc. The Cologne dictionaries use their own conventions that are rich but a little harder to parse.
I am inclined to keep the data in this repo as "raw" as possible and do any transformation logic as part of the loading process on the Ambuda side. My thinking is:
- We can easily diff against the upstream data
- We have flexibility to change our transformation logic without causing huge diffs in the data
- These jobs aren't expensive in the scheme of things -- a few minutes a month, at most
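As a rough illustration of the first point, keeping snapshots raw means a plain text diff against upstream is meaningful. The sketch below uses Python's standard `difflib`; the function name and the `snapshot`/`upstream` labels are assumptions, and it assumes the source is text that can be compared line by line.

```python
import difflib
from typing import List


def diff_against_upstream(snapshot: str, upstream: str) -> List[str]:
    """Return unified-diff lines; an empty list means nothing changed."""
    return list(
        difflib.unified_diff(
            snapshot.splitlines(),
            upstream.splitlines(),
            fromfile="snapshot",
            tofile="upstream",
            lineterm="",
        )
    )
```

If the snapshot had been transformed before being checked in, this comparison would first require re-running the same transformation on the upstream data.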