finalfusion-rust

Pretrained embedding fetcher

Open danieldk opened this issue 4 years ago • 1 comments

I think it would be nice to have a small utility data structure to fetch pretrained embeddings. I don't think this needs to be part of the finalfusion crate, since it is not really core functionality. The basic idea is:

  • We'd have a repository finalfusion-fetcher with some metadata file (probably JSON), mapping embedding file identifiers to URLs. E.g. fasttext.wiki.nl.fifu could map to http://www.sfs.uni-tuebingen.de/a3-public-data/finalfusion-fasttext/wiki/wiki.nl.fifu

  • A small crate (possibly in the same repo) would provide a data structure, Fetcher, with a constructor that retrieves the metadata and returns a fetcher:

    let fetcher = Fetcher::fetch_metadata().unwrap();
    

    A user could then open embeddings:

    let dutch_embeddings = fetcher.open("fasttext.wiki.nl.fifu").unwrap();
    

    This method would check whether the embeddings are already available locally. If not, it would fetch them and store them in a standard XDG location. It would then open the embeddings stored in that location.

    Similarly, Fetcher::mmap could be used to memory-map an embedding after downloading.
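To make the lookup and caching steps above concrete, here is a rough sketch using only the standard library. None of this exists yet: `from_metadata`, `url`, `cache_path`, `is_cached` and `xdg_cache_dir` are hypothetical names, the `finalfusion` subdirectory is made up, and picking the XDG cache directory (rather than, say, the data directory) is just one assumption about what "a standard XDG location" could mean. Downloading and parsing the JSON metadata are left out.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Hypothetical fetcher: maps embedding identifiers to URLs and knows
/// where downloaded files would be cached.
pub struct Fetcher {
    urls: HashMap<String, String>,
    cache_dir: PathBuf,
}

impl Fetcher {
    /// Build a fetcher from metadata that was already retrieved and parsed.
    /// (A real `fetch_metadata` would download the metadata file first.)
    pub fn from_metadata(urls: HashMap<String, String>, cache_dir: PathBuf) -> Self {
        Fetcher { urls, cache_dir }
    }

    /// URL for an embedding identifier, if the metadata lists it.
    pub fn url(&self, id: &str) -> Option<&str> {
        self.urls.get(id).map(String::as_str)
    }

    /// Location where the embedding file would be stored locally.
    pub fn cache_path(&self, id: &str) -> PathBuf {
        self.cache_dir.join(id)
    }

    /// Whether the embedding was already downloaded, so `open` could skip the fetch.
    pub fn is_cached(&self, id: &str) -> bool {
        self.cache_path(id).is_file()
    }
}

/// Resolve the cache directory following the XDG Base Directory convention:
/// use `$XDG_CACHE_HOME` if it is an absolute path, otherwise `$HOME/.cache`.
/// The values are passed in as arguments so the function stays easy to test;
/// a caller would read them with `std::env::var`.
pub fn xdg_cache_dir(xdg_cache_home: Option<&str>, home: Option<&str>) -> Option<PathBuf> {
    let base = match xdg_cache_home.map(PathBuf::from) {
        Some(path) if path.is_absolute() => path,
        // Unset, empty, or relative values are ignored.
        _ => PathBuf::from(home?).join(".cache"),
    };
    // "finalfusion" as the subdirectory name is an assumption.
    Some(base.join("finalfusion"))
}
```

With something like this, `open` would reduce to: look up the URL, download to `cache_path(id)` unless `is_cached(id)` is true, then read the file from that path.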

After this is implemented, the functionality could also be exposed in finalfusion-python.

danieldk avatar Sep 09 '19 10:09 danieldk

Sounds like a very convenient feature. Some feature to search for embeddings or to get a list of available files would also be nice to have.
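Listing and searching could be thin helpers over the identifier-to-URL map. A minimal sketch, assuming the metadata ends up as a `HashMap<String, String>`; the function names `available` and `search` are hypothetical, and case-insensitive substring matching is just one possible search behaviour:

```rust
use std::collections::HashMap;

/// All known embedding identifiers, sorted for stable output.
pub fn available(metadata: &HashMap<String, String>) -> Vec<&str> {
    let mut ids: Vec<&str> = metadata.keys().map(String::as_str).collect();
    ids.sort_unstable();
    ids
}

/// Identifiers that contain `query`, compared case-insensitively.
pub fn search<'a>(metadata: &'a HashMap<String, String>, query: &str) -> Vec<&'a str> {
    let query = query.to_lowercase();
    available(metadata)
        .into_iter()
        .filter(|id| id.to_lowercase().contains(&query))
        .collect()
}
```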

sebpuetz avatar Sep 10 '19 16:09 sebpuetz