How to load large embedding efficiently?

Open matthew-z opened this issue 4 years ago • 1 comments

Describe the Question

I tried to load the 840B-token, 300-dimensional GloVe embedding using mz.embedding.load_from_file. However, it uses more than 60 GB of memory, which looks abnormal.

from pathlib import Path
import matchzoo as mz


_glove_6B_embedding_url = "http://nlp.stanford.edu/data/glove.6B.zip"
_glove_840B_embedding_url = "http://nlp.stanford.edu/data/glove.840B.300d.zip"


def load_glove_embedding(dimension: int = 50, size: str = "6B") -> mz.embedding.Embedding:
    """
    Return the pretrained glove embedding.

    :param dimension: the size of embedding dimension, the value can only be
        50, 100, or 300.
    :param size: the GloVe corpus size, either "6B" or "840B".
    :return: The :class:`mz.embedding.Embedding` object.
    """
    file_name = 'glove.{}.{}d.txt'.format(size, dimension)
    file_path = (Path(mz.USER_DATA_DIR) / 'glove').joinpath(file_name)

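    # Download and extract the archive only if the text file is not cached yet.
    if not file_path.exists():
        if size == "6B":
            url = _glove_6B_embedding_url
        elif size == "840B":
            url = _glove_840B_embedding_url
        else:
            # size is a string, so format it with %s (%d would raise TypeError).
            raise ValueError("Incorrect Size for GloVe: %s" % size)

        mz.utils.get_file('glove_embedding',
                          url,
                          extract=True,
                          cache_dir=mz.USER_DATA_DIR,
                          cache_subdir='glove')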

    return mz.embedding.load_from_file(file_path=str(file_path), mode='glove')

embedding = load_glove_embedding(300, "840B")

Describe your attempts

The TensorFlow version of MatchZoo uses pandas to read the GloVe file and requires much less memory.
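
For comparison, here is a minimal sketch of a lower-memory loader that does not depend on MatchZoo at all. It is not how mz.embedding.load_from_file works internally; the function name load_glove_compact and its vocab parameter are hypothetical. The idea is to stream the file line by line and append the values to one flat float32 buffer instead of keeping a Python list of floats per term. With about 2.2M terms and 300 dimensions, the raw float32 values come to roughly 2.6 GB, plus the term dictionary. The split is done from the right because some tokens in the 840B file appear to contain spaces.

```python
from array import array
from typing import Dict, Optional, Set, Tuple


def load_glove_compact(
    file_path: str,
    dimension: int,
    vocab: Optional[Set[str]] = None,
) -> Tuple[Dict[str, int], array]:
    """Stream a GloVe text file into one flat float32 buffer.

    Hypothetical helper, not part of MatchZoo.
    Returns (term_to_row, vectors); row i occupies
    vectors[i * dimension:(i + 1) * dimension].
    """
    term_to_row: Dict[str, int] = {}
    vectors = array("f")  # 4 bytes per value, no per-float Python objects
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            # Split from the right so a term containing spaces stays intact.
            parts = line.rstrip("\n").rsplit(" ", dimension)
            if len(parts) != dimension + 1:
                continue  # skip malformed or short lines
            term = parts[0]
            if vocab is not None and term not in vocab:
                continue  # keep only terms the task actually needs
            if term in term_to_row:
                continue  # ignore duplicate terms, keep the first vector
            term_to_row[term] = len(term_to_row)
            vectors.extend(float(x) for x in parts[1:])
    return term_to_row, vectors
```

Passing the task vocabulary as vocab reduces memory further, since only the matching rows are stored. The flat buffer can later be wrapped as a matrix, for example with numpy.frombuffer(vectors, dtype="float32").reshape(-1, dimension), if NumPy is available.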

matthew-z avatar Nov 22 '19 11:11 matthew-z

Thanks for your feedback. We will fix it soon.

Chriskuei avatar Nov 22 '19 13:11 Chriskuei