LogADEmpirical

No file existence assurance

Open GilPasi opened this issue 1 year ago • 1 comments

The generate_embedding.py file's first operations are loading the models and defining the stop words:

    word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('./crawl-300d-2M.vec', binary=False)
    stop_words = set(stopwords.words('english'))

Both of these are very time-consuming; together they took me almost 20 minutes on Google Colab. As a result, if an error occurs afterwards, the wait was in vain. For example, the run may crash at line 126 because a file does not exist:

    template_df = pd.read_csv(f'./{dataset}/{dataset}.log_templates.csv')

Is it possible to add a mechanism that ensures the required files exist before the 'heavy' operations run, to save newcomers some time? Simply adding something like

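    from pathlib import Path

    # Same file that line 126 later reads with pd.read_csv
    template_file = Path(f"./{dataset}/{dataset}.log_templates.csv")
    # exists is a method and must be called; a bare `template_file.exists` is always truthy
    if not template_file.exists():
        raise FileNotFoundError(f"Template file does not exist: {template_file}")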

at the top of the file would be very useful.
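To check several inputs at once, the same idea can be wrapped in a small helper. This is only a sketch: `require_files` and `required_inputs` are hypothetical names that are not in the repository, and the list of paths is assumed from the two snippets quoted above, so the script may need more entries.

```python
from pathlib import Path


def require_files(*paths):
    """Raise FileNotFoundError naming every path that is not an existing file."""
    missing = [str(p) for p in map(Path, paths) if not p.is_file()]
    if missing:
        raise FileNotFoundError("Missing required file(s): " + ", ".join(missing))


def required_inputs(dataset):
    """Assumed inputs, taken from the snippets quoted in this issue."""
    return [
        Path("./crawl-300d-2M.vec"),
        Path(f"./{dataset}/{dataset}.log_templates.csv"),
    ]
```

Calling `require_files(*required_inputs(dataset))` before the word2vec model is loaded would make the script fail within milliseconds and report all missing files in one message, rather than stopping at the first one after the long load.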

GilPasi, Oct 30 '24 13:10

I suggest converting the .vec file to binary format once, saving it, and then loading the binary file on later runs. This greatly improves loading speed: on my machine, loading time dropped from 500s to 25s.
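The comment does not include code, so the following is only a sketch of one way to do the conversion with gensim's `load_word2vec_format` and `save_word2vec_format`. The helper names and the `.bin` cache location are assumptions, and the speedup will vary by machine.

```python
from pathlib import Path


def binary_cache_path(vec_path):
    """Return the assumed location of the binary cache next to the text file."""
    return Path(vec_path).with_suffix(".bin")


def load_vectors_cached(vec_path):
    """Load word vectors, converting the text file to binary on first use."""
    # Imported here so the path helper above works without gensim installed.
    from gensim.models import KeyedVectors

    bin_path = binary_cache_path(vec_path)
    if bin_path.is_file():
        # Later runs: read the binary cache instead of parsing the text file.
        return KeyedVectors.load_word2vec_format(str(bin_path), binary=True)

    # First run only: parse the text file, then write the binary cache.
    vectors = KeyedVectors.load_word2vec_format(str(vec_path), binary=False)
    vectors.save_word2vec_format(str(bin_path), binary=True)
    return vectors
```

With this in place, `word2vec_model = load_vectors_cached('./crawl-300d-2M.vec')` would replace the original load call. The first run still pays the full parsing cost, and the cache needs extra disk space.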

hopeyl, Apr 18 '25 04:04