No file existence assurance
The generate_embedding.py file's first operations are loading the models and defining the stop words:
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('./crawl-300d-2M.vec', binary=False)
stop_words = set(stopwords.words('english'))
Both are very time-consuming; it took me almost 20 minutes on Google Colab. As a result, if an error occurs afterwards, the wait was in vain. For example, the run may crash at line 126 because a file does not exist:
template_df = pd.read_csv(f'./{dataset}/{dataset}.log_templates.csv')
Would it be possible to add a mechanism that verifies the required files exist before the 'heavy' operations, to save newcomers some time? Simply adding something like
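from pathlib import Path
template_file = Path(f"./{dataset}/{dataset}.log_templates.csv")
# exists is a method and must be called; the bare attribute is always truthy
if not template_file.exists():
    raise FileNotFoundError(f"Template file does not exist: {template_file}")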
at the top of the file would be very useful.
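A slightly more general form of this check could cover every input the script needs and report all missing files at once. The sketch below is only an illustration: the helper names `find_missing` and `require_files` are hypothetical and not part of the repository, and the list of paths to check is an assumption based on the two files mentioned above.

```python
from pathlib import Path


def find_missing(paths):
    """Return the paths (as strings) that are not existing regular files."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


def require_files(paths):
    """Raise FileNotFoundError naming every missing file; do nothing otherwise."""
    missing = find_missing(paths)
    if missing:
        raise FileNotFoundError("Missing required files: " + ", ".join(missing))


# Hypothetical usage at the top of generate_embedding.py, before any model loading:
# require_files([
#     "./crawl-300d-2M.vec",
#     f"./{dataset}/{dataset}.log_templates.csv",
# ])
```

Reporting all missing files in one error avoids fixing one path, waiting for the next failure, and repeating the cycle.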
I suggest converting the .vec file to binary once, saving it, and reading the binary file on later runs. This greatly improves loading speed: on my machine, from 500 s to 25 s.
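One way this conversion might look with gensim is sketched below. The helper names `cached_binary_path` and `load_vectors` are hypothetical, and storing the binary copy next to the .vec file with a .bin suffix is an assumption. The sketch uses gensim's `save_word2vec_format(..., binary=True)`; the timings quoted above were not measured with this exact code.

```python
from pathlib import Path


def cached_binary_path(vec_path):
    """Return the assumed location of the binary copy: same name, .bin suffix."""
    return Path(vec_path).with_suffix(".bin")


def load_vectors(vec_path):
    """Load word vectors, converting the text .vec file to binary on first use."""
    # Imported here so the slow gensim import happens only when vectors are loaded.
    from gensim.models import KeyedVectors

    bin_path = cached_binary_path(vec_path)
    if bin_path.exists():
        # Fast path: read the previously saved binary file.
        return KeyedVectors.load_word2vec_format(str(bin_path), binary=True)

    # Slow path (first run only): parse the text file, then save a binary copy.
    model = KeyedVectors.load_word2vec_format(str(vec_path), binary=False)
    model.save_word2vec_format(str(bin_path), binary=True)
    return model


# Hypothetical usage in generate_embedding.py:
# word2vec_model = load_vectors("./crawl-300d-2M.vec")
```

The first run still pays the full cost of parsing the text file; only later runs benefit from the binary copy.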