cross_domain_embedding
Code for learning cross-domain word embeddings (EMNLP 2017).
To run cre.c, you first need to compile it with a C compiler.
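For example (the exact flags are an assumption; word2vec-style C code typically needs the math and pthread libraries):
gcc cre.c -o cre -lm -pthread -O3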
Then:
1). Learn word embeddings from the source domain using the word2vec toolkit (https://code.google.com/archive/p/word2vec/). Use the -binary 1 argument so the output is saved in binary format (an example command is given after the sample below).
2). Generate a similarity score file, with one line per word in the '$word$ $score$' format (a format example is given after the sample below). Each word should appear in both the source and the target domain, and the score must be a positive real number. The score file used in the paper was generated from word frequencies in the corpora of the two domains.
3). Run the cre program. Note that the -size parameter (the dimension of the embeddings) must match the dimension of the embeddings learned in step 1).
Sample:
./cre -train <target_corpus_file.txt> -model <binary_embedding.bin> -similarity <similarity_score.txt> -output <output_name.bin> -size 50 -window 5 -binary 1 -lambda 10 -threads 20
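For step 1), a word2vec command along the following lines (file names are placeholders) produces 50-dimensional binary embeddings that match the -size 50 setting in the sample above:
./word2vec -train <source_corpus_file.txt> -output <binary_embedding.bin> -size 50 -window 5 -binary 1
For step 2), assuming '$word$' and '$score$' stand for a word and its score separated by a space, the similarity score file would look like the following (the words and scores here are purely illustrative):
good 1.73
battery 0.25
service 0.91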
Please cite the following paper:
@InProceedings{yang-lu-zheng:2017:EMNLP2017,
  author    = {Yang, Wei and Lu, Wei and Zheng, Vincent},
  title     = {A Simple Regularization-based Algorithm for Learning Cross-Domain Word Embeddings},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  pages     = {2888--2894},
  url       = {https://www.aclweb.org/anthology/D17-1311}
}
Contact [email protected] if you have further questions about the code.