
Output hash file size surprisingly small when mining Wikipedia to train our model

shubhamagarwal92 opened this issue 5 years ago

Thank you for providing the code.

We were trying to mine Wikipedia with this shell script for our entity linker, using the 2018/05/01 dump. We were able to generate the hash file, but surprisingly it is only 284 MB. In contrast, the pre-trained English hash you provide, trained on the November 2015 Wikipedia dump, is 1.3 GB.

@aasish, could you suggest what might be going wrong? Is the difference due to compression, or are we missing some entities? Also, is there a way to combine the two hash files so that we can take the more recent entities into account?
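One quick sanity check we can run (a minimal shell sketch; the filenames below are placeholders, not the actual FEL output names) is to see whether either file is compressed, since that alone could explain part of the size gap:

```sh
# Placeholder filenames -- substitute the actual hash paths.
OURS=en-wiki-20180501.hash
PRETRAINED=english-nov2015.hash

# 'file' reports the container format, e.g. "gzip compressed data".
file "$OURS" "$PRETRAINED"

# Compare on-disk sizes.
du -h "$OURS" "$PRETRAINED"

# If a file turns out to be gzip-compressed, compare uncompressed sizes
# instead. (gzip -l only works on gzip files and fails otherwise.)
gzip -l "$OURS" 2>/dev/null || echo "$OURS is not gzip-compressed"
```

If both files are uncompressed, the gap would instead point to a difference in how many entities made it into the hash.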

shubhamagarwal92 · Jul 18 '18 13:07