brown-cluster
Is there any limit for the vocab size (#types)?
The code fails (with a "core dump: segmentation fault" message) when I run it on a huge text file (about 20M types, 14GB file size). I have already used wcluster on files with far fewer types and it worked pretty well.
Is there any limit for the vocabulary size (#types)?
I'm not sure what the exact limit is, but I'm not surprised that it failed with 20M types. You can try the restrict command-line option to limit it to a smaller vocabulary. The Brown clustering algorithm dates back to a time when people didn't have 14GB text files to work with.
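If the restrict option doesn't fit your setup, one way to shrink the vocabulary is to preprocess the corpus yourself. Below is a minimal sketch, assuming a plain whitespace-tokenized corpus; the file names and the MIN_COUNT cutoff are placeholders, so adjust them for your data. It maps rare types to a single <unk> token before handing the file to wcluster.

```python
# Minimal preprocessing sketch: map rare word types to <unk> so wcluster
# sees a much smaller vocabulary. File names and the cutoff are placeholders.
from collections import Counter

MIN_COUNT = 10  # hypothetical cutoff; tune for your corpus
SRC, DST = "input.txt", "input.restricted.txt"

# First pass: count how often each type occurs.
counts = Counter()
with open(SRC, encoding="utf-8", errors="replace") as f:
    for line in f:
        counts.update(line.split())

keep = {w for w, c in counts.items() if c >= MIN_COUNT}
print(f"keeping {len(keep)} of {len(counts)} types")

# Second pass: rewrite the corpus, replacing rare types with <unk>.
with open(SRC, encoding="utf-8", errors="replace") as fin, \
     open(DST, "w", encoding="utf-8") as fout:
    for line in fin:
        tokens = [w if w in keep else "<unk>" for w in line.split()]
        fout.write(" ".join(tokens) + "\n")
```

You can then run wcluster on the filtered file as usual; it will only ever see the kept types plus <unk>.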
I have noticed that a new commit was made at the end of March. It is labeled "Enable >= 2^31 tokens in input data", so I thought it would have addressed the issue raised here. However, I still ran into a problem similar to the one rasoolims reported: I can successfully run the code only on a file containing 10M tokens (700K types). With bigger files it fails with "core dump: segmentation fault". Any suggestions?
Thanks
Did you try using the flag to restrict the vocabulary?
Do you mean the min-occur flag? It seems to have an impact only on efficiency.
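For anyone else debugging this, it may help to check how many types a given occurrence cutoff would actually leave before re-running wcluster. A quick sketch (the file name and cutoffs are placeholders, not anything wcluster-specific):

```python
# Count tokens/types and see how many types survive various min-count cutoffs.
from collections import Counter

counts = Counter()
with open("input.txt", encoding="utf-8", errors="replace") as f:
    for line in f:
        counts.update(line.split())

print(f"{sum(counts.values())} tokens, {len(counts)} types")
for cutoff in (2, 5, 10, 50):
    kept = sum(1 for c in counts.values() if c >= cutoff)
    print(f"min count {cutoff}: {kept} types remain")
```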
I know this is late and probably no longer relevant to the OP, but for anyone else facing the same issue, this PR fixed it for me.