
Is there any limit for the vocab size (#types)?

Open rasoolims opened this issue 8 years ago • 5 comments

The code fails with a segmentation fault (core dumped) when I run it on a huge text file (about 20M types, 14 GB). I have already used wcluster on files with far fewer types and it worked well.

Is there any limit for the vocabulary size (#types)?

rasoolims avatar Mar 04 '16 05:03 rasoolims

I'm not sure what the exact limit is, but I'm not surprised that it failed with 20M types. You can try the restrict command-line option to limit it to a smaller vocabulary. The Brown clustering algorithm dates back to a time when people didn't have 14 GB text files to work with.


ajaech avatar Mar 04 '16 17:03 ajaech
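
For anyone hitting this later, here is a minimal sketch of what a restricted run might look like, assuming the usual `./wcluster --text ... --c ...` invocation from the README plus the `--min-occur` cutoff discussed further down in this thread. The input filename and cluster count are placeholders, and the exact flag names should be confirmed with `./wcluster --help` for your build.

```bash
# Hypothetical invocation; big_corpus.txt and the cluster count are placeholders.
# --min-occur drops rare word types before clustering, which is the simplest way
# to shrink a 20M-type vocabulary to something tractable.
./wcluster --text big_corpus.txt --c 1000 --min-occur 10
```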

I noticed that a new commit was made at the end of March, labeled "Enable >= 2^31 tokens in input data", so I thought it would have addressed the issue raised here. However, I still run into a problem similar to the one rasoolims mentioned: I can successfully run the code only on a file containing 10M tokens (700K types). With bigger files it fails with a segmentation fault (core dumped). Any suggestions?

thanks

lavelli avatar Jul 14 '16 08:07 lavelli
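
Since the commit in question is specifically about corpora with 2^31 or more tokens, it can help to know how many tokens and types a corpus actually contains before reporting a similar failure. A rough sketch, assuming whitespace-tokenized input (the filenames are placeholders):

```bash
# Total token count; anything above 2^31 (~2.1 billion) needs 64-bit counters.
wc -w corpus.txt

# Per-type frequency table (splits on spaces only; adjust for tabs if needed),
# then count the number of distinct types.
tr -s ' ' '\n' < corpus.txt | sort | uniq -c | sort -rn > vocab_counts.txt
wc -l vocab_counts.txt
```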

Did you try using the flag to restrict the vocabulary?


ajaech avatar Jul 14 '16 15:07 ajaech

Do you mean the min-occur flag? It seems to affect only efficiency.

lavelli avatar Jul 15 '16 08:07 lavelli
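
For what it's worth, if `--min-occur` behaves like a simple frequency cutoff on word types, it should shrink the vocabulary itself rather than only speed things up. A quick way to preview how many types would survive a given threshold, reusing the `vocab_counts.txt` from the earlier sketch (the cutoff of 10 is just an example):

```bash
# Count the types whose frequency is at least the hypothetical cutoff of 10.
awk '$1 >= 10' vocab_counts.txt | wc -l
```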

I know this is late and probably no longer relevant to the OP, but for anyone else facing the same issue, this PR fixed it for me.

jndevanshu avatar Jul 20 '18 19:07 jndevanshu