
Tokenizer is not serializable for Apache Spark

Open lamrongol opened this issue 9 years ago • 7 comments

On Apache Spark, instances must be serializable for parallel processing, but kuromoji tokenizers are not, so they must be initialized each time. If tokenizers were serializable, we could reduce processing time.

lamrongol avatar Nov 02 '15 12:11 lamrongol

Thanks a lot Fujikawa-san.

Instantiating Kuromoji takes a bit of time since it reads fairly large dictionaries into memory. Could you clarify how making the tokenizers serializable would help with this in the context of Spark?

I don't know the detailed mechanisms, and I'd appreciate it if you could explain. Thanks!

cmoen avatar Nov 02 '15 13:11 cmoen

Spark serializes the whole class at the beginning and then each machine processes it in parallel. Therefore, if the class contains an unserializable instance, Spark throws an error, and you must initialize the instance each time, as described at the following link: http://www.intellilink.co.jp/article/column/bigdata-kk01.html
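
A minimal sketch of one common way to live with a non-serializable object, shown in plain Java rather than Spark itself: keep the object in a `transient` field so Java serialization skips it, and rebuild it lazily on first use after deserialization. `LazyTokenizerHolder` and `HeavyTokenizer` are hypothetical names used only for illustration; they are not Kuromoji or Spark classes, and a real job would construct the actual Kuromoji tokenizer inside `get()`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class LazyTokenizerHolder implements Serializable {
    private static final long serialVersionUID = 1L;

    /** Hypothetical stand-in for an expensive tokenizer that is NOT Serializable. */
    public static final class HeavyTokenizer {
        public String[] tokenize(String text) {
            return text.split("\\s+");
        }
    }

    // transient: Java serialization skips this field, so it is null after deserialization.
    private transient HeavyTokenizer tokenizer;

    /** Builds the tokenizer on first use and reuses it afterwards. */
    public HeavyTokenizer get() {
        if (tokenizer == null) {
            tokenizer = new HeavyTokenizer();
        }
        return tokenizer;
    }

    /** Serializes and deserializes the holder, roughly what happens when an object is shipped to a worker. */
    public static LazyTokenizerHolder roundTrip(LazyTokenizerHolder holder) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(sink)) {
                out.writeObject(holder);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(sink.toByteArray()))) {
                return (LazyTokenizerHolder) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this shape the holder itself serializes without error, and each deserialized copy pays the initialization cost once rather than once per record. It does not avoid loading the dictionaries on every machine.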

lamrongol avatar Nov 03 '15 00:11 lamrongol

I've tried to make the kuromoji-core classes Serializable, but I have not been able to serialize Tokenizer because java.nio.HeapByteBuffer is not serializable. This work may take a lot of effort.
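
For illustration only, here is a sketch of the standard Java technique for a class that holds a `ByteBuffer`: mark the field `transient` and copy the bytes by hand in private `writeObject`/`readObject` methods. `SerializableBufferHolder` is a hypothetical class, not part of kuromoji-core, and the sketch assumes a non-null buffer and that position, limit, and byte order do not need to be preserved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;

public class SerializableBufferHolder implements Serializable {
    private static final long serialVersionUID = 1L;

    // ByteBuffer is not Serializable, so default serialization must skip it.
    private transient ByteBuffer buffer;

    public SerializableBufferHolder(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    public ByteBuffer buffer() {
        return buffer;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        // Work on a duplicate so the original buffer's position and limit are untouched.
        ByteBuffer view = buffer.duplicate();
        view.clear(); // on the duplicate only: position = 0, limit = capacity
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        buffer = ByteBuffer.wrap(bytes); // always a heap buffer after deserialization
    }

    /** Wraps the data, serializes and deserializes the holder, and returns the restored bytes. */
    public static byte[] roundTripBytes(byte[] data) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(sink)) {
                out.writeObject(new SerializableBufferHolder(ByteBuffer.wrap(data)));
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(sink.toByteArray()))) {
                SerializableBufferHolder copy = (SerializableBufferHolder) in.readObject();
                return copy.buffer().array();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Serializing large dictionary buffers this way copies every byte into the serialized form, so whether it would actually be faster than reloading the dictionaries is an open question.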

lamrongol avatar Nov 03 '15 13:11 lamrongol

These are the changes I made (sorry, some unnecessary whitespace diffs are included): https://github.com/lamrongol/kuromoji/commit/415e0fbc242d891e0708aaeacbb7a18ed478fee9

I made them using my tool: https://github.com/lamrongol/MakeJavaClassSerializable

lamrongol avatar Nov 04 '15 02:11 lamrongol

I was looking into the "Tuning Spark" document for Spark 1.2.0, and there is a section mentioning that storing data in serialized form helps reduce memory usage on Spark. Perhaps Fujikawa-san is trying to do something similar?

Interestingly, there is also a downside to this:

The only downside of storing data in serialized form is slower access times, due to having to deserialize each object on the fly.

akkikiki avatar Nov 04 '15 04:11 akkikiki

@akkikiki If an object is not serializable, Spark doesn't work at all. https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html

By the way, I think everyone writing here understands Japanese, so wouldn't it be fine to write in Japanese?

lamrongol avatar Nov 04 '15 05:11 lamrongol

Sorry, I'm not familiar with Kuromoji, but I think it reads the dictionary files while processing, which does not suit Serializable. If Kuromoji had a new mode that holds all data in memory, I think it could become Serializable.

lamrongol avatar Nov 05 '15 14:11 lamrongol