kuromoji
Tokenizer is not serializable for Apache Spark
On Apache Spark, instances must be serializable for parallel processing, but kuromoji tokenizers are not, so they must be re-initialized each time. If tokenizers were serializable, we could decrease processing time.
Thanks a lot Fujikawa-san.
Instantiating Kuromoji takes a bit of time since it reads fairly large dictionaries into memory. Could you clarify how making the tokenizers serializable would help with this in the context of Spark?
I just don't know the detailed mechanisms and I'd appreciate it if you could explain. Thanks!
Spark serializes the whole class at the beginning and then each machine processes it in parallel. Therefore, if the class contains an unserializable instance, Spark throws an error, and you must initialize the tokenizer each time, as in the following link: http://www.intellilink.co.jp/article/column/bigdata-kk01.html
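To make the failure mode concrete, here is a minimal plain-Java sketch with no Spark or kuromoji dependency. `FakeTokenizer`, `EagerTask`, `LazyTask`, and `SerializationDemo` are hypothetical names invented for illustration; `FakeTokenizer` only stands in for a heavy, non-serializable object. The sketch assumes the task object is shipped with standard Java serialization. It shows that a class holding such an object directly cannot be serialized, while a `transient` field that is rebuilt lazily survives the round trip. This is one common workaround, not necessarily what the linked article does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a heavy object that does not implement Serializable.
class FakeTokenizer {
    String[] tokenize(String text) {
        return text.split(" ");
    }
}

// Holds the tokenizer in an ordinary field, so Java serialization fails.
class EagerTask implements Serializable {
    private final FakeTokenizer tokenizer = new FakeTokenizer();

    int countTokens(String text) {
        return tokenizer.tokenize(text).length;
    }
}

// Skips the field during serialization and rebuilds it on first use.
class LazyTask implements Serializable {
    private transient FakeTokenizer tokenizer; // null after deserialization

    private FakeTokenizer tokenizer() {
        if (tokenizer == null) {
            tokenizer = new FakeTokenizer(); // created once per deserialized copy
        }
        return tokenizer;
    }

    int countTokens(String text) {
        return tokenizer().tokenize(text).length;
    }
}

class SerializationDemo {
    static byte[] serialize(Object o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns false only when serialization fails with NotSerializableException.
    static boolean isSerializable(Object o) {
        try {
            serialize(o);
            return true;
        } catch (UncheckedIOException e) {
            if (e.getCause() instanceof NotSerializableException) {
                return false;
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        System.out.println("EagerTask serializable: " + isSerializable(new EagerTask()));
        LazyTask copy = (LazyTask) deserialize(serialize(new LazyTask()));
        System.out.println("Tokens counted by deserialized LazyTask: " + copy.countTokens("a b c"));
    }
}
```

With this pattern each deserialized copy still pays the initialization cost once, so it avoids the error but does not by itself remove the dictionary loading time.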
I've tried to make the kuromoji-core classes Serializable, but I have not been able to serialize Tokenizer because java.nio.HeapByteBuffer is not serializable. This work may take a lot of effort.
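For reference, one general Java technique for a class with a `ByteBuffer` field is to mark the field `transient` and copy its bytes by hand in `writeObject`/`readObject`. The sketch below is an assumption about how that could look, not kuromoji code: `BufferBackedData` and `BufferRoundTrip` are hypothetical names, and the sketch assumes a heap buffer whose whole capacity should be preserved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;

// Hypothetical class that keeps its data in a ByteBuffer, which is not Serializable.
class BufferBackedData implements Serializable {
    private static final long serialVersionUID = 1L;

    // transient: default serialization skips this field.
    private transient ByteBuffer buffer;

    BufferBackedData(byte[] content) {
        this.buffer = ByteBuffer.wrap(content.clone());
    }

    byte get(int index) {
        return buffer.get(index);
    }

    int size() {
        return buffer.capacity();
    }

    // Called by Java serialization; writes the buffer content as a length-prefixed byte array.
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        ByteBuffer view = buffer.duplicate(); // leaves the original position untouched
        view.clear();                         // position = 0, limit = capacity
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Called by Java deserialization; rebuilds the buffer from the stored bytes.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        buffer = ByteBuffer.wrap(bytes);
    }
}

class BufferRoundTrip {
    // Serializes and deserializes the object in memory and returns the copy.
    static BufferBackedData copy(BufferBackedData original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (BufferBackedData) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        BufferBackedData copy = copy(new BufferBackedData(new byte[] {1, 2, 3}));
        System.out.println("size=" + copy.size() + ", last=" + copy.get(2));
    }
}
```

Applying this to every buffer-backed class would be a fair amount of work, and the serialized form would be roughly as large as the in-memory data, so whether it is faster than reloading would need to be measured.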
These are the changes I made (sorry, unnecessary whitespace diffs are included): https://github.com/lamrongol/kuromoji/commit/415e0fbc242d891e0708aaeacbb7a18ed478fee9
I made them using my tool: https://github.com/lamrongol/MakeJavaClassSerializable
I was looking into the "Tuning Spark" document for Spark 1.2.0, and there is a section mentioning that storing data in serialized form helps reduce memory usage on Spark. Perhaps Fujikawa-san is trying to do something similar?
It is interesting that there is also a downside to this:
The only downside of storing data in serialized form is slower access times, due to having to deserialize each object on the fly.
@akkikiki If the instance is not serializable, Spark doesn't work. https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
By the way, I think everyone writing here understands Japanese, so wouldn't it be fine to write in Japanese?
Sorry, I'm not familiar with Kuromoji, but I think Kuromoji reads the dictionary file during processing, which makes it a poor fit for Serializable. If Kuromoji had a new mode that keeps all data in memory, I think it could become Serializable.