spark-lucenerdd
Typesafe config is generating the error UTFDataFormatException: encoded string too long
I noticed that we are using Typesafe Config, and it seems to be introducing serialization issues: jobs are failing with the following exception:
Caused by: java.io.UTFDataFormatException: encoded string too long: 72887 bytes
The issue is hard to replicate, and all I can provide at the moment are the stack traces. I will update the issue if I find a way to replicate it.
Do you have any recommendations for dealing with this issue?
Similar issue: https://stackoverflow.com/questions/41505599/task-not-serializable-in-spark-caused-by-utfdataformatexception-encoded-string
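For context, the byte count in the exception is the giveaway: `java.io.DataOutputStream.writeUTF` stores the encoded string length in a 2-byte prefix, so any string whose modified-UTF-8 encoding exceeds 65,535 bytes fails with exactly this `UTFDataFormatException`, and 72,887 bytes is over that limit. A minimal sketch reproducing the error outside Spark (the demo object name is mine):

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream, UTFDataFormatException}

object WriteUtfLimitDemo {
  def main(args: Array[String]): Unit = {
    val out = new DataOutputStream(new ByteArrayOutputStream())
    // writeUTF caps encoded strings at 65535 bytes because of its
    // 2-byte length prefix; 72887 ASCII chars exceed that limit.
    val tooLong = "a" * 72887
    try {
      out.writeUTF(tooLong)
    } catch {
      case e: UTFDataFormatException =>
        println(e.getMessage) // reports the oversized byte count
    }
  }
}
```

This suggests some task closure is dragging a very large string (e.g. a rendered config) through Java serialization.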
This looks like a weird issue.
AFAIR, the typesafe configs for LuceneRDD do not need to be serializable. If you use the typesafe config in your application, make sure you use it within an object so that it is available to both the driver and the executors.
You can extend this trait https://github.com/zouzias/spark-lucenerdd/blob/master/src/main/scala/org/zouzias/spark/lucenerdd/config/Configurable.scala#L24 to get it working.
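A sketch of the pattern being suggested, under the assumption that the linked `Configurable` trait exposes the loaded Typesafe config (the trait body and the config key below are illustrative, not copied from the repo):

```scala
import com.typesafe.config.{Config, ConfigFactory}

// Illustrative version of the Configurable pattern: the config is
// loaded lazily inside each JVM rather than shipped in a closure.
trait Configurable extends Serializable {
  // @transient + lazy: the Config object is never serialized;
  // the driver and each executor reload it on first access.
  @transient lazy val config: Config = ConfigFactory.load()
}

// Mixing the trait into an object keeps initialization local to
// whichever JVM touches it, so nothing Config-related crosses the wire.
object MyLuceneSettings extends Configurable {
  def storeMode: String =
    if (config.hasPath("lucenerdd.index.store.mode")) // key is illustrative
      config.getString("lucenerdd.index.store.mode")
    else "disk"
}
```

The key point is that an `object` is initialized independently on every JVM, so the `Config` instance never needs to survive Java serialization.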
It really does.
I am not using typesafe configs in my own application; the exception is coming from LuceneRDD itself.
I did another build with all references to it removed from LuceneRDD, and it works fine. But I obviously lose the ability to add dynamic configuration, so that's not a good solution.
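One possible middle ground that keeps dynamic configuration without serializing anything Config-related: read the values you need from the Typesafe config on the driver, and let the closure capture only plain strings or primitives. A hedged sketch (the object, method, and config key here are mine, not part of LuceneRDD):

```scala
import com.typesafe.config.ConfigFactory
import org.apache.spark.rdd.RDD

object ConfigSafeUsage {
  def tagDocs(rdd: RDD[String]): RDD[String] = {
    // Load and read the config on the driver only.
    val config = ConfigFactory.load()
    val analyzerName: String =                        // key is illustrative
      if (config.hasPath("lucenerdd.analyzer.name"))
        config.getString("lucenerdd.analyzer.name")
      else "standard"
    // Only the extracted String is captured by the closure below,
    // so Java serialization never sees the Config object itself.
    rdd.map(doc => s"[$analyzerName] $doc")
  }
}
```

Whether this helps here depends on where exactly LuceneRDD touches the config inside task closures, which the stack traces should reveal.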