spark-lucenerdd
Allow multithreaded indexing
Lucene's IndexWriter is thread-safe, so indexing should be multithreaded per executor.
Is this still the case? If so, is there any chance you could point me in the right direction to try to implement this?
I made an attempt in the past, but I didn't see any improvement.
If you want to try it out, you need to change the following lines of code
iterIndex.foreach { case elem =>
  // (implicitly) convert type T to Lucene document
  val doc = docConversion(elem)
  indexWriter.addDocument(FacetsConfig.build(taxoWriter, doc))
}
(quoted from https://github.com/zouzias/spark-lucenerdd/blob/master/src/main/scala/org/zouzias/spark/lucenerdd/partition/LuceneRDDPartition.scala#L69)
to be executed in parallel. Feel free to open a PR if you manage to make any improvements.
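As a rough illustration of what "executed in parallel" could look like, here is a minimal sketch that drains a partition iterator from several threads. `ParallelIndexing` and `indexInParallel` are hypothetical names, not part of spark-lucenerdd, and the sketch assumes the `addDoc` callback is safe to call concurrently. It uses only the Scala and Java standard libraries, so the Lucene call is passed in as a function.

```scala
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicLong

// Hypothetical helper, not part of spark-lucenerdd.
object ParallelIndexing {

  /** Drains `elems` from `numThreads` worker threads, calling `addDoc` on each element.
    * Assumption: `addDoc` is thread-safe. Returns the number of elements processed.
    */
  def indexInParallel[T](elems: Iterator[T], numThreads: Int)(addDoc: T => Unit): Long = {
    val pool = Executors.newFixedThreadPool(numThreads)
    val indexed = new AtomicLong(0L)

    // Scala iterators are not thread-safe, so hasNext/next are guarded by a lock.
    def nextElem(): Option[T] = elems.synchronized {
      if (elems.hasNext) Some(elems.next()) else None
    }

    val worker: Runnable = new Runnable {
      override def run(): Unit = {
        var current = nextElem()
        while (current.isDefined) {
          addDoc(current.get)
          indexed.incrementAndGet()
          current = nextElem()
        }
      }
    }

    val futures = (1 to numThreads).map(_ => pool.submit(worker))
    try futures.foreach(_.get()) // a worker failure surfaces here as an ExecutionException
    finally pool.shutdown()
    indexed.get()
  }
}
```

In the partition code quoted above, the callback would presumably be something like `elem => indexWriter.addDocument(FacetsConfig.build(taxoWriter, docConversion(elem)))`, assuming the conversion and the taxonomy writer can also be used from multiple threads. Note that the lock around the iterator serializes reads, so any gain depends on document conversion and indexing dominating the cost.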
I made an attempt in the past, but I didn't see any improvement.
Did you use any script to monitor the performance of this? Or how did you benchmark it?
I measured the time to index Wikipedia and, to my surprise, multithreading didn't help at all. It even made things worse.
Unfortunately, I don't have any numbers to share. I can try to find the DataFrame of Wikipedia articles if you are interested in taking a second look at it.
Yes please
I have the data on AWS S3. If you have an account, send me your AWS account ID and I will give you access.
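For anyone repeating the timing comparison discussed above, a small wall-clock harness along these lines may be enough to compare the sequential and multithreaded variants. `IndexingBenchmark`, `timed` and `benchmark` are hypothetical names for illustration, not something the thread's authors used, and this is only a coarse measurement, not a rigorous benchmark.

```scala
// Hypothetical helper for coarse wall-clock comparisons.
object IndexingBenchmark {

  /** Runs `body` once and returns its result together with the elapsed milliseconds. */
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e6)
  }

  /** Runs `body` `warmups` times untimed (to let the JIT settle),
    * then `runs` more times, returning the elapsed milliseconds of each timed run.
    */
  def benchmark(warmups: Int, runs: Int)(body: => Unit): Seq[Double] = {
    (1 to warmups).foreach(_ => body)
    (1 to runs).map(_ => timed(body)._2)
  }
}
```

Running `benchmark` once around the sequential indexing code and once around the parallel version, on the same input and a fresh index each time, gives two series of timings to compare.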