spark-lucenerdd

Allow multithreaded indexing

Open zouzias opened this issue 7 years ago • 6 comments

Lucene's IndexWriter is thread-safe, so indexing should be multithreaded per executor.

zouzias, Nov 19 '16 23:11

Is this still the case? If so, any chance you can point me in the right direction to try to implement this?

yeikel, Jan 29 '19 21:01

I made an attempt in the past, but I didn't see any improvement.

If you want to try it out, you need to change the following lines of code

iterIndex.foreach { case elem =>
    // (implicitly) convert type T to Lucene document
    val doc = docConversion(elem)
    indexWriter.addDocument(FacetsConfig.build(taxoWriter, doc))
}

(quoted from https://github.com/zouzias/spark-lucenerdd/blob/master/src/main/scala/org/zouzias/spark/lucenerdd/partition/LuceneRDDPartition.scala#L69)

to be executed in parallel. Feel free to make a PR if you manage to make any improvements.
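
As a rough illustration of what running that loop in parallel could look like, here is a minimal sketch that uses only the JDK's ExecutorService. The helper name indexInParallel, the numThreads and batchSize parameters, and the generic index callback are assumptions made for this example and are not part of spark-lucenerdd; the callback stands in for the indexWriter.addDocument call quoted above. Batches are read on the calling thread because a Scala Iterator is not safe to share between threads.

```scala
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicLong

object ParallelIndexingSketch {

  /** Applies `index` to every element of `iter` on a fixed pool of worker threads.
    * Hypothetical helper: `index` must be safe to call from several threads at once.
    * Returns the number of elements processed.
    */
  def indexInParallel[T](iter: Iterator[T], numThreads: Int, batchSize: Int)(index: T => Unit): Long = {
    val pool = Executors.newFixedThreadPool(numThreads)
    val processed = new AtomicLong(0L)
    try {
      // Pull batches from the iterator on this thread only, then hand each batch to the pool.
      // Note: all batches are queued up front, so the whole input ends up held in memory.
      val futures = iter.grouped(batchSize).map { batch =>
        pool.submit(new Runnable {
          def run(): Unit = batch.foreach { elem =>
            index(elem)
            processed.incrementAndGet()
          }
        })
      }.toList

      // Wait for every batch; get() rethrows a worker failure as an ExecutionException.
      futures.foreach(_.get())
    } finally {
      pool.shutdown()
    }
    processed.get()
  }
}
```

In the partition code this might be called as `indexInParallel(iterIndex, numThreads = 4, batchSize = 1000) { elem => indexWriter.addDocument(FacetsConfig.build(taxoWriter, docConversion(elem))) }`, assuming docConversion and the taxonomy writer are also safe to call concurrently. Whether this is actually faster than the sequential loop would have to be measured.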

zouzias, Jan 31 '19 22:01

> I made an attempt in the past, but I didn't see any improvement.

Did you use any script to monitor the performance of this? Or how did you benchmark it?

yeikel, Feb 04 '19 19:02

I measured the time to index Wikipedia and, to my surprise, multithreading didn't help at all. It even made things worse.

Unfortunately, I don't have any numbers to share. I can try to find the DataFrame of Wikipedia articles if you are interested in taking a second look at it.
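
For anyone repeating the comparison, a wall-clock measurement of the indexing step could be captured with something like the sketch below. The timeMillis helper is a hypothetical name introduced here, not part of spark-lucenerdd, and a single run like this is only a coarse indicator since it ignores JVM warm-up and cluster noise.

```scala
object IndexTimingSketch {

  /** Runs `body` once, prints the elapsed wall-clock time, and returns the result with the time in ms. */
  def timeMillis[A](label: String)(body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    println(s"$label took $elapsedMs ms")
    (result, elapsedMs)
  }
}
```

Wrapping the sequential and the multithreaded indexing runs in the same helper, on the same input, would give two numbers that can be compared directly.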

zouzias, Feb 07 '19 21:02

Yes please

yeikel, Feb 08 '19 02:02

I have the data on AWS S3. If you have an account, send me your AWS account ID and I will give you access.

zouzias, Feb 23 '19 21:02