spark-lucenerdd

Store query that links the documents

Open yeikel opened this issue 5 years ago • 3 comments

Is your feature request related to a problem? Please describe.

Currently, I generate the linker dynamically from the input. If the data for a particular field is not present, I don't include that field in the linker.
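To make "generated dynamically" concrete, here is a minimal sketch in plain Scala. The `Person` record and its field names are hypothetical and only illustrate the idea; it does no Lucene escaping, which real input would need.

```scala
// Hypothetical input record: any field may be missing.
final case class Person(name: Option[String], city: Option[String], email: Option[String])

object DynamicLinker {
  // Build one field:"value" clause per field that is present and non-blank,
  // then join the clauses with AND. Missing fields are skipped entirely.
  // NOTE: values are not escaped for Lucene query syntax (assumption: clean input).
  def linker(p: Person): String = {
    val clauses = Seq("name" -> p.name, "city" -> p.city, "email" -> p.email).collect {
      case (field, Some(value)) if value.trim.nonEmpty =>
        field + ":\"" + value.trim + "\""
    }
    clauses.mkString(" AND ")
  }
}
```

Because the clauses differ from record to record, two records can be linked by quite different queries, and a record with no usable fields produces an empty query string.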

When I retrieve the results, I would like an extra column showing the query that linked the two documents together, so I can debug and improve the queries if needed.

Given the complexity of Lucene (analyzers, tokens, filters, etc.), I believe this feature would help in the debugging process.

I am looking for something like this:

DataFrame = spark.createDataFrame(linkedResults
       .map { case (left, topDocs , query) =>

Or perhaps:

DataFrame = spark.createDataFrame(linkedResults
       .map { case (left, topDocs) =>
         topDocs.query

Describe alternatives you've considered

Logging is one option, but I would have to match the logged queries to the results manually, which is not really practical.

Another option is to create a column and store the query before executing the linker. The only problem with this solution is that I need to run the function that generates the linker twice: once when preparing the data (excluding this column from the analysis) and again when the actual link execution happens. For a large input this could be very time-consuming. This is a more realistic alternative, and I believe I could implement it right now.

yeikel, Apr 01 '19 18:04

@zouzias If you feel that this does not apply to the general use case, please point me in the right direction so I can see whether I can implement it for mine.

yeikel, Apr 01 '19 18:04

I think it is a nice-to-have feature. For now, you can compute the queries by just doing

lucenerdd.map(x => linkageFunction(x))

or, instead of lucenerdd, use the RDD that contains the "queries".
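One way to read this workaround is to compute each query once, keep it next to its record, and let the linker simply return the stored string, so the query travels with the left side of every result. The sketch below uses plain Scala collections in place of RDDs. `fakeLink` is a hypothetical stand-in for the link step, not the spark-lucenerdd API, and `linkageFunction` is a toy version of the user's function.

```scala
object CarryQuery {
  type Record = Map[String, String]

  // Toy stand-in for the user's linkage function (hypothetical).
  def linkageFunction(r: Record): String =
    r.toSeq.sortBy(_._1).map { case (k, v) => k + ":" + v }.mkString(" AND ")

  // Hypothetical stand-in for a link step: for each left item, generate the
  // query and return the item with its "hits" (here just an echo of the query).
  def fakeLink[T](left: Seq[T], queryGen: T => String): Seq[(T, Seq[String])] =
    left.map(item => (item, Seq("doc matching: " + queryGen(item))))

  // Compute the query once per record, then link on the precomputed string.
  def linkWithQuery(records: Seq[Record]): Seq[(Record, String, Seq[String])] = {
    val withQueries = records.map(r => (r, linkageFunction(r)))
    fakeLink[(Record, String)](withQueries, _._2).map {
      case ((rec, query), hits) => (rec, query, hits)
    }
  }
}
```

With this shape the linkage function runs only once per record, and the query is available as an extra column when the results are turned into a DataFrame. Whether the same pattern fits the real link call depends on its signature, which is assumed here rather than confirmed.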

zouzias, Apr 11 '19 07:04

If you want to work on it, go ahead. One comment: should the type of query be a String or a Query in

DataFrame = spark.createDataFrame(linkedResults
       .map { case (left, topDocs, query) => topDocs.query })

zouzias, Apr 11 '19 07:04