spark-lucenerdd
Store query that links the documents
Is your feature request related to a problem? Please describe.
Currently, I generate the linker dynamically from the input. If the data for a particular field is not present, I don't include that field in the linker.
When I retrieve the results, I would like an extra column that shows which query linked the two documents together, so that I can debug and improve the queries if needed.
Given the nature and complexity of Lucene (analyzers, tokens, filters, etc.), I believe this feature would help in the debugging process.
I am looking for something like this:
DataFrame = spark.createDataFrame(linkedResults
.map { case (left, topDocs , query) =>
Or perhaps:
DataFrame = spark.createDataFrame(linkedResults
.map { case (left, topDocs ) =>
topDocs.query
Describe alternatives you've considered
Logging is one option, but I would need to match the logged queries to the results manually, which is not really possible or practical.
Another option is to create a column and store the query before the linker executes. The only problem with this solution is that I need to run the function that generates the linker twice: once when preparing the data (excluding this column from the analysis) and again when the actual linkage happens. For a large input this could be very time-consuming. Still, this is the more realistic alternative, and I believe I could implement it right now.
@zouzias If you feel that this does not apply to the general use case, please point me in the right direction so I can implement it for mine.
I think it is a nice-to-have feature. For now, you can compute the queries by just doing
lucenerdd.map(x => linkageFunction(x))
or, instead of lucenerdd, use the RDD that contains the "queries".
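As a rough illustration of the workaround above, here is a minimal sketch that builds each query once and keeps it paired with its record, so the query string can later be shown as an extra column. It uses plain Scala collections in place of an RDD so that it runs without Spark. The `Person` type, its fields, the body of `linkageFunction`, and `storedQueryLinker` are all hypothetical and are not part of spark-lucenerdd.

```scala
object QueryColumnSketch {
  // Hypothetical record type; the field names are illustrative only.
  final case class Person(name: String, city: Option[String])

  // Hypothetical stand-in for the user's linker generator: builds a
  // Lucene-style query string and leaves out fields that have no data.
  def linkageFunction(p: Person): String =
    Seq(Some(s"name:${p.name}"), p.city.map(c => s"city:$c")).flatten.mkString(" AND ")

  // Build each query exactly once and keep it next to its record.
  // On an RDD the equivalent call would be rdd.map(r => (r, linkageFunction(r))).
  def withQueries(records: Seq[Person]): Seq[(Person, String)] =
    records.map(r => (r, linkageFunction(r)))

  // A linker that reuses the stored query instead of regenerating it.
  val storedQueryLinker: ((Person, String)) => String = _._2

  def main(args: Array[String]): Unit = {
    val records = Seq(Person("alice", Some("zurich")), Person("bob", None))
    withQueries(records).foreach { case (p, q) => println(s"${p.name} -> $q") }
  }
}
```

If the paired collection is the one handed to the linkage step, a function such as `storedQueryLinker` could serve as the linker, which may avoid generating the query a second time. Whether that fits depends on how the linkage is invoked in your setup.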
If you want to work on it, go ahead. One comment: should the type of query be a String or a Query in the following?
DataFrame = spark.createDataFrame(linkedResults
.map { case (left, topDocs, query ) => topDocs.query }