Adding splitting information in the metadata of DocumentSplitter output

Open tradicio opened this issue 1 year ago • 6 comments

Is your feature request related to a problem? Please describe. When splitting a document in Haystack v1, the method _create_docs_from_splits() of the PreProcessor() class saves relevant split metadata such as _split_id and _split_overlap. So far, the DocumentSplitter class in Haystack v2 does not do the same. I think it would be useful to include this metadata in the final output of DocumentSplitter.

Describe the solution you'd like: The run() method could reproduce an adapted version of _create_docs_from_splits() from the PreProcessor() class in Haystack v1.
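To make the request concrete, here is a minimal sketch of the kind of metadata I have in mind. It does not use the Haystack API: Chunk and split_with_metadata are hypothetical names, the splitting is by words only, and _split_overlap is simplified to the number of words shared with the previous chunk, which may differ from what Haystack v1 actually stores.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in for a document object (content plus meta)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def split_with_metadata(
    text: str, source_id: str, split_length: int = 5, split_overlap: int = 2
) -> List[Chunk]:
    """Split text into overlapping word windows and record their order in meta."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")

    words = text.split()
    step = split_length - split_overlap
    chunks: List[Chunk] = []

    for split_id, start in enumerate(range(0, len(words), step)):
        window = words[start:start + split_length]
        chunks.append(
            Chunk(
                content=" ".join(window),
                meta={
                    "source_id": source_id,  # assumed key for the parent document
                    "_split_id": split_id,  # position of the chunk in the source
                    # Simplification: words shared with the previous chunk.
                    "_split_overlap": split_overlap if split_id > 0 else 0,
                },
            )
        )
        # Stop once the current window already reaches the end of the text.
        if start + split_length >= len(words):
            break

    return chunks
```

With this in place, every chunk carries enough information to restore its original position, even if the document store returns chunks in arbitrary order.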

Additional context: To check the differences between the outputs of the two Haystack versions, you need to build an indexing pipeline in Haystack v2 and compare its output with the output of the PreProcessor in Haystack v1.

tradicio avatar Mar 20 '24 15:03 tradicio

@tradicio are you using this information in your application? Please explain your use case to understand how we can support it, possibly reintroducing this information.

anakin87 avatar Mar 22 '24 15:03 anakin87

Hi @anakin87, thanks for your message!

You're correct, I am using _split_id and _split_overlap in my application to keep track of the order of the chunks produced by PreProcessor(). This lets me display the text chunks as they are stored in the DocumentStore(), and it is very useful for checking how the information from a long text is divided and ordered.

Now that I am introducing Haystack v2 in the application, I would like to keep this functionality so that I can continue to display the sorted output of the DocumentSplitter() class.
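As a rough illustration of the use case, this is how the ordering could be restored on my side once the metadata is available. It is only a sketch on plain dictionaries: order_chunks is a hypothetical helper, and the source_id and _split_id keys are assumptions about what the splitter output would contain.

```python
from collections import defaultdict
from typing import Any, Dict, List


def order_chunks(chunks: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group chunks by source document and sort each group by _split_id."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for chunk in chunks:
        grouped[chunk["meta"]["source_id"]].append(chunk)

    return {
        source: [
            c["content"]
            for c in sorted(group, key=lambda c: c["meta"]["_split_id"])
        ]
        for source, group in grouped.items()
    }


# Chunks as they might come back from a store, in no particular order.
stored = [
    {"content": "third part", "meta": {"source_id": "doc-1", "_split_id": 2}},
    {"content": "first part", "meta": {"source_id": "doc-1", "_split_id": 0}},
    {"content": "only part", "meta": {"source_id": "doc-2", "_split_id": 0}},
    {"content": "second part", "meta": {"source_id": "doc-1", "_split_id": 1}},
]
ordered = order_chunks(stored)
```

Without _split_id there is no reliable key to sort on, which is why I would like DocumentSplitter to provide it.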

Let me know if you need further clarification. I would be really glad to contribute!

tradicio avatar Mar 27 '24 14:03 tradicio

Ok, I understand!

Let's involve @julian-risch for an opinion since he worked on these components.

anakin87 avatar Mar 27 '24 15:03 anakin87

@tradicio Thanks for the feedback. I agree that we should add the _split_id and _split_overlap metadata to the next iteration of DocumentSplitter, yes. 👍

julian-risch avatar Mar 27 '24 15:03 julian-risch

@tradicio If you feel like it, go ahead and try to create a PR.

anakin87 avatar Mar 27 '24 15:03 anakin87

Sure, I'll try to create a PR in the next few days! Thanks for all your support, I am really glad to contribute.

tradicio avatar Mar 28 '24 09:03 tradicio