spring-ai
Add text splitter to Tika document reader
This makes it possible for TikaDocumentReader to read large PDFs.
Added a simple setter for chunk size, still defaulting to 800. I considered other options...
- Yet another constructor argument. Decided against this because `TikaDocumentReader` already has a lot of constructors and, generally, I favor constructors for things that must be set and setters for things that can optionally be set (as is the case here).
- A separate configuration class to carry this property (and potentially other properties to configure the reader). I decided against this because at the moment there's only this property to be set.
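A minimal sketch of the setter approach described above. The class and method names here are illustrative stand-ins, not the actual spring-ai API; the point is just the shape: the constructor surface stays unchanged, and chunk size becomes an optional property with the existing default.

```java
// Illustrative sketch only: a reader-like class exposing chunk size as an
// optional setter while keeping 800 as the default.
public class ReaderWithChunkSize {

    private int chunkSize = 800; // existing default preserved

    // Optional property: callers that never touch it get today's behavior.
    public void setChunkSize(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
    }

    public int getChunkSize() {
        return chunkSize;
    }
}
```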
I still kinda think that TextSplitter's splitText() method could be public. If it were, then TikaDocumentReader could be injected with a TextSplitter and simply call splitText() without having to concern itself with the chunk size. Chunk size would be solely a TextSplitter concern. It'd also open up the possibility of other TextSplitter implementations to be used. In short...I'd likely do the following:
- Make `TextSplitter`'s `splitText()` method public.
- Change `TokenTextSplitter` to not hardcode 800 as the chunk size and allow it to be configured (while maybe still defaulting to 800).
- Either create a no-op implementation of `TextSplitter` as the default `TextSplitter` that `TikaDocumentReader` uses, or simply use the extracted text as-is if no `TextSplitter` has been specified for `TikaDocumentReader`.
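To make the proposal concrete, here is a self-contained sketch of that design. These are standalone illustrative types, not the real spring-ai classes: `splitText()` is public on the interface, chunk size is configurable on the sizing implementation, and a no-op implementation passes the extracted text through unchanged.

```java
import java.util.ArrayList;
import java.util.List;

// Public split method, per the proposal above.
interface SimpleTextSplitter {
    List<String> splitText(String text);
}

// No-op default: the reader works even when no splitter is configured,
// returning the extracted text as a single chunk.
class NoOpTextSplitter implements SimpleTextSplitter {
    @Override
    public List<String> splitText(String text) {
        return List.of(text);
    }
}

// Configurable chunk size instead of a hardcoded 800.
class FixedSizeTextSplitter implements SimpleTextSplitter {

    private final int chunkSize;

    FixedSizeTextSplitter(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    @Override
    public List<String> splitText(String text) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < text.length(); i += chunkSize) {
            chunks.add(text.substring(i, Math.min(text.length(), i + chunkSize)));
        }
        return chunks;
    }
}
```

With this shape, a reader injected with a `SimpleTextSplitter` never needs to know about chunk size at all; that concern lives entirely in the splitter implementation.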
This was a bit removed from my goal of making TikaDocumentReader capable of handling large PDFs, and would be a more involved change that I'd rather get feedback on before I applied it. Instead I chose the more direct approach in this PR. But, I'll be happy to alter this PR (or close it and open a new PR) with the TextSplitter changes described above if you think that'd be a good change.
I did a bit of research and found this https://stackoverflow.com/questions/31079433/how-to-read-large-files-using-tika
It seems like changing
```java
public TikaDocumentReader(Resource resource, ExtractedTextFormatter textFormatter) {
	this(resource, new BodyContentHandler(), textFormatter);
}
```

to

```java
public TikaDocumentReader(Resource resource, ExtractedTextFormatter textFormatter) {
	this(resource, new BodyContentHandler(-1), textFormatter);
}
```
will remove the limitation you encountered. I think getting the full text in the `TikaDocumentReader` and then composing/chaining the output of `TikaDocumentReader` into the input of `TokenTextSplitter` gives full control. If a user wants to have a limit inside Tika, then passing in a `BodyContentHandler` instance configured with that property would achieve that goal.
Chaining functions in the "ETL Pipeline" vs. having TokenTextSplitter as a property of TikaDocumentReader gives the most flexibility in the spirit of having a Java based functional pipeline. The same approach applies to anyone using any other implementation of DocumentReader.
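The chaining shape can be sketched as below. In spring-ai, `DocumentReader` is (roughly) a `Supplier<List<Document>>` and `TokenTextSplitter` a `Function<List<Document>, List<Document>>`, so the splitter simply applies to the reader's output; here plain strings stand in for `Document` to keep the example self-contained, and the helper name is illustrative.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

// Sketch of the functional "ETL pipeline" composition: the reader produces
// documents, the splitter transforms them, and neither knows about the other.
class EtlPipelineSketch {

    static List<String> runPipeline(Supplier<List<String>> reader,
                                    Function<List<String>, List<String>> splitter) {
        // Read first, then transform: chunking stays a splitter concern.
        return splitter.apply(reader.get());
    }
}
```

With the real classes this would read something like `new TokenTextSplitter().apply(new TikaDocumentReader(resource).get())`, and any other `DocumentReader` implementation slots into the same chain.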
@habuma I'd like to close this, please read my last comment.
Closing, as adding a text splitter to the functional composition chain achieves the request.