
Add text splitter to Tika document reader

Open habuma opened this issue 1 year ago • 2 comments

This makes it possible for TikaDocumentReader to read large PDFs.

habuma avatar Dec 13 '23 22:12 habuma

Added a simple setter for chunk size, still defaulting to 800. I considered other options...

  • Yet another constructor argument. Decided against this because TikaDocumentReader already has a lot of constructors and generally, I favor constructors for things that must be set and setters for things that can optionally be set (as is the case here).
  • A separate configuration class to carry this property (and potentially other properties to configure the reader). I decided against this because at the moment there's only this one property to be set.
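A minimal sketch of the setter-based approach described above; the field and method names here are illustrative, not necessarily the actual Spring AI API:

```java
// Hypothetical sketch of the setter-based approach; names are illustrative
// stand-ins, not the real Spring AI TikaDocumentReader.
class TikaDocumentReaderSketch {

    // Default chunk size, matching the existing hardcoded value of 800.
    private int chunkSize = 800;

    // Optional setter: callers reading large PDFs can raise the chunk size
    // without touching any of the existing constructors.
    public void setChunkSize(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
    }

    public int getChunkSize() {
        return this.chunkSize;
    }
}
```

This keeps the constructor surface unchanged: required collaborators stay constructor arguments, while the optional chunk size is set after construction only when the default is not wanted.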

I still kinda think that TextSplitter's splitText() method could be public. If it were, then TikaDocumentReader could be injected with a TextSplitter and simply call splitText() without having to concern itself with the chunk size. Chunk size would be solely a TextSplitter concern. It'd also open up the possibility of using other TextSplitter implementations. In short...I'd likely do the following:

  • Make TextSplitter's splitText() method public.
  • Change TokenTextSplitter to not hardcode 800 as the chunk size and allow it to be configured. (While maybe still defaulting to 800.)
  • Either create a no-op implementation of TextSplitter as the default TextSplitter that TikaDocumentReader uses or simply use the extracted text as-is if no TextSplitter has been specified for TikaDocumentReader.
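The no-op option in the last bullet could look something like this sketch, assuming a simplified TextSplitter abstraction (the real Spring AI type has more structure than this single-method interface):

```java
import java.util.List;

// Simplified stand-in for the TextSplitter abstraction; illustrative only.
interface TextSplitterSketch {
    List<String> splitText(String text);
}

// A no-op splitter: returns the extracted text as a single chunk, so a reader
// configured with it behaves exactly as if no splitting were requested.
class NoOpTextSplitter implements TextSplitterSketch {
    @Override
    public List<String> splitText(String text) {
        return List.of(text);
    }
}
```

With such a default in place, TikaDocumentReader never needs a special "no splitter configured" branch: it always delegates to whatever TextSplitter it holds.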

This was a bit removed from my goal of making TikaDocumentReader capable of handling large PDFs, and it would be a more involved change that I'd rather get feedback on before applying. Instead I chose the more direct approach in this PR. But I'll be happy to alter this PR (or close it and open a new one) with the TextSplitter changes described above if you think that'd be a good change.

habuma avatar Dec 14 '23 05:12 habuma

I did a bit of research and found this https://stackoverflow.com/questions/31079433/how-to-read-large-files-using-tika

It seems like changing

	public TikaDocumentReader(Resource resource, ExtractedTextFormatter textFormatter) {
		this(resource, new BodyContentHandler(), textFormatter);
	}

to

	public TikaDocumentReader(Resource resource, ExtractedTextFormatter textFormatter) {
		this(resource, new BodyContentHandler(-1), textFormatter);
	}

will remove the limitation you encountered (passing -1 disables the write limit that BodyContentHandler applies by default). I think getting the full text in the TikaDocumentReader and then chaining the output of TikaDocumentReader into the input of TokenTextSplitter gives full control. If a user wants a limit inside Tika itself, passing in a properly configured BodyContentHandler instance would achieve that goal.

Chaining functions in the "ETL Pipeline", rather than having TokenTextSplitter as a property of TikaDocumentReader, gives the most flexibility in the spirit of a Java-based functional pipeline. The same approach applies to anyone using any other implementation of DocumentReader.
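The chaining style described above can be sketched as follows. The Document record and the reader/splitter signatures are simplified stand-ins for the Spring AI ones, and the character-based chunking is a naive placeholder for real token-based splitting:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Simplified stand-in for Spring AI's Document type.
record Document(String content) {}

class EtlPipelineSketch {

    // Reader stage: in real code this would be TikaDocumentReader producing
    // the full extracted text as one Document.
    static List<Document> read() {
        return List.of(new Document("a very long extracted text ..."));
    }

    // Splitter stage: in real code this would be TokenTextSplitter applied
    // to the reader's output.
    static List<Document> split(List<Document> documents, int chunkSize) {
        return documents.stream()
                .flatMap(doc -> chunk(doc.content(), chunkSize).stream())
                .map(Document::new)
                .collect(Collectors.toList());
    }

    // Naive character-based chunking, standing in for token-based splitting.
    static List<String> chunk(String text, int size) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < text.length(); i += size) {
            chunks.add(text.substring(i, Math.min(text.length(), i + size)));
        }
        return chunks;
    }

    // The pipeline: reader output is piped into the splitter, keeping
    // chunking a splitter concern rather than a reader property.
    static List<Document> pipeline(int chunkSize) {
        return split(read(), chunkSize);
    }
}
```

The point of the sketch is the shape, not the chunking: each stage consumes and produces a list of Documents, so any DocumentReader can feed any TextSplitter without either knowing about the other's configuration.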

markpollack avatar Dec 20 '23 21:12 markpollack

@habuma I'd like to close this, please read my last comment.

markpollack avatar Jan 25 '24 22:01 markpollack

Closing, as adding a text splitter to the functional composition chain achieves the request.

markpollack avatar Feb 29 '24 15:02 markpollack