Introduce MarkdownDocumentReader

Open piotrooo opened this issue 1 year ago • 2 comments

Motivation

@markpollack and @tzolov, you may be interested in a new MarkdownDocumentReader that reads structured Markdown documents. As @markpollack wrote in #105, it could be valuable, and I agree.

So, I've prepared a simple implementation of that DocumentReader.

Description

For parsing Markdown documents, I've used the commonmark/commonmark-java library.

Document dividing

By default, each Markdown document is split at its headers, covering all header levels from 1 to 6. For a simple document like:

# AAA
content 1

## BBB
content 2

### CCC
content 3

#### DDD
content 4

##### EEE
content 5

###### FFF
content 6

Six documents will be generated. Each of these documents will have entries in the metadata as follows:

  • category => header_X, where X is the header level (1 to 6)
  • title => <header title>, e.g.: BBB from the example
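To make the splitting rule concrete, here is a rough, dependency-free sketch of the idea. It is not the proposed implementation, which parses with commonmark-java; the class `HeaderSplitSketch`, the `Chunk` record, and the regex-based header detection are assumptions made only for illustration, and the sketch ignores edge cases such as `#` lines inside code blocks.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: illustrates the splitting rule only, not the real reader.
final class HeaderSplitSketch {

    // Minimal stand-in for a document: text plus metadata.
    record Chunk(String text, Map<String, Object> metadata) {}

    // ATX headers only: 1 to 6 '#' characters, whitespace, then the title.
    private static final Pattern HEADER = Pattern.compile("^(#{1,6})\\s+(.*)$");

    static List<Chunk> splitByHeaders(String markdown) {
        List<Chunk> chunks = new ArrayList<>();
        Map<String, Object> metadata = null; // null until the first header is seen
        StringBuilder body = new StringBuilder();

        for (String line : markdown.split("\\R")) {
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                // A new header closes the chunk collected so far.
                flush(chunks, metadata, body);
                metadata = new LinkedHashMap<>();
                metadata.put("category", "header_" + m.group(1).length());
                metadata.put("title", m.group(2).trim());
                body = new StringBuilder();
            } else if (!line.isBlank()) {
                if (body.length() > 0) {
                    body.append('\n');
                }
                body.append(line.trim());
            }
        }
        flush(chunks, metadata, body);
        return chunks;
    }

    private static void flush(List<Chunk> chunks, Map<String, Object> metadata, StringBuilder body) {
        if (metadata == null && body.length() == 0) {
            return; // nothing collected before the first header
        }
        chunks.add(new Chunk(body.toString(), metadata == null ? Map.of() : metadata));
    }
}
```

Run against the six-header example above, this sketch produces six chunks; the second one carries category=header_2, title=BBB and the text "content 2".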

There is also an option to divide the Markdown document by horizontal lines. This is not the default option, but it can be turned on through configuration.
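As an illustration of the opt-in behavior, the sketch below splits text on horizontal rules only when a flag is set. The class name `HorizontalRuleSplitSketch` and the `splitOnRules` flag are hypothetical and do not reflect the actual configuration API; the rule detection is a simplified regex (three or more `-`, `*` or `_`), so a `---` line directly under a paragraph, which CommonMark treats as a setext heading, is not handled correctly here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of opt-in splitting on horizontal rules.
final class HorizontalRuleSplitSketch {

    // Up to 3 leading spaces, then 3 or more of the same marker, optionally space-separated.
    private static final Pattern RULE =
            Pattern.compile("^ {0,3}([-*_])(?:[ \\t]*\\1){2,}[ \\t]*$");

    static List<String> split(String markdown, boolean splitOnRules) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : markdown.split("\\R")) {
            if (splitOnRules && RULE.matcher(line).matches()) {
                // The rule itself is dropped; it only marks a boundary.
                addIfNotBlank(parts, current);
                current = new StringBuilder();
            } else {
                current.append(line).append('\n');
            }
        }
        addIfNotBlank(parts, current);
        return parts;
    }

    private static void addIfNotBlank(List<String> parts, StringBuilder text) {
        String trimmed = text.toString().strip();
        if (!trimmed.isEmpty()) {
            parts.add(trimmed);
        }
    }
}
```

With the flag off, the rule lines stay in the text and the input comes back as a single part, which mirrors the "not the default" behavior described above.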

Blockquotes and Code Blocks support

All blockquotes and code blocks are treated as separate documents. For code blocks where the language can be determined, it is included in the lang metadata entry.

This behavior can be changed by setting options.
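The following sketch shows one way the separation could work, assuming fenced code blocks and `>`-prefixed blockquotes only. Everything here is hypothetical: the class `CodeBlockSketch`, the `Chunk` record, and the category values `code_block` and `blockquote` are placeholders chosen for the example; only the `lang` metadata key comes from the description above. The real reader relies on commonmark-java rather than line scanning.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: code blocks and blockquotes become their own chunks.
final class CodeBlockSketch {

    record Chunk(String text, Map<String, Object> metadata) {}

    // Three backticks, built at runtime to keep this example easy to embed.
    private static final String FENCE = "`".repeat(3);

    static List<Chunk> split(String markdown) {
        List<Chunk> chunks = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        String kind = "text"; // "text", "blockquote" or "code_block"
        String lang = "";

        for (String line : markdown.split("\\R")) {
            if (kind.equals("code_block")) {
                if (line.strip().equals(FENCE)) { // closing fence
                    emit(chunks, buffer, kind, lang);
                    kind = "text";
                    lang = "";
                } else {
                    buffer.append(line).append('\n');
                }
            } else if (line.startsWith(FENCE)) { // opening fence
                emit(chunks, buffer, kind, lang);
                kind = "code_block";
                String info = line.substring(FENCE.length()).strip();
                lang = info.isEmpty() ? "" : info.split("\\s+")[0];
            } else if (line.startsWith(">")) {
                if (!kind.equals("blockquote")) {
                    emit(chunks, buffer, kind, lang);
                    kind = "blockquote";
                }
                buffer.append(line.substring(1).strip()).append('\n');
            } else {
                if (kind.equals("blockquote")) {
                    emit(chunks, buffer, kind, lang);
                    kind = "text";
                }
                buffer.append(line).append('\n');
            }
        }
        emit(chunks, buffer, kind, lang);
        return chunks;
    }

    private static void emit(List<Chunk> chunks, StringBuilder buffer, String kind, String lang) {
        // Keep leading indentation of code; trim everything else.
        String text = kind.equals("code_block")
                ? buffer.toString().stripTrailing()
                : buffer.toString().strip();
        buffer.setLength(0);
        if (text.isEmpty()) {
            return;
        }
        Map<String, Object> metadata = new LinkedHashMap<>();
        if (!kind.equals("text")) {
            metadata.put("category", kind);
        }
        if (!lang.isEmpty()) {
            metadata.put("lang", lang); // only set when the fence names a language
        }
        chunks.add(new Chunk(text, metadata));
    }
}
```

A fence without an info string yields a code chunk with no `lang` entry, which matches the "where the language can be determined" wording.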

Additional metadata

The Markdown reader configuration also supports additional metadata, which is applied to every processed document. These are fixed values that give more context about the created document, such as the name of the service that provides it or the environment in which it was created.
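A minimal sketch of how such fixed values might be combined with per-document metadata is shown below. The class `AdditionalMetadataSketch`, the `merge` helper, and the example keys `service` and `environment` are hypothetical. The precedence rule (document-specific entries win on a key clash) is also an assumption, since the description does not say how conflicts are resolved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: attach the same fixed metadata to every document.
final class AdditionalMetadataSketch {

    static Map<String, Object> merge(Map<String, Object> documentMetadata,
                                     Map<String, Object> additionalMetadata) {
        // Start from the fixed values, then overlay the per-document entries.
        // Assumption: per-document entries win if both maps define the same key.
        Map<String, Object> merged = new LinkedHashMap<>(additionalMetadata);
        merged.putAll(documentMetadata);
        return merged;
    }
}
```

For example, merging `{category=header_1, title=AAA}` with `{service=docs-api, environment=staging}` gives a map with all four entries.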

TODO

piotrooo, Jul 24 '24 07:07