Providing specific format in config is not accepted

Open · roel4ez opened this issue 1 year ago · 1 comment

The config.json file has a data_formats field, which can be set to all or to a specific format such as html, docx, or pdf. Setting it to a specific format does not work.

Expected behavior

Providing a specific value such as html or docx for data_formats should result in only files of that format being loaded and indexed.

Current behavior

Providing a specific value such as docx for data_formats results in no files being loaded or indexed, and the following logs show up:

Loading documents from <my_folder>/data with allowed formats d, o, c, x
Format d is not supported
Format o is not supported
Format c is not supported
Format x is not supported
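
The log output is consistent with the configured string being iterated one character at a time. The following is a minimal Python sketch of that effect, not the accelerator's actual loader code: `SUPPORTED_FORMATS` and `unsupported_formats` are hypothetical names, and the assumption is that the loader loops over whatever value `data_formats` holds.

```python
# Hypothetical illustration; these names are assumptions, not the project's API.
SUPPORTED_FORMATS = {"pdf", "html", "markdown", "json", "text", "docx"}


def unsupported_formats(data_formats):
    """Return every entry of data_formats that is not a supported format."""
    # Iterating a str yields single characters, so "docx" is seen as d, o, c, x.
    return [fmt for fmt in data_formats if fmt not in SUPPORTED_FORMATS]


print(unsupported_formats("docx"))    # ['d', 'o', 'c', 'x']
print(unsupported_formats(["docx"]))  # []
```

If that assumption holds, wrapping the value in a list is enough to make the format be treated as a single entry.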

roel4ez, Dec 14 '23 13:12

The data_formats field accepts either "all" or an array of supported values.

Example: "data_formats": ["docx"]

This is not documented, so we will add it to the backlog.

Supported values are:

  1. "data_formats": ["pdf", "html", "markdown", "json", "text", "docx"]
  2. "data_formats": "all"
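
A config loader could also tolerate a bare string by normalizing the value before use. This is only a sketch under stated assumptions: `normalize_data_formats` is a hypothetical helper, not a function in rag-experiment-accelerator, and the supported list is copied from the values above.

```python
# Hypothetical helper; not part of the project's code base.
SUPPORTED_FORMATS = ("pdf", "html", "markdown", "json", "text", "docx")


def normalize_data_formats(value):
    """Turn a data_formats config value into a validated list of formats."""
    if value == "all":
        return list(SUPPORTED_FORMATS)
    if isinstance(value, str):
        # Treat a bare string such as "docx" as a one-element list.
        value = [value]
    unknown = [fmt for fmt in value if fmt not in SUPPORTED_FORMATS]
    if unknown:
        raise ValueError(f"Unsupported data formats: {unknown}")
    return list(value)
```

With this helper, `normalize_data_formats("docx")` and `normalize_data_formats(["docx"])` both return `["docx"]`, while an unknown format raises a `ValueError` instead of silently loading nothing.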

joshuaphelpsms, Dec 15 '23 18:12