Title and Link Metadata for PDFs via Web and File Connectors

Open: alex-feel opened this issue 1 year ago • 1 comment

I frequently need to add online PDFs to Danswer and encounter limitations with both the Web and File connectors:

  1. The Web connector doesn't display a proper title in the sourced list, only showing pdf.
  2. With the File connector, it's cumbersome to embed the source link via #DANSWER_METADATA={"link": "<LINK>"} directly into the file for it to appear in the sourced list.
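For reference, the metadata line from point 2 could be read roughly as follows. This is a minimal sketch only: the function name `read_metadata_line` and the choice to ignore malformed JSON are assumptions made for illustration, not Danswer's actual parsing code.

```python
import json

# Prefix taken from the issue text; everything else here is hypothetical.
METADATA_PREFIX = "#DANSWER_METADATA="


def read_metadata_line(first_line: str) -> dict:
    """Return the JSON object from a leading metadata line, or {} if absent."""
    line = first_line.strip()
    if not line.startswith(METADATA_PREFIX):
        return {}
    try:
        parsed = json.loads(line[len(METADATA_PREFIX):])
    except json.JSONDecodeError:
        # Assumption: a malformed line is treated as "no metadata".
        return {}
    return parsed if isinstance(parsed, dict) else {}
```

For a binary format such as PDF, getting a line like this into the file at all is the cumbersome part, which is what motivates a separate input field.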

I propose enhancements for both connectors:

  1. For the Web connector, an optional field to manually input the page title should be added. If a manual title is provided, it should override the auto-detected title, with configurable priority settings.
  2. For the File connector, an optional field to input the source link <LINK> should be introduced, eliminating the need to edit the PDF file directly. The manual link should have priority over any detected link within the file's metadata, with configurable settings.
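The priority rule in both proposals could look something like the sketch below. The helper name `resolve_field` and the `manual_first` flag are hypothetical and stand in for whatever configurable setting the connectors would expose.

```python
from typing import Optional


def resolve_field(
    manual: Optional[str],
    detected: Optional[str],
    manual_first: bool = True,
) -> Optional[str]:
    """Pick between a manually entered value and an auto-detected one.

    Blank or whitespace-only strings are treated as missing. With
    manual_first=True (the proposed default), the manual value wins
    whenever it is present.
    """
    manual = (manual or "").strip() or None
    detected = (detected or "").strip() or None
    if manual_first:
        return manual or detected
    return detected or manual
```

The same helper would cover both the title for the Web connector and the link for the File connector, for example `resolve_field("Annual Report 2023", "pdf")`.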

Additionally, post-implementation, it will be essential to provide an edit feature for both connectors, enabling users to modify titles/links as needed.

This functionality is crucial for maintaining the accuracy and relevance of the sourced entries and will enhance the usability of sourced lists in Danswer.

alex-feel, Nov 05 '23 00:11

I haven't had time to clean it up, but I have a local branch that sets the semantic_identifier of web connector PDFs to the slug of the current page. I can work on getting that into a branch for a PR. The hang-up is another change I needed to make to the web connector to handle authenticated requests to an internal webpage. I should be able to cherry-pick just the PDF change and open a PR, but I might have been sloppy with my git check-ins and may need to rebase it into its own commit.
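As an illustration of the slug idea (not the code from that local branch), a title could be derived from the PDF's URL along these lines. The function name `slug_from_url` and the fallback to the full URL are assumptions.

```python
from urllib.parse import unquote, urlparse


def slug_from_url(url: str) -> str:
    """Return the last path segment of a URL, percent-decoded.

    Falls back to the full URL when the path has no usable segment,
    so the caller always gets a non-empty identifier for a non-empty URL.
    """
    path = urlparse(url).path.rstrip("/")
    slug = unquote(path.rsplit("/", 1)[-1])
    return slug or url
```

The result could then be assigned as the document's semantic_identifier in place of the generic pdf label.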

sjakos, Nov 07 '23 22:11