feat: update notion extractor

Open badbye opened this issue 1 year ago • 2 comments

Description

see this post: https://github.com/langgenius/dify/discussions/3883

This PR combines all the blocks of a Notion page into a single document. Headings are converted to Markdown style, so that users can apply a custom splitter to split the document into chunks. For example, \n## can be used to split by h2 (this may not work if the page contains a code block with a comment in it).
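
To illustrate the idea, here is a minimal sketch and not the code in this PR. It assumes each Notion block has already been reduced to a hypothetical `(block_type, text)` pair; `blocks_to_markdown` and `split_on_h2` are illustrative names, not functions from dify.

```python
from typing import List, Tuple

# Hypothetical, simplified block shape: (block_type, plain_text).
Block = Tuple[str, str]

# Markdown prefixes for heading block types; other types get no prefix.
HEADING_PREFIX = {"heading_1": "# ", "heading_2": "## ", "heading_3": "### "}


def blocks_to_markdown(blocks: List[Block]) -> str:
    """Join all blocks of a page into one Markdown-style document."""
    lines = []
    for block_type, text in blocks:
        lines.append(HEADING_PREFIX.get(block_type, "") + text)
    return "\n".join(lines)


def split_on_h2(document: str) -> List[str]:
    """Split the document on the "\\n## " separator, keeping the h2 marker."""
    parts = document.split("\n## ")
    return [parts[0]] + ["## " + part for part in parts[1:]]


page = [
    ("heading_1", "Title"),
    ("paragraph", "intro"),
    ("heading_2", "A"),
    ("paragraph", "a text"),
    ("heading_2", "B"),
    ("paragraph", "b text"),
]
chunks = split_on_h2(blocks_to_markdown(page))
```

Because the split is purely textual, a line starting with `## ` inside a code block (for example a comment) would also be treated as a chunk boundary, which is the caveat mentioned above.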

In my experience, too many chunks with too little content in each result in poor performance. This PR could improve that.

Type of Change

  • [X] Improvement, including but not limited to code refactoring, performance optimization, and UI/UX improvement

How Has This Been Tested?

A test script was added.

Suggested Checklist:

  • [X] I have performed a self-review of my own code
  • [X] I have commented my code, particularly in hard-to-understand areas
  • [X] My changes generate no new warnings
  • [X] I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods
  • [X] optional I have added tests that prove my fix is effective or that my feature works
  • [X] optional New and existing unit tests pass locally with my changes

badbye, Apr 26 '24 11:04

We plan to add more splitter rules for users to choose from as part of our roadmap, because no single rule is better than the others; each one simply fits certain types of questions better.

JohnJyong, Apr 29 '24 06:04

> We plan to add more splitter rules for users to choose from as part of our roadmap, because no single rule is better than the others; each one simply fits certain types of questions better.

Totally agree. This PR is actually not about the splitter. It only combines the blocks into a single document, so that you can then use a custom splitter to split the page. Do you mean you will not consider it until more split rules are added?

badbye, Apr 29 '24 08:04