feat: MarkdownHeaderSplitter
Proposed Changes:
Implement MarkdownHeaderSplitter to split Documents written in .md based on their headers
How did you test it?
unit tests
Checklist
- I have read the contributors guidelines and the code of conduct
- I have updated the related issue with new insights and changes
- I added unit tests and updated the docstrings
- I've used one of the conventional commit types for my PR title:
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:and added!in case the PR includes breaking changes. - I documented my code
- I ran pre-commit hooks and fixed any issue
@OGuggenbuehl definitely looks like an interesting approach! I've left an initial set of comments, but to further review I'd appreciate if you could add a set of tests like the ones we have for the DocumentSplitter https://github.com/deepset-ai/haystack/blob/main/test/components/preprocessors/test_document_splitter.py
This will help me be able to review the actual algorithm for splitting since it's easier to understand with examples.
Thanks for your continued work on this @OGuggenbuehl!
Some general comments. Could you:
- Add a release note for this PR following the instructions here
- Could you make sure to include our license header to the beginning of each file you've added. You can find an example of the license header here
- Please make sure to sign the CLA agreement (docs about it here) from this comment
- If you haven't already please also set up pre-commit hooks using
pre-commit install. You can find more info about that in this section of our contribution guidelines. - Also in the future feel free to open branches directly in Haystack instead of using a fork. This makes it slightly easier to pull down your code to review locally.
Pull Request Test Coverage Report for Build 19816136481
Details
- 0 of 0 changed or added relevant lines in 0 files are covered.
- 3 unchanged lines in 1 file lost coverage.
- Overall coverage remained the same at 92.189%
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| core/pipeline/async_pipeline.py | 3 | 65.88% |
| <!-- | Total: | 3 |
| Totals | |
|---|---|
| Change from base Build 19775495543: | 0.0% |
| Covered Lines: | 14174 |
| Relevant Lines: | 15375 |
💛 - Coveralls
Thank you for your review, @sjrl ! I resolved the remaining comments and turned this from draft to proper PR. note: pre-commit added some fixes to pre-existing files I did not touch otherwise - not sure if this is wanted
@sjrl I have been thinking about whether keeping _infer_header_levels as a method makes sense for this. it's only useful in certain cases, the algorithm does not perfectly recreate document structure in all cases and it is a distinct concern from the component as a whole. do you think it makes sense to keep it in or move it to a more experimental / internal repo instead?
@sjrl I have been thinking about whether keeping
_infer_header_levelsas a method makes sense for this. it's only useful in certain cases, the algorithm does not perfectly recreate document structure in all cases and it is a distinct concern from the component as a whole. do you think it makes sense to keep it in or move it to a more experimental / internal repo instead?
Good question! It does sound different from what our other splitters do and almost fits better into the DocumentCleaner abstraction that we have. If you find that you don't use it often and/or that it doesn't always work well I could see it as a separate component that we could add to https://github.com/deepset-ai/haystack-experimental first so we can easily make breaking changes to it and gather feedback before fully committing to it. What do you think?
@sjrl I have been thinking about whether keeping
_infer_header_levelsas a method makes sense for this. it's only useful in certain cases, the algorithm does not perfectly recreate document structure in all cases and it is a distinct concern from the component as a whole. do you think it makes sense to keep it in or move it to a more experimental / internal repo instead?Good question! It does sound different from what our other splitters do and almost fits better into the
DocumentCleanerabstraction that we have. If you find that you don't use it often and/or that it doesn't always work well I could see it as a separate component that we could add to https://github.com/deepset-ai/haystack-experimental first so we can easily make breaking changes to it and gather feedback before fully committing to it. What do you think?
I feel that this would be a more appropriate approach - also because it would improve the separation of concerns by component. I suggest I:
- remove the header-level inference functionality from this component
- re-implement it as its own component and put up a PR in haystack-experimental
@OGuggenbuehl that sounds good!
@sjrl done! MarkdownHeaderLevelsInferrer now lives here and is entirely removed from this PR
@OGuggenbuehl is attempting to deploy a commit to the deepset Team on Vercel.
A member of the Team first needs to authorize it.