Added an option to fetch issues from gitlab. Made the file fetching a…

Open blazeyo opened this issue 5 months ago • 1 comments

…synchronous to improve performance. #2334

Pull Request Type

  • [x] ✨ feat
  • [ ] 🐛 fix
  • [x] ♻️ refactor
  • [ ] 💄 style
  • [ ] 🔨 chore
  • [ ] 📝 docs

Relevant Issues

connect #812 resolves #2334

What is in this change?

  • New "Fetch Issues" Checkbox: Adds an option on the GitLab connector page to fetch all project issues, including associated discussion items (such as comments, assignee changes, etc.).

  • New fetchNextPage Method: Implements a fetchNextPage method in GitlabRepoLoader to streamline the process of fetching all pages from an API endpoint in a more generic and reusable way.

  • Refactoring: Refactors the getRepoBranches and fetchFilesRecursive methods to utilize the new fetchNextPage logic.

  • Speed Improvements: Files are now fetched in parallel rather than one at a time, which improves fetch speed by an order of magnitude.
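
For illustration, the pagination and parallel-fetch ideas above could look roughly like the sketch below. This is not the code from this PR: collectAllPages, fetchFilesInParallel, fetchImpl and fetchFile are hypothetical names chosen for the example. The only GitLab behaviour it assumes is that offset-paginated endpoints accept page and per_page and return an x-next-page header that is empty on the last page.

```javascript
// Sketch only: helper and parameter names are illustrative, not the PR's actual API.

// Collect every page of an offset-paginated endpoint into one array.
// `fetchImpl` is injectable so the helper can be exercised without a network.
async function collectAllPages(baseUrl, { perPage = 100, fetchImpl = fetch } = {}) {
  const items = [];
  let page = "1";
  while (page) {
    const url = new URL(baseUrl);
    url.searchParams.set("per_page", String(perPage));
    url.searchParams.set("page", page);
    const res = await fetchImpl(url);
    if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
    items.push(...(await res.json()));
    // An empty (or missing) x-next-page header means this was the last page.
    page = res.headers.get("x-next-page");
  }
  return items;
}

// Start all file requests at once and wait for every one of them.
// `fetchFile` is a hypothetical (path) => Promise<content> function.
async function fetchFilesInParallel(paths, fetchFile) {
  return Promise.all(paths.map((path) => fetchFile(path)));
}
```

Note that Promise.all rejects as soon as any single request fails, so a real loader would probably want per-file error handling around fetchFile.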

Additional Information

There are a few areas that would benefit from further discussion:

  1. EDIT: no longer current. The issues are converted to markdown now.
  2. Page Size Configuration
     • Concurrent fetching significantly boosts performance, but it can strain system resources, particularly if GitLab is hosted on a less powerful server.
     • During testing on a GitLab instance with 8 cores and 16 GB of RAM, I fetched a repository containing 6.5k files and 1.7k issues (with up to 150 discussion items each) using 100 items per page. This worked well, but the server's 5-minute load average reached 10.
     • It might be worth making the pageSize parameter configurable so that less capable servers can use smaller page sizes (e.g., 25 items per page).
  3. Chunk Source & Repository URL
     • Currently, the generateChunkSource function does not include the repository URL in its payload. This might be necessary for the "Automatic Document Content Sync" feature, particularly for self-hosted GitLab instances.
     • If the repo URL is indeed required for this feature to work, I am happy to open an issue and submit a separate PR to address this.
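
To make the page size point more concrete, here is a minimal sketch of what a configurable page size plus a cap on in-flight requests might look like. None of this is in the PR: resolvePageSize and mapWithConcurrency are hypothetical helpers, and the upper bound of 100 reflects GitLab's documented per_page maximum for offset pagination.

```javascript
// Sketch only: hypothetical helpers, not part of this PR.

// Turn a user-supplied page size into a value between 1 and 100.
// Falls back to `fallback` when the input is missing or not a number.
function resolvePageSize(requested, fallback = 100) {
  const n = Number.parseInt(requested, 10);
  if (!Number.isFinite(n)) return fallback;
  return Math.min(100, Math.max(1, n));
}

// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in the same order as `items`.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function run() {
    // Each runner keeps pulling the next unclaimed index until none are left.
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => run()));
  return results;
}
```

With something like this, a smaller page size (e.g. 25) and a low concurrency limit could be combined on less capable servers, while the current behaviour stays the default.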

Developer Validations

  • [x] I ran yarn lint from the root of the repo & committed changes
  • [x] Relevant documentation has been updated
  • [x] I have tested my code functionality
  • [x] Docker build succeeds locally

blazeyo, Sep 20 '24 17:09