onyx
Web Connector Miscount on GitHub Repository Documents
I've encountered an issue with the Web connector when adding the following GitHub repository URL: https://github.com/fufesou/RustDeskIddDriver. After I added it, Danswer reported more than 3000 documents. I stopped the indexing run midway, suspecting an anomaly, since the repository does not contain nearly that many files:
Screenshot
Conversely, another Web connector set up with https://github.com/Microsoft/DMF identified only 132 documents, which appears to be accurate:
Screenshot
Could there be a bug causing the Web connector to incorrectly parse and count documents from certain GitHub repositories? I believe this warrants investigation to ensure accuracy in document retrieval.
Will investigate and get back to you shortly
Trying to pull in a repo like this will also pull in a lot of garbage. There are a HUGE number of URLs like: https://github.com/fufesou/RustDeskIddDriver/forks?include=active%2Carchived%2Cnetwork%2Cstarred&page=1&period=1y&sort_by=stargazer_counts
You could try using https://github.com/fufesou/RustDeskIddDriver/blob/main as the base URL, which would prevent that.
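To illustrate why the narrower base helps, here is a minimal sketch of a prefix-based scope check. It assumes the crawler only follows links that sit on the same host and under the path of the base URL. The function name `in_scope` and the sample README link are hypothetical, and this is not the Web connector's actual code.

```python
from urllib.parse import urlsplit


def in_scope(url: str, base_url: str) -> bool:
    """Return True if url is on the same host as base_url and under its path.

    Hypothetical helper for illustration; not taken from the connector.
    """
    u, b = urlsplit(url), urlsplit(base_url)
    if u.netloc.lower() != b.netloc.lower():
        return False
    base_path = b.path.rstrip("/")
    # Match the base path itself or anything below it; the trailing "/"
    # stops ".../blob/main" from matching ".../blob/mainline".
    return u.path == base_path or u.path.startswith(base_path + "/")


repo = "https://github.com/fufesou/RustDeskIddDriver"
links = [
    repo + "/blob/main",
    repo + "/blob/main/README.md",  # assumed file path, for illustration only
    repo + "/forks?include=active%2Carchived%2Cnetwork%2Cstarred"
    "&page=1&period=1y&sort_by=stargazer_counts",
]

for base in (repo, repo + "/blob/main"):
    kept = [link for link in links if in_scope(link, base)]
    print(f"{base}: {len(kept)} of {len(links)} links in scope")
```

With the repository root as the base, the forks listing and every query-string variation of it count as in scope. With `/blob/main` as the base, only pages under that path remain.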
But even then, I'm not sure the code search quality will be that high. We're planning to build code search as its own feature in the future, and it will be much better than indexing pages on GitHub like this.
Thank you for the prompt response and the suggestion to use a more specific base URL to avoid pulling in extraneous data. I've noticed that the issue of pulling in unwanted URLs like fork pages does not occur with the Microsoft/DMF repository, which similarly has accessible URLs such as https://github.com/Microsoft/DMF/forks.
Regardless, I appreciate your team looking into this matter, and I am excited about the prospect of a dedicated code search feature. The ability to search through code within repositories will be a substantial enhancement to your service.
Closing for now. We will address pulling in code via code search sometime early this year.