
Web Connector Miscount on GitHub Repository Documents

Open alex-feel opened this issue 1 year ago • 3 comments

I've encountered an issue with the Web connector when adding the following GitHub repository URL: https://github.com/fufesou/RustDeskIddDriver. After I added it, Danswer reported more than 3000 documents. I stopped the indexing run partway through because this looked wrong: the repository does not contain anywhere near that many files:

[Screenshot: 2023-11-07_1-27-59]

Conversely, another Web connector set up with https://github.com/Microsoft/DMF identified only 132 documents, which appears to be accurate:

[Screenshot: 2023-11-07_1-27-33]

Could there be a bug causing the Web connector to incorrectly parse and count documents from certain GitHub repositories? I believe this warrants investigation to ensure accuracy in document retrieval.

alex-feel, Nov 06 '23 23:11

Will investigate and get back to you shortly

yuhongsun96, Nov 07 '23 23:11

So trying to pull in a repo like this will also pull in a lot of garbage: there are a HUGE number of URLs like https://github.com/fufesou/RustDeskIddDriver/forks?include=active%2Carchived%2Cnetwork%2Cstarred&page=1&period=1y&sort_by=stargazer_counts

You could try using https://github.com/fufesou/RustDeskIddDriver/blob/main as the base URL; that would keep the crawl away from those pages.

But even then, I'm not too sure if the code search quality will be that high. We're planning to build code search as its own feature in the future and it will be much better than indexing pages on github like this.
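To illustrate why the narrower base URL helps, here is a minimal sketch of a prefix-based scope check. It assumes the crawler only follows links on the same host whose path sits under the base URL's path; that is an assumption for illustration, not a description of Danswer's actual Web connector code. The `in_scope` helper and the `README.md` path are hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(url: str, base: str) -> bool:
    """Hypothetical scope check: same host, and path at or under the base path."""
    u, b = urlsplit(url), urlsplit(base)
    base_path = b.path.rstrip("/")
    same_host = u.netloc == b.netloc
    under_base = u.path == base_path or u.path.startswith(base_path + "/")
    return same_host and under_base


REPO = "https://github.com/fufesou/RustDeskIddDriver"
NARROW = REPO + "/blob/main"

# The fork-listing URL quoted in the comment above.
FORKS = (
    REPO
    + "/forks?include=active%2Carchived%2Cnetwork%2Cstarred"
    + "&page=1&period=1y&sort_by=stargazer_counts"
)
# Hypothetical file page, used only to show an in-scope URL.
README = REPO + "/blob/main/README.md"

print(in_scope(FORKS, REPO))     # fork listing is reachable from the repo root
print(in_scope(FORKS, NARROW))   # but falls outside /blob/main
print(in_scope(README, NARROW))  # file pages stay in scope
```

Under that assumption, the repository root admits every fork, filter, and pagination URL, while `/blob/main` admits only file pages.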

yuhongsun96, Nov 08 '23 05:11

Thank you for the prompt response and the suggestion to use a more specific base URL to avoid pulling in extraneous data. I've noticed that the issue of pulling in unwanted URLs like fork pages does not occur with the Microsoft/DMF repository, which similarly has accessible URLs such as https://github.com/Microsoft/DMF/forks.

Regardless, I appreciate your team looking into this matter, and I am excited about the prospect of a dedicated code search feature. The ability to search through code within repositories will be a substantial enhancement to your service.

alex-feel, Nov 08 '23 06:11

Closing for now. We will address pulling in code via code search sometime early this year.

yuhongsun96, Jan 11 '24 19:01