Index is apparently limited to 4 GB
:wave: Hound developers!
I am trying to index a pretty large repo (144 GB: all current sources of openSUSE), and unsurprisingly the index turns out to be larger than 4 GB, so I hit this fatal message:
https://github.com/hound-search/hound/blob/e3b1b43eb872e47af1de1b0e15d0ec3ac5c51dc4/codesearch/index/write.go#L561
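For context, here is a minimal sketch of the kind of overflow guard that produces this message; it is illustrative, not the exact code in write.go:

```go
package index

import "log"

// Illustrative only: offsets are stored as uint32, so any value that
// does not fit in 32 bits aborts indexing with a fatal error. This
// mirrors the kind of guard behind the linked message, not the exact code.
func validUint32(v int64) uint32 {
	if int64(uint32(v)) != v {
		log.Fatalf("index is larger than 4GB")
	}
	return uint32(v)
}
```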
Would it be possible / how hard would it be to support larger indexes?
I only had a brief look at `read.go`, and it seems to me that 32-bit offsets are part of the index file format, so changing that would require re-indexing, converting, or supporting two file formats. Is that correct?
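For illustration, a minimal sketch of why the 32-bit width is baked into the file format, assuming the 4-byte big-endian offset encoding codesearch uses (the layout here is simplified):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Offsets in the index are serialized as 4-byte big-endian values,
// so 32 bits is baked into the on-disk layout: widening to 64 bits
// changes the byte layout and breaks old readers.
func readOffset(data []byte, at int) uint32 {
	return binary.BigEndian.Uint32(data[at : at+4])
}

func main() {
	// A tiny fake index fragment containing a single offset field.
	data := []byte{0x00, 0x00, 0x10, 0x00}
	fmt.Println(readOffset(data, 0)) // 4096
}
```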
Thanks for all your efforts on Hound!
Oh yikes, I'm sorry you're running into that. It does look like this would involve supporting / moving to a 64-bit-based index file. That's not work we have slated, but I think a PR would be appreciated. We're actively running it on what I thought was a large repository, but it looks like the repo itself is only about 8 gigs.
This seems to be an important issue.
@rfan-debug it's likely something we'll have to do at some point. Are you interested in tackling it?
I think giving it a fix is not difficult. However, I am not sure how to test any change reliably; it seems that we don't have sufficient integration tests.
I think that's part of what makes this issue tricky. If you're willing to write unit or integration tests, I'd definitely welcome that as well.
I think the unit tests are sufficient for the current `codesearch`, but we lack real integration tests.
I skimmed over the codesearch code and found that the root cause of the 4GB limit is the data type `uint32`, which is used everywhere. I need some time to check all the places where `uint32` is used for indexing and replace it with `uint64`. Certainly, we also need to add some functionality for bit operations on 64-bit data types.
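To make the scope concrete, here is a hedged sketch of the kind of change involved; `writeUint32`/`writeUint64` are hypothetical helpers, not the actual functions in write.go:

```go
package index

import (
	"bufio"
	"encoding/binary"
)

// Current style: offsets truncated to 4 bytes, hence the 4GB ceiling.
func writeUint32(w *bufio.Writer, x uint32) {
	var buf [4]byte
	binary.BigEndian.PutUint32(buf[:], x)
	w.Write(buf[:])
}

// 64-bit style: 8-byte offsets remove the ceiling but change the
// file format, so readers must be updated (or both formats supported).
func writeUint64(w *bufio.Writer, x uint64) {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], x)
	w.Write(buf[:])
}
```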
Now I think a good way to build up the integration test set is:
- Use the current 32-bit code to build a code search system on a codebase (e.g. hound itself).
- Add 100 example queries and record their results.
- Migrate the data types in `codesearch` from 32-bit to 64-bit.
- Verify the results of the 100 example queries on the new system.
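A rough sketch of what the record-and-verify steps could look like as a golden-file Go test; `QueryIndex` and the testdata paths are hypothetical placeholders for the real search entry point:

```go
package index_test

import (
	"bufio"
	"os"
	"reflect"
	"strings"
	"testing"
)

// QueryIndex is a hypothetical stand-in for the real search entry point;
// wire it to the actual codesearch API when adopting this test.
var QueryIndex = func(q string) []string { return nil }

func TestGoldenQueries(t *testing.T) {
	f, err := os.Open("testdata/queries.txt")
	if err != nil {
		t.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		q := scanner.Text()
		got := QueryIndex(q)
		// Golden results recorded with the 32-bit build, one file per
		// query (assumes query strings are safe as file names).
		raw, err := os.ReadFile("testdata/golden/" + q + ".txt")
		if err != nil {
			t.Fatalf("missing golden file for %q: %v", q, err)
		}
		want := strings.Split(strings.TrimSpace(string(raw)), "\n")
		if !reflect.DeepEqual(got, want) {
			t.Errorf("query %q: got %v, want %v", q, got, want)
		}
	}
	if err := scanner.Err(); err != nil {
		t.Fatal(err)
	}
}
```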
I gave it a shot because I also thought that it would be straightforward, but it's more difficult than expected.
The biggest hurdle is that the index size is tightly bound to the maximum size of an array/slice. So a 64-bit-sized index couldn't directly be mapped to a `[]byte`, because the maximum size of an array/slice is `MaxInt32`. And with that, the quite complex operations on slices need a migration.
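To illustrate the hurdle, a minimal sketch of the kind of indirection a 64-bit index would need, assuming the data is split into chunks instead of one flat `[]byte` (all names here are invented):

```go
package index

// chunkedData is a hypothetical replacement for a single flat []byte:
// it splits the index into fixed-size chunks so offsets can be 64-bit
// even where a single slice could not hold the whole file.
const chunkSize = 1 << 30 // 1 GiB per chunk

type chunkedData struct {
	chunks [][]byte // each at most chunkSize bytes
}

// at returns the byte at a 64-bit offset, translating it into a
// (chunk, offset-within-chunk) pair.
func (c *chunkedData) at(off uint64) byte {
	return c.chunks[off/chunkSize][off%chunkSize]
}

// slice copies n bytes starting at off, possibly spanning chunks.
// Every slice operation in the existing code would need a similar
// rewrite, which is why the migration is more invasive than it looks.
func (c *chunkedData) slice(off uint64, n int) []byte {
	out := make([]byte, n)
	for i := 0; i < n; i++ {
		out[i] = c.at(off + uint64(i))
	}
	return out
}
```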
I think a better approach would be to support different backend implementations for the `Index` type. E.g. I could imagine that an implementation with an SQLite or bbolt backend would be quite easy and would automatically support very large index files.
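For example, a hedged sketch of what such a pluggable backend boundary might look like; the interface and method names are invented, not an existing hound API:

```go
package index

// IndexBackend is a hypothetical abstraction: the query layer talks
// only to this interface, so the storage behind it can be the current
// mmap'd file or a key-value store like bbolt/SQLite, which would not
// be subject to the 4GB slice limit.
type IndexBackend interface {
	// PostingList returns the IDs of files containing the given trigram.
	PostingList(trigram uint32) ([]uint32, error)
	// Name returns the path of the file with the given ID.
	Name(fileID uint32) (string, error)
	Close() error
}
```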