
Index is apparently limited to 4 GB

Open moio opened this issue 4 years ago • 7 comments

:wave: Hound developers!

I am trying to index a pretty large repository (144 GB, all current sources of openSUSE), and unsurprisingly the index turns out to be larger than 4 GB, so I hit this fatal message:

https://github.com/hound-search/hound/blob/e3b1b43eb872e47af1de1b0e15d0ec3ac5c51dc4/codesearch/index/write.go#L561

Would it be possible / how hard would it be to support larger indexes?

I only had a brief look at read.go, and it seems to me that 32-bit offsets are part of the index file format, so changing that would require re-indexing, converting, or supporting two file formats. Is that correct?
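To illustrate why the on-disk format itself is the blocker (this is a sketch, not Hound's actual code; `putOffset32` is a hypothetical helper): any offset at or beyond 4 GiB cannot survive a round trip through a fixed 32-bit field, because the high bits are silently dropped.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// putOffset32 mimics how a fixed-width 32-bit offset field behaves:
// the high bits of any offset >= 1<<32 are silently dropped.
func putOffset32(off uint64) uint32 {
	var buf [4]byte
	binary.BigEndian.PutUint32(buf[:], uint32(off)) // truncating conversion
	return binary.BigEndian.Uint32(buf[:])
}

func main() {
	small := uint64(3_000_000_000) // fits in 32 bits
	big := uint64(5_000_000_000)   // beyond 4 GiB, does not fit

	fmt.Println(uint64(putOffset32(small)) == small) // true
	fmt.Println(uint64(putOffset32(big)) == big)     // false: value was truncated
}
```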

Thanks for all your efforts on Hound!

moio avatar Oct 01 '20 06:10 moio

Oh yikes, I'm sorry you're running into that. It does look like this would involve supporting / moving to a 64-bit-based index file. That's not work we have slated, but I think a PR would be appreciated. We're actively running it on what I thought was a large repository, but it looks like the repo itself is only about 8 gigs.

salemhilal avatar Oct 01 '20 14:10 salemhilal

This seems to be an important issue.

rfan-debug avatar Oct 27 '20 16:10 rfan-debug

@rfan-debug it's likely something we'll have to do at some point. Are you interested in tackling it?

salemhilal avatar Oct 28 '20 17:10 salemhilal

I think fixing it is not difficult. However, I am not sure how to test any code changes reliably.

It seems that we don't have sufficient integration tests.

rfan-debug avatar Oct 28 '20 17:10 rfan-debug

I think that's part of what makes this issue tricky. If you're willing to write unit or integration tests, I'd definitely welcome that as well.

salemhilal avatar Oct 29 '20 22:10 salemhilal

I think the unit tests are sufficient for the current codesearch, but we lack real integration tests.

I skimmed over the codesearch code. I found that the root cause of the 4 GB limit is the uint32 data type used everywhere. I need some time to check all the places where uint32 is used for indexing and replace it with uint64. We will certainly also need to add some bit-operation functionality for 64-bit data types.

Now I think a good way to build up the integration test set is:

  • Use the current 32-bit code to build a code search system over a codebase (e.g. Hound itself).
  • Add 100 example queries and record their results.
  • Migrate the data types in codesearch from 32-bit to 64-bit.
  • Verify the results of the 100 example queries on the new system.

rfan-debug avatar Nov 02 '20 01:11 rfan-debug

I gave it a shot because I also thought it would be straightforward, but it's more difficult than expected.

The biggest hurdle is that the index size is tightly bound to the maximum size of an array/slice. A 64-bit-sized index can't be directly mapped to a []byte, because the maximum size of an array/slice is MaxInt32. Because of that, the quite complex operations on slices would need a migration as well.
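One conceivable workaround for the single-slice ceiling described above (a sketch only, nothing like this exists in Hound) is to split the index bytes into fixed-size chunks and address them with a 64-bit offset split into (chunk, offset-within-chunk):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// chunked addresses more data than one slice can hold, by keeping
// many ordinary []byte chunks behind a 64-bit offset.
type chunked struct {
	chunkSize uint64
	chunks    [][]byte
}

func (c *chunked) at(off uint64) byte {
	return c.chunks[off/c.chunkSize][off%c.chunkSize]
}

// uint32At reads a big-endian uint32 that may straddle a chunk boundary.
func (c *chunked) uint32At(off uint64) uint32 {
	var b [4]byte
	for i := range b {
		b[i] = c.at(off + uint64(i))
	}
	return binary.BigEndian.Uint32(b[:])
}

func main() {
	// Tiny demo: 4-byte "chunks" instead of, say, 1 GiB ones.
	c := &chunked{chunkSize: 4, chunks: [][]byte{{0, 0, 0, 1}, {0, 0, 0, 2}}}

	fmt.Println(c.uint32At(0)) // 1
	fmt.Println(c.uint32At(2)) // 65536: this read spans the chunk boundary
}
```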

I think a better approach would be to support different backend implementations for the Index type. For example, I could imagine that an implementation with an SQLite or bbolt backend would be quite easy and would automatically support very large index files.
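A pluggable backend along those lines might start from an interface like this (purely illustrative; neither `IndexBackend` nor the in-memory implementation exists in Hound):

```go
package main

import "fmt"

// IndexBackend abstracts where the index bytes live, so the current
// mmap'd-[]byte store and, say, an SQLite- or bbolt-backed store
// could sit behind the same Index type.
type IndexBackend interface {
	ReadAt(p []byte, off uint64) (int, error)
	Size() uint64
	Close() error
}

// memBackend is a trivial in-memory implementation for illustration.
type memBackend struct{ data []byte }

func (m *memBackend) ReadAt(p []byte, off uint64) (int, error) {
	if off >= uint64(len(m.data)) {
		return 0, fmt.Errorf("offset %d out of range", off)
	}
	return copy(p, m.data[off:]), nil
}

func (m *memBackend) Size() uint64 { return uint64(len(m.data)) }
func (m *memBackend) Close() error { return nil }

func main() {
	var b IndexBackend = &memBackend{data: []byte("hound index")}
	buf := make([]byte, 5)
	n, _ := b.ReadAt(buf, 6)
	fmt.Println(string(buf[:n])) // "index"
}
```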

Urmeli0815 avatar Jan 01 '21 21:01 Urmeli0815