
Importing into Pilosa via pilosa binary from a csv file

Open florianlechner opened this issue 5 years ago • 12 comments

For bugs, please provide the following:

What's going wrong?

Pilosa crashes and logs the following stack trace:

panic: runtime error: index out of range [3] with length 3

goroutine 458 [running]:
github.com/pilosa/pilosa/roaring.(*Container).runToArray(0xc9dac1d180, 0x2)
        /root/go/src/github.com/pilosa/pilosa/roaring/roaring.go:2770 +0x279
github.com/pilosa/pilosa/roaring.(*Container).optimize(0xc9dac1d180, 0xc9dac1d160)
        /root/go/src/github.com/pilosa/pilosa/roaring/roaring.go:2284 +0xfc
github.com/pilosa/pilosa/roaring.(*Bitmap).Optimize.func1(0x5f4c13b0, 0xc9dac1d180, 0xc545540001, 0xc9dac1d160, 0x2d3b0f01)
        /root/go/src/github.com/pilosa/pilosa/roaring/roaring.go:1006 +0x2b
github.com/pilosa/pilosa/roaring.(*enumerator).Every(0xd77e565580, 0xe54c78, 0xd77e565580, 0xc006b98000)
        /root/go/src/github.com/pilosa/pilosa/roaring/btree.go:922 +0xdd
github.com/pilosa/pilosa/roaring.(*bTreeContainers).UpdateEvery(0xc5455109c0, 0xe54c78)
        /root/go/src/github.com/pilosa/pilosa/roaring/containers_btree.go:231 +0x4f
github.com/pilosa/pilosa/roaring.(*Bitmap).Optimize(0xc545512640)
        /root/go/src/github.com/pilosa/pilosa/roaring/roaring.go:1005 +0x42
github.com/pilosa/pilosa/roaring.(*Bitmap).WriteTo(0xc545512640, 0xf50840, 0xd77e565540, 0xd35a2db000, 0xd2600ea368, 0x0)
        /root/go/src/github.com/pilosa/pilosa/roaring/roaring.go:1047 +0x2b
github.com/pilosa/pilosa.unprotectedWriteToFragment(0xc54553e000, 0xc545512640, 0x0, 0x0, 0x0)
        /root/go/src/github.com/pilosa/pilosa/fragment.go:2352 +0x4ed
github.com/pilosa/pilosa.(*fragment).snapshot(0xc54553e000, 0x4079db, 0xc00088f980)
        /root/go/src/github.com/pilosa/pilosa/fragment.go:2331 +0x57
github.com/pilosa/pilosa.(*fragment).protectedSnapshot(0xc54553e000, 0xc0007b9f01, 0x0, 0x0)
        /root/go/src/github.com/pilosa/pilosa/fragment.go:2318 +0x6a
github.com/pilosa/pilosa.snapshotQueueWorker(0xc00088f980, 0x7f2c8cdbc1c8, 0xc0002b8820)
        /root/go/src/github.com/pilosa/pilosa/fragment.go:197 +0x86
created by github.com/pilosa/pilosa.newSnapshotQueue
        /root/go/src/github.com/pilosa/pilosa/fragment.go:190 +0x7f

What was expected?

No crash and no stack trace.

Steps to reproduce the behavior

pilosa import -i index -f field field_bits.csv

The source csv file can be provided. Please let me know.
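For context, `pilosa import` for a set field takes one `row,column` pair per line, as seen later in this thread with the line `99926331,3`. The file below is a hypothetical illustration of the shape of `field_bits.csv`, not its actual contents:

```
99926331,3
99926331,4
99926332,7
```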

Information about your environment (OS/architecture, CPU, RAM, cluster/solo, configuration, etc.)

Linux ORACLE 4.14.35-1902.302.2.el7uek.x86_64 #2 SMP Fri Apr 24 14:24:11 PDT 2020 x86_64 x86_64 x86_64 GNU/Linux
2x AMD EPYC 7551 32-Core Processor, 512 GB of RAM

single server instance

data-dir = "/pilosa-data/data/"
bind = "http://127.0.0.1:10101"
log-path = "/var/log/pilosa/pilosa.log"
max-map-count = 128000000
max-writes-per-request = 160000
max-file-count = 1000000

[cluster]
  coordinator = true
  replicas = 1
  long-query-time = "1m0s"

[anti-entropy]
  interval = "10m0s"

[gossip]
  port = 17000
  seeds = ["127.0.0.1:17000"]
  key = "/etc/pilosa/gossip.key32"

[metric]
  service = "prometheus"
  poll-interval = "0m10s"
  diagnostics = false

florianlechner avatar Jun 06 '20 11:06 florianlechner

@florianlechner what version (or commit hash) of pilosa are you running? Also, thanks for offering the source csv. That would definitely help if you're able to reference it here in this issue, email it ([email protected]) or any other method you prefer. Thanks!

travisturner avatar Jun 06 '20 12:06 travisturner

@travisturner Sorry for having forgotten the build version. Here is the output from Pilosa itself upon startup: Pilosa v1.4.0, build time 2020-04-24T17:39:46+0000. This was built from the git tag v1.4.0, i.e. commit a24482e44dc2acff750c0800d181e0294eb36fa7. It includes one manual patch to be able to load fragments > 4 GB: a backport of commit 827923e6163df633eeb9878732de7674df5acab4.

I'll send you the sample file via e-mail.

florianlechner avatar Jun 06 '20 12:06 florianlechner

@florianlechner thanks for the sample file. So far I've tried importing into v1.4.0 as well as v1.4.0 with 827923e6163df633eeb9878732de7674df5acab4 cherry-picked, and neither of those reproduce the error for me. (So far, I have only tried locally on MacOS, but I will try linux next). Even with the cherry-pick, my roaring/roaring.go doesn't seem to line up with that stack trace you provided. Is it possible that there's still a difference somewhere? Is there any way you could provide the actual code you're building from (even if it's just the roaring files) so that I can try to reproduce what you're seeing?

travisturner avatar Jun 06 '20 13:06 travisturner

@travisturner thanks for looking into it so quickly. Sorry for not giving the full precondition. I just learned myself, by loading the file into an empty Pilosa server, that the file alone cannot trigger the error; it imported without any problem. So I conclude that the error is triggered by the combination of the information already in the fragment and the bits being set. I could provide you with the fragment file as well, but it is 14 GB in size.

florianlechner avatar Jun 06 '20 14:06 florianlechner

I'm happy to take the fragment file and try it out. Do you have it somewhere that I can download it? Or do you need me to provide somewhere for you to send it?

travisturner avatar Jun 06 '20 14:06 travisturner

@florianlechner thanks for sending that fragment file, but so far I haven't been able to reproduce that panic. Unfortunately, that stack trace indicates that it's happening in the optimize (runToArray()) code, which is going to require that the container is in the right state in order to trigger that panic.

Also, I'm still not exactly sure which line of code is causing the panic. Your stack trace indicates a slice reference at line 2770, but this is what my code looks like after cherry-picking 827923e6163df633eeb9878732de7674df5acab4 into v1.4.0:

2768     runs := c.runs()
2769
2770     array := make([]uint16, c.N())
2771     n := int32(0)
2772     for _, r := range runs {
2773         for v := int(r.start); v <= int(r.last); v++ {
2774             array[n] = uint16(v)
2775             n++
2776         }
2777     }

There's no slice reference anywhere near that line in my file. Are you able to share your roaring.go file?

And finally, I noticed that all of the values that I spot-checked in the import file are already set in the fragment file you provided, so I'm wondering if the import is not actually changing anything in that fragment (it will take me a little more time to confirm that).

travisturner avatar Jun 08 '20 02:06 travisturner

Just following up on this, it seems that the problem is likely happening on line 2774 in my example code above, which suggests that c.N() is returning a value that is lower than it should be. So, we'll need to figure out how N can get out of sync with the actual count that is based on the runs.
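To make the failure mode above concrete, here is a simplified, self-contained sketch (not Pilosa's actual code; `interval16` and `runToArray` mirror the shapes in `roaring.go` but are trimmed down) showing how a stale cached count produces exactly this panic: the array is sized from the cached `N`, but the loop writes one element per bit actually present in the runs.

```go
package main

import "fmt"

// interval16 mirrors roaring's run representation: an inclusive [start, last] range.
type interval16 struct {
	start, last uint16
}

// runCardinality recomputes the true number of bits from the runs themselves,
// which is what the cached N should always equal.
func runCardinality(runs []interval16) int {
	n := 0
	for _, r := range runs {
		n += int(r.last) - int(r.start) + 1
	}
	return n
}

// runToArray mimics the conversion in roaring.go: it trusts the cached count n
// instead of recomputing it from the runs, and panics when n undercounts them.
func runToArray(runs []interval16, n int) (array []uint16, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%v", r)
		}
	}()
	array = make([]uint16, n)
	i := 0
	for _, r := range runs {
		for v := int(r.start); v <= int(r.last); v++ {
			array[i] = uint16(v) // out-of-range write when n is too small
			i++
		}
	}
	return array, nil
}

func main() {
	runs := []interval16{{3, 3}, {6928, 22201}}
	fmt.Println("true cardinality:", runCardinality(runs))
	if _, err := runToArray(runs, 3); err != nil {
		// Reproduces the reported message: index out of range [3] with length 3
		fmt.Println("stale N=3 ->", err)
	}
}
```

With the container described later in this thread, `runToArray` writes the fourth value into a slice of length 3, which is the "index out of range [3] with length 3" from the original report.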

travisturner avatar Jun 08 '20 17:06 travisturner

@travisturner thanks for looking into it. I have also set up a new server with Pilosa and the fragment I shared with you. I can reproduce the bug repeatedly by splitting the source csv file in half, until I reach a point where I can no longer reproduce it. The most reliable way for me to reproduce it is with a 10 million line CSV file. I think it is easiest if I e-mail you a link to the csv file.

florianlechner avatar Jun 08 '20 18:06 florianlechner

Yeah, if you can send me that and describe what I need to do to reproduce it, then we should be able to track down the cause.

travisturner avatar Jun 08 '20 18:06 travisturner

OK, I was finally able to reproduce the panic somewhere around 40M lines into the csv file. Thanks again for sending everything. I'll start working to narrow down the cause.

travisturner avatar Jun 09 '20 03:06 travisturner

@florianlechner I have narrowed it down a bit. There is a problematic container in that fragment file at key 1598821296 which looks like this:

c.N: 2
c.typ(): 3
runs: [{6928 22201}]

type 3 is a run container, and you can see the single run value there. The problem, I think, is that the value of N should not be 2, it should be 15274. I'm confirming with the team here, but if that's the case, we'll need to figure out how that value got set to 2 (or why it didn't get updated to 15274).

Just to put more info here, that key represents row 99926331, so the value in the import file that is triggering the panic is:

99926331,3

When the optimizer runs, it's seeing that N=3 (because it was 2, and this import line added 1), so it's trying to convert the runs to an array of 3 values. But there are more than 3 values represented by runs [{3 3}, {6928 22201}]. And really, it shouldn't even be trying to do this conversion because that container should remain a run container.
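As a sanity check on the numbers above, here is a small sketch (assumptions: `interval16` stands in for roaring's run type, and `arrayMaxSize = 4096` is the usual roaring threshold below which a container is stored as a sorted array; the real threshold name in Pilosa's code may differ):

```go
package main

import "fmt"

// interval16 stands in for roaring's inclusive [start, last] run.
type interval16 struct {
	start, last uint16
}

const arrayMaxSize = 4096 // typical roaring cutoff for array containers

// cardinality recomputes the bit count directly from the runs.
func cardinality(runs []interval16) int {
	n := 0
	for _, r := range runs {
		n += int(r.last) - int(r.start) + 1
	}
	return n
}

// bestType picks the representation the optimizer should choose when N is
// correct: small cardinalities become arrays, large ones do not.
func bestType(runs []interval16) string {
	if cardinality(runs) <= arrayMaxSize {
		return "array"
	}
	return "run"
}

func main() {
	before := []interval16{{6928, 22201}}          // container as found in the fragment
	after := []interval16{{3, 3}, {6928, 22201}}   // after importing 99926331,3
	fmt.Println(cardinality(before)) // 15274: the value N should have held, not 2
	fmt.Println(cardinality(after))  // 15275 after the import adds bit 3
	fmt.Println(bestType(after))     // far too many values to convert to an array
}
```

This matches the analysis: with a correct `N` of 15275 (well above 4096), the optimizer would never attempt the run-to-array conversion, so the panic is a downstream symptom of the undercounted `N`.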

I'll dig in more on this side, but if you have any information that could help us figure out how that fragment file got into the state that it did (or if you're able to help us recreate it from raw data), that would probably be the next step.

travisturner avatar Jun 10 '20 04:06 travisturner

That analysis seems right to me, so yeah, the question then becomes why N is recorded wrong there.

seebs avatar Jun 10 '20 16:06 seebs