t-digest icon indicating copy to clipboard operation
t-digest copied to clipboard

Index out of bounds in MergingDigest

Open death opened this issue 6 years ago • 27 comments

Hi,

The following variant of the testSmallCountQuantile test results in a java.lang.ArrayIndexOutOfBoundsException error. Dropping the last element will have the test pass.

@Test
public void testSmallCountQuantile2() {
    List<Double> data = Lists.newArrayList(3.0, 3.1, 3.0, 3.0,
                                           3.0, 3.0, 3.0, 3.0,
                                           3.0, 3.0, 3.0, 3.0,
                                           3.5, 3.3, 3.4, 3.5,
                                           3.5, 4.0, 2.95, 3.18,
                                           180.0, 3.0, 3.0, 3.0
                                           , 3.0
                                           );
    TDigest td = new MergingDigest(1);
    for (Double datum : data) {
        td.add(datum);
    }
}

death avatar May 06 '18 18:05 death

this may be out of date; i just ran it into current master and it passes.

dariomx avatar Sep 05 '18 23:09 dariomx

Thanks for checking.

(and what a huge relief)

On Wed, Sep 5, 2018 at 4:00 PM dariomx [email protected] wrote:

this may be out of date; i just ran it into current master and it passes.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/tdunning/t-digest/issues/107#issuecomment-418908639, or mute the thread https://github.com/notifications/unsubscribe-auth/AAPSeivTS7IIUfOyFZoN4KrkAKIenbXmks5uYFeTgaJpZM4T0JWi .

tdunning avatar Sep 06 '18 00:09 tdunning

I'm not sure I'd say this is out of date, per se. The above code snippet runs OK with the current master version. However, the latest release version (AFAICT 3.2, released August 2017) will crash with these inputs, and we've seen similar failures in production systems with different inputs.

Is there a plan to release a new version of the library to maven central? It's a really nice library, so we're keen to continue using it and would prefer not to have to run our own internal fork!

rnorth avatar Nov 26 '18 15:11 rnorth

Yes.

I will be releasing a new version before long.

There are a number of improvements to be had, particularly in terms of accuracy and predictability.

Would it help you if I released a patch release before the larger release?

On Mon, Nov 26, 2018 at 3:37 PM Richard North [email protected] wrote:

I'm not sure I'd say this is out of date, per se. The above code snippet runs OK with the current master version. However, the latest release version (AFAICT 3.2, released August 2017) will crash with these inputs, and we've seen similar failures in production systems with different inputs.

Is there a plan to release a new version of the library to maven central? It's a really nice library, so we're keen to continue using it and would prefer not to have to run our own internal fork!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/tdunning/t-digest/issues/107#issuecomment-441682617, or mute the thread https://github.com/notifications/unsubscribe-auth/AAPSekutHVlNN2nRukJ7XZW_yO36Jag7ks5uzArBgaJpZM4T0JWi .

tdunning avatar Nov 26 '18 15:11 tdunning

A compression factor of 1 is pretty extreme, isn't it? I wouldn't understand the use of any value less than, say, 10 or 20.

tdunning avatar Nov 26 '18 15:11 tdunning

Thanks for such a quick response. Yes, a patch release would be greatly appreciated.

We have this occurring at a compression factor of 100 (and obviously an entirely different volume of input data points).

rnorth avatar Nov 26 '18 15:11 rnorth

Where is the exception happening?

The problem that I just looked at had to do with the bufferSize getting set too small so that the array copy failed on merging. Where is your problem appearing? Same place?

tdunning avatar Nov 26 '18 16:11 tdunning

In both cases (the repro case above and our error in production) both failed at line 302 of MergingDigest.java, which is the following (in v3.2):

        System.arraycopy(mean, 0, incomingMean, incomingCount, lastUsedCell);

Relevant part of the stack trace:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
	at java.lang.System.arraycopy(Native Method)
	at com.tdunning.math.stats.MergingDigest.merge(MergingDigest.java:302)
	at com.tdunning.math.stats.MergingDigest.mergeNewValues(MergingDigest.java:291)
	at com.tdunning.math.stats.MergingDigest.add(MergingDigest.java:202)
	at com.tdunning.math.stats.MergingDigest.add(MergingDigest.java:194)
	at com.tdunning.math.stats.AbstractTDigest.add(AbstractTDigest.java:131)

I'm afraid right now I can't give much more information about the state of the fields at the time this is occurring, aside from saying that we've seen this on one of our busiest services. I shall try and gather some more useful data tomorrow, though, if it helps.

rnorth avatar Nov 26 '18 16:11 rnorth

Is this fixed?

hangfei avatar May 14 '19 20:05 hangfei

Should be fixed in master. Let me verify.

On Tue, May 14, 2019 at 1:55 PM Carotene [email protected] wrote:

Is this fixed?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/tdunning/t-digest/issues/107?email_source=notifications&email_token=AAB5E6VF3ZB7DNQLN2QF4B3PVMRL7A5CNFSM4E6QSWRKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVMYLCI#issuecomment-492406153, or mute the thread https://github.com/notifications/unsubscribe-auth/AAB5E6USGFRC2WK4UCYYWZDPVMRL7ANCNFSM4E6QSWRA .

tdunning avatar May 14 '19 22:05 tdunning

Verified.

tdunning avatar May 14 '19 22:05 tdunning

Is this fix released and in which version?

hangfei avatar Feb 08 '20 00:02 hangfei

Related exception in 3.2:

ERROR in (issue-201-incorrect-result-column-count) (System.java:-2)
Uncaught exception, not in assertion.
expected: nil
  actual: java.lang.ArrayIndexOutOfBoundsException: null
 at java.lang.System.arraycopy (System.java:-2)
    com.tdunning.math.stats.MergingDigest.merge (MergingDigest.java:302)
    com.tdunning.math.stats.MergingDigest.mergeNewValues (MergingDigest.java:291)
    com.tdunning.math.stats.MergingDigest.quantile (MergingDigest.java:661)

This is after do a parallelized aggregation where TDigest.add was called to merge separate thread contexts. It happens only once in a while, not every time which is a bummer for a large aggregation.

cnuernber avatar Mar 24 '21 13:03 cnuernber

Chris,

It doesn't sound like you have a test case for this, but could you say which version you are using?

On Wed, Mar 24, 2021 at 6:45 AM Chris Nuernberger @.***> wrote:

Related NPE I believe:

ERROR in (issue-201-incorrect-result-column-count) (System.java:-2)Uncaught exception, not in assertion.expected: nil actual: java.lang.ArrayIndexOutOfBoundsException: null at java.lang.System.arraycopy (System.java:-2) com.tdunning.math.stats.MergingDigest.merge (MergingDigest.java:302) com.tdunning.math.stats.MergingDigest.mergeNewValues (MergingDigest.java:291) com.tdunning.math.stats.MergingDigest.quantile (MergingDigest.java:661)

This is after do a parallelized aggregation where TDigest.add was called to merge separate thread contexts. It happens only once in a while, not every time which is a bummer for a large aggregation.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/tdunning/t-digest/issues/107#issuecomment-805834394, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAB5E6V2TL2MNEHCXG43TGTTFHUHVANCNFSM4E6QSWRA .

tdunning avatar Mar 24 '21 18:03 tdunning

I have a test case that does this about 3 times out of 100 if I run it in a loop. I am using version 3.2.

I will get test with master:HEAD and see if that eliminates/changes the problem.

cnuernber avatar Mar 24 '21 19:03 cnuernber

I would love to have the test case as well. Not useful as it stands as a unit case, but very, very useful to use to collect data.

One of the pressures for a new release is that several related issues have been fixed in the main branch. (note that the master branch got renamed not long ago)

On Wed, Mar 24, 2021 at 12:06 PM Chris Nuernberger @.***> wrote:

I have a test case that does this about 3 times out of 100 if I run it in a loop. I am using version 3.2.

I will get test with master:HEAD and see if that eliminates/changes the problem.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/tdunning/t-digest/issues/107#issuecomment-806081586, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAB5E6XBFS2ZCNGAPADEGCLTFIZ25ANCNFSM4E6QSWRA .

tdunning avatar Mar 24 '21 19:03 tdunning

100% main:HEAD fixes the test. I ran it many thousands of times just now and saw zero failures.

The test case involves a Clojure datatframe library which means you would be calling Clojure in your unit tests. Are you sure you want the test case :-)? I basically to a N-core reduction across Y datasets followed by thread0_digest.add(rest-of-digests).

This library is fantastic so I am down to help you any way you like.

cnuernber avatar Mar 24 '21 19:03 cnuernber

OK.

This is yet another vote for a release.

Gonna have to do it even without the serialization stuff that is cooking.

Speaking of which, please weigh in on your thoughts.

https://docs.google.com/document/d/1CX9Colw1g58nLHfJ_IR4XeFHhUaiXGTOG0aKkHpTut4/edit#heading=h.oukzhuq9hijs

On Wed, Mar 24, 2021 at 12:25 PM Chris Nuernberger @.***> wrote:

100% main:HEAD fixes the test. I ran it many thousands of times just now and saw zero failures.

The test case involves a Clojure datatframe library which means you would be calling Clojure in your unit tests. Are you sure you want the test case :-)? I basically to a N-core reduction across Y datasets followed by thread0_digest.add(rest-of-digests).

This library is fantastic so I am down to help you any way you like.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/tdunning/t-digest/issues/107#issuecomment-806092790, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAB5E6WIXOOG24OQNGHM5ILTFI4D3ANCNFSM4E6QSWRA .

tdunning avatar Mar 24 '21 19:03 tdunning

@cnuernber I would like to take you up on your offer.

Can you look at the current main branch plus live issues and identify what you would say are blockers for a point release?

If none and if this sounds like consensus, I will start the process with what we have.

tdunning avatar Mar 24 '21 22:03 tdunning

Yes, will do.

cnuernber avatar Mar 24 '21 22:03 cnuernber

I looked through the issues and have them categorized and such. Here are my opinions- In short I think a point release is worth it now to get the fixes for mergetree.

  • I think it is worth a point release as the fixes to the merge algorithm are substantial in both stability and correctness.
  • AVLTree needs time and effort to stabilize - it does have some issues.
  • For serialization, my recommendation is to do as little effort as necessary to come up with a decently forward compatible versioned POD scheme for mergetree alone. Java serialization is often a disaster and a major security risk and once things are in simple datatypes users can serialize however they like including using Java serialization. This should cut the effort required as it sidesteps the issue of how to actually save the data and avoids unnecessary dependencies. I don't agree Avro is a good choice, parquet is a good option for longterm storage due to its default good compression and wide language support and Arrow is a good option and very well supported across languages now. There are lots of other options not considered such as dendrite; virtually an endless number of them I think. So starting with a carefully built POD (or POJO as Java people call them) I think is reasonable and if people want protobufs or flatbuffers they can auto-generate them from the POD description. For reference, the Arrow project uses flatbuffers as its base serialization format and builds their schema and record pathways off of a basic flatbuffer message type.
  • There appears to be a high-quality header-only c++ implementation that @derrickburns was working on that perhaps would be a good idea. A minimal clean C-library based on that implementation would then get you Python, R, Julia, Go, etc. with no further bindings. So that cuts your core tested language matrix to Java and C++/C.

I messaged you on linked in and we can continue offline from there or we can continue the discussion here; whichever way works for you.

cnuernber avatar Mar 25 '21 14:03 cnuernber

I will close this after the upcoming release.

tdunning avatar Apr 02 '21 03:04 tdunning

(Slight) progress note.

I have decoded the new behavior of gpg and interactions with the maven code signing (short answer, much better than before). Should be able to cut a release shortly.

tdunning avatar Apr 08 '21 19:04 tdunning

Hi @cnuernber, could you explain how you solved the bug ? I'm having the same one, really rarely also.

rnataf avatar Sep 06 '22 09:09 rnataf

Well, two ways. Honestly I think this library has been superceded by apache datasketches. What we did before I started work with datasketches was I patched the library and released it on Clojars under a different name.

If you are doing work in this area of course I have to recommend our data processing system :-).

cnuernber avatar Sep 06 '22 17:09 cnuernber

Well, two ways. Honestly I think this library has been superceded by apache datasketches. What we did before I started work with datasketches was I patched the library and released it on Clojars under a different name.

If you are doing work in this area of course I have to recommend our data processing system :-).

hi,@cnuernber, I saw that your fork just submitted two configurations, how did you solve the current problem?(https://github.com/tdunning/t-digest/compare/main...techascent:t-digest:main)

wxbty avatar Jul 06 '23 14:07 wxbty

@cnuernber Interesting to see the work on tech.ml.dataset. It frankly looks a bit like a reinvention of Arrow, but just in Clojure, at least for the columnar in-memory format work.

tdunning avatar Jul 06 '23 18:07 tdunning