commons-lang icon indicating copy to clipboard operation
commons-lang copied to clipboard

[LANG-1576] refine StringUtils.chomp

Open XenoAmess opened this issue 5 years ago • 18 comments

as title.

XenoAmess avatar Jun 29 '20 03:06 XenoAmess

Coverage Status

Coverage increased (+0.0003%) to 94.958% when pulling 46350d4300c33f1821ea7d7cc561e4e348233736 on xenoamess-fork:refine_chomp into 8d35d66ba9fe71e9f7bc853425e4bce13c961283 on apache:master.

coveralls avatar Jun 29 '20 05:06 coveralls

If I understand the benchmark correctly, the New method is approx 3% slower for Test1 and about 2% faster for Test2. Does not seem like a huge gain.

sebbASF avatar Jul 26 '20 17:07 sebbASF

Does not seem like a huge gain.

Yes, it isn't.

XenoAmess avatar Jul 26 '20 17:07 XenoAmess

The JMH results are very long and obscure the thread of the PR.

Is it possible to post just the summary inline, with a link to the full results in an attachment? That would make it much easier to follow the PR.

sebbASF avatar Jul 26 '20 23:07 sebbASF

The JMH results are very long and obscure the thread of the PR.

Is it possible to post just the summary inline, with a link to the full results in an attachment? That would make it much easier to follow the PR.

@sebbASF Yes, you are right. I will try ubuntu pastebin.

XenoAmess avatar Jul 28 '20 10:07 XenoAmess

As a result: I re-refine the codes and redone the jmh test. Sorry for the delay because I was quite busy yesterday, and the test takes too long time(several hours on my pc.) full jmh result at: https://pastebin.ubuntu.com/p/wbsqtzrkyC/ In short:

Benchmark                            Mode  Cnt      Score     Error  Units
StringUtilsChompTest.test1New        avgt   25      2.368 ?  0.011  ns/op

StringUtilsChompTest.test1Old        avgt   25      2.434 ?  0.102  ns/op

StringUtilsChompTest.test2New        avgt   25      2.364 ?  0.010  ns/op

StringUtilsChompTest.test2Old        avgt   25      2.368 ?  0.013  ns/op

StringUtilsChompTest.test3New        avgt   25      2.355 ?  0.007  ns/op

StringUtilsChompTest.test3Old        avgt   25      2.556 ?  0.035  ns/op

StringUtilsChompTest.test4New        avgt   25      6.352 ?  0.149  ns/op

StringUtilsChompTest.test4Old        avgt   25      6.494 ?  0.363  ns/op

StringUtilsChompTest.test5New        avgt   25      6.272 ?  0.086  ns/op

StringUtilsChompTest.test5Old        avgt   25      6.248 ?  0.035  ns/op

StringUtilsChompTest.test6New        avgt   25      6.304 ?  0.079  ns/op

StringUtilsChompTest.test6Old        avgt   25      6.298 ?  0.061  ns/op

StringUtilsChompTest.testStringsNew  avgt   25  72976.756 ?304.822  ns/op

StringUtilsChompTest.testStringsOld  avgt   25  78290.023 ?657.792  ns/op

The test shows it is almost even speed when ends with \r or \n, but 6.7% faster when ends with no \r nor \n,which I think is the most cases we use the function. If I still done something stupidly wrong just tell me, will try to fix it when have any time. Thanks.

XenoAmess avatar Jul 28 '20 10:07 XenoAmess

performance tests refined. thanks for help from @aherbert full test result at : https://pastebin.ubuntu.com/p/mC3wTgsCKT/ in short,

Benchmark                                       Mode  Cnt      Score       Error  Units
StringUtilsChompTest.test10_Random_Strings_New  avgt    5  67574.592 ? 9156.972  ns/op

StringUtilsChompTest.test10_Random_Strings_Old  avgt    5  76213.928 ?12127.719  ns/op

StringUtilsChompTest.test1_Empty_New            avgt    5      2.375 ?    0.087  ns/op

StringUtilsChompTest.test1_Empty_Old            avgt    5      2.387 ?    0.085  ns/op

StringUtilsChompTest.test2_No_Chomp_New         avgt    5      2.370 ?    0.120  ns/op

StringUtilsChompTest.test2_No_Chomp_Old         avgt    5      2.386 ?    0.104  ns/op

StringUtilsChompTest.test3_No_Chomp_New         avgt    5      2.419 ?    0.096  ns/op

StringUtilsChompTest.test3_No_Chomp_Old         avgt    5      2.377 ?    0.162  ns/op

StringUtilsChompTest.test4_R_New                avgt    5      2.402 ?    0.106  ns/op

StringUtilsChompTest.test4_R_Old                avgt    5      2.361 ?    0.023  ns/op

StringUtilsChompTest.test5_N_New                avgt    5      2.357 ?    0.005  ns/op

StringUtilsChompTest.test5_N_Old                avgt    5      2.397 ?    0.167  ns/op

StringUtilsChompTest.test6_R_N_New              avgt    5      6.013 ?    0.245  ns/op

StringUtilsChompTest.test6_R_N_Old              avgt    5      6.133 ?    0.788  ns/op

StringUtilsChompTest.test7_a_N_New              avgt    5      6.407 ?    0.343  ns/op

StringUtilsChompTest.test7_a_N_Old              avgt    5      6.717 ?    1.407  ns/op

StringUtilsChompTest.test8_a_N_New              avgt    5      6.678 ?    0.752  ns/op

StringUtilsChompTest.test8_a_N_Old              avgt    5      6.580 ?    0.208  ns/op

StringUtilsChompTest.test9_a_R_N_New            avgt    5      6.498 ?    0.672  ns/op

StringUtilsChompTest.test9_a_R_N_Old            avgt    5      6.770 ?    2.015  ns/op

XenoAmess avatar Jul 28 '20 12:07 XenoAmess

@aherbert Hi. I changed the jmh codes according to your suggestions, and re-run performance test. full test at: https://pastebin.ubuntu.com/p/cqNKS35zSP/ in short:

Benchmark                                          (data)  (name)  Mode  Cnt      Score       Error  Units
StringUtilsChompTest.singleString                    NULL     old  avgt    5      2.737 ?    0.166  ns/op

StringUtilsChompTest.singleString                    NULL     new  avgt    5      2.789 ?    0.213  ns/op

StringUtilsChompTest.singleString                   CHAR0     old  avgt    5      3.227 ?    0.459  ns/op

StringUtilsChompTest.singleString                   CHAR0     new  avgt    5      3.168 ?    0.233  ns/op

StringUtilsChompTest.singleString                CHAR0_CR     old  avgt    5      3.706 ?    0.035  ns/op

StringUtilsChompTest.singleString                CHAR0_CR     new  avgt    5      3.674 ?    0.480  ns/op

StringUtilsChompTest.singleString                CHAR0_LF     old  avgt    5      3.656 ?    0.410  ns/op

StringUtilsChompTest.singleString                CHAR0_LF     new  avgt    5      3.682 ?    0.525  ns/op

StringUtilsChompTest.singleString             CHAR0_CR_LF     old  avgt    5      9.538 ?    1.558  ns/op

StringUtilsChompTest.singleString             CHAR0_CR_LF     new  avgt    5      9.359 ?    0.499  ns/op

StringUtilsChompTest.singleString                   CHAR1     old  avgt    5      3.880 ?    0.531  ns/op

StringUtilsChompTest.singleString                   CHAR1     new  avgt    5      3.803 ?    0.451  ns/op

StringUtilsChompTest.singleString                CHAR1_CR     old  avgt    5     17.378 ?    2.431  ns/op

StringUtilsChompTest.singleString                CHAR1_CR     new  avgt    5     21.752 ?   23.793  ns/op

StringUtilsChompTest.singleString                CHAR1_LF     old  avgt    5     19.025 ?    4.690  ns/op

StringUtilsChompTest.singleString                CHAR1_LF     new  avgt    5     18.087 ?    2.450  ns/op

StringUtilsChompTest.singleString             CHAR1_CR_LF     old  avgt    5     17.940 ?    1.500  ns/op

StringUtilsChompTest.singleString             CHAR1_CR_LF     new  avgt    5     18.130 ?    1.195  ns/op

StringUtilsChompTest.singleString                   CHAR2     old  avgt    5      4.216 ?    1.192  ns/op

StringUtilsChompTest.singleString                   CHAR2     new  avgt    5      4.341 ?    0.310  ns/op

StringUtilsChompTest.singleString                CHAR2_CR     old  avgt    5     17.308 ?    2.704  ns/op

StringUtilsChompTest.singleString                CHAR2_CR     new  avgt    5     18.087 ?    2.312  ns/op

StringUtilsChompTest.singleString                CHAR2_LF     old  avgt    5     18.527 ?    2.699  ns/op

StringUtilsChompTest.singleString                CHAR2_LF     new  avgt    5     18.200 ?    1.919  ns/op

StringUtilsChompTest.singleString             CHAR2_CR_LF     old  avgt    5     17.692 ?    0.991  ns/op

StringUtilsChompTest.singleString             CHAR2_CR_LF     new  avgt    5     18.285 ?    2.505  ns/op

StringUtilsChompTest.test_Random_Strings_New          N/A     N/A  avgt    5  66999.815 ?10053.205  ns/op

StringUtilsChompTest.test_Random_Strings_Old          N/A     N/A  avgt    5  76290.287 ?15986.436  ns/op

XenoAmess avatar Aug 08 '20 04:08 XenoAmess

Hi. I re-changed some codes and run a performance test. In short,

Benchmark                                          (data)  (name)  Mode  Cnt      Score      Error  Units
StringUtilsChompTest.singleString                    NULL     old  avgt    5      2.751 ?   0.091  ns/op

StringUtilsChompTest.singleString                    NULL     new  avgt    5      2.767 ?   0.182  ns/op

StringUtilsChompTest.singleString                   CHAR0     old  avgt    5      3.687 ?   2.040  ns/op

StringUtilsChompTest.singleString                   CHAR0     new  avgt    5      3.221 ?   0.070  ns/op

StringUtilsChompTest.singleString                CHAR0_CR     old  avgt    5      3.757 ?   0.249  ns/op

StringUtilsChompTest.singleString                CHAR0_CR     new  avgt    5      3.718 ?   0.171  ns/op

StringUtilsChompTest.singleString                CHAR0_LF     old  avgt    5      3.780 ?   0.362  ns/op

StringUtilsChompTest.singleString                CHAR0_LF     new  avgt    5      3.829 ?   0.456  ns/op

StringUtilsChompTest.singleString             CHAR0_CR_LF     old  avgt    5      9.466 ?   0.550  ns/op

StringUtilsChompTest.singleString             CHAR0_CR_LF     new  avgt    5      9.506 ?   0.441  ns/op

StringUtilsChompTest.singleString                   CHAR1     old  avgt    5      4.035 ?   1.104  ns/op

StringUtilsChompTest.singleString                   CHAR1     new  avgt    5      3.780 ?   0.239  ns/op

StringUtilsChompTest.singleString                CHAR1_CR     old  avgt    5     18.910 ?   2.074  ns/op

StringUtilsChompTest.singleString                CHAR1_CR     new  avgt    5     19.492 ?   1.942  ns/op

StringUtilsChompTest.singleString                CHAR1_LF     old  avgt    5     19.429 ?   1.860  ns/op

StringUtilsChompTest.singleString                CHAR1_LF     new  avgt    5     22.080 ?   6.952  ns/op

StringUtilsChompTest.singleString             CHAR1_CR_LF     old  avgt    5     22.026 ?   2.968  ns/op

StringUtilsChompTest.singleString             CHAR1_CR_LF     new  avgt    5     18.868 ?   1.075  ns/op

StringUtilsChompTest.singleString                   CHAR2     old  avgt    5      4.068 ?   0.168  ns/op

StringUtilsChompTest.singleString                   CHAR2     new  avgt    5      3.896 ?   0.361  ns/op

StringUtilsChompTest.singleString                CHAR2_CR     old  avgt    5     17.723 ?   0.809  ns/op

StringUtilsChompTest.singleString                CHAR2_CR     new  avgt    5     19.235 ?   0.502  ns/op

StringUtilsChompTest.singleString                CHAR2_LF     old  avgt    5     21.363 ?   1.176  ns/op

StringUtilsChompTest.singleString                CHAR2_LF     new  avgt    5     21.182 ?   1.029  ns/op

StringUtilsChompTest.singleString             CHAR2_CR_LF     old  avgt    5     21.532 ?   1.032  ns/op

StringUtilsChompTest.singleString             CHAR2_CR_LF     new  avgt    5     21.420 ?   2.569  ns/op

StringUtilsChompTest.test_Random_Strings_New          N/A     N/A  avgt    5  70565.068 ?2096.854  ns/op

StringUtilsChompTest.test_Random_Strings_Old          N/A     N/A  avgt    5  77950.673 ?2115.093  ns/op

detailed report at https://pastebin.ubuntu.com/p/bTNkPGpSpV/

@sebbASF yes, the performance test shows the original condition struction is not bad as I thought at beginning, and revert to original condition struction will not harm performance. so according to the "minimal change" principle, I changed the codes to looks more likely to original.

XenoAmess avatar Aug 10 '20 04:08 XenoAmess

Thanks - much easier to see what has been changed now. i.e. cache the string length, and don't use substring unless it is needed.

I don't think the RandomStrings test is a fair benchmark. The behaviour of the method depends only on the last one or two characters (and the length). AFAICT the strings are not random and don't have a mix of line-endings. So I think the test only measures the efficiency of String.substring where nothing needs to be dropped.

sebbASF avatar Aug 10 '20 10:08 sebbASF

@sebbASF

I don't think the RandomStrings test is a fair benchmark.

Actually it is fair. See the string array generation function of that test.

XenoAmess avatar Aug 10 '20 14:08 XenoAmess

Sorry, I see now that the strings do have a mix of CR and LF endings (or neither). However, the strings are all of length 2. The strings are not representative of the likely use cases.

As to the change itself, it looks fine, but IMO the benchmark needs some work.

sebbASF avatar Aug 11 '20 09:08 sebbASF

@sebbASF Hi.

Sorry, I see now that the strings do have a mix of CR and LF endings (or neither). However, the strings are all of length 2. The strings are not representative of the likely use cases.

The behaviour of the method depends only on the last one or two characters (and the length). And I don't think add more chars will help. If we increase the length to some normal length of our usecases (for example 6) it will be a really large test.

XenoAmess avatar Aug 15 '20 04:08 XenoAmess

I think the current huge list of input strings is more about testing functionality rather than performance.

The method only cares about 3 characters: CR, LF or something else (unless it has a bug). So the performance test needs to check those in various combinations. I think that is just 9 combinations.

It also needs to test the length, because that is used when doing the substring. The method is likely to be used with textual input so it would make sense to try with a selection of lengths. Not sure what the maximum should be, probably at least 1000, maybe considerably more. It might make sense to do these as separate tests to see if the length affects the performance.

sebbASF avatar Aug 15 '20 09:08 sebbASF

@sebbASF

I think the current huge list of input strings is more about testing functionality rather than performance.

The method only cares about 3 characters: CR, LF or something else (unless it has a bug). So the performance test needs to check those in various combinations. I think that is just 9 combinations.

Not really. More cases are:

  1. null string.
  2. string of length 0 ("")
  3. string of length 1 who ends with '\r'
  4. string of length 1 who ends with '\n'
  5. string of length 1 who ends with normal char

While 3,4,5 be actually handled by a same if, I'd prefer hanle them seperately. So do 1 and 2.

It also needs to test the length, because that is used when doing the substring. The method is likely to be used with textual input so it would make sense to try with a selection of lengths. Not sure what the maximum should be, probably at least 1000, maybe considerably more. It might make sense to do these as separate tests to see if the length affects the performance.

So you mean we should add test for some specific length of strings? For example, "a"*1024, "a"*10240" and "a"*102400"? Fine, then I will add/rerun it if you need this data, at later today.

XenoAmess avatar Aug 16 '20 13:08 XenoAmess

You are correct - I was forgetting that the characters could be missing.

As to testing the length, yes I think we do need to test for various lengths. Longer substrings may take longer to create; if that is true it should show that the new code is better.

sebbASF avatar Aug 16 '20 14:08 sebbASF

Hi. I added some more lengths of Strings to the benchmark, as suggested by @sebbASF . Thanks. full benchmark at https://pastebin.ubuntu.com/p/yPH4xnKqZd/ In short,

Benchmark                                               (data)  (name)  Mode  Cnt      Score      Error  Units
StringUtilsChompTest.singleString                         NULL     old  avgt    5      3.246 ?   0.091  ns/op

StringUtilsChompTest.singleString                         NULL     new  avgt    5      3.265 ?   0.153  ns/op

StringUtilsChompTest.singleString                        CHAR0     old  avgt    5      3.901 ?   0.251  ns/op

StringUtilsChompTest.singleString                        CHAR0     new  avgt    5      3.863 ?   0.769  ns/op

StringUtilsChompTest.singleString                     CHAR0_CR     old  avgt    5      4.393 ?   0.519  ns/op

StringUtilsChompTest.singleString                     CHAR0_CR     new  avgt    5      4.335 ?   0.393  ns/op

StringUtilsChompTest.singleString                     CHAR0_LF     old  avgt    5      4.279 ?   0.295  ns/op

StringUtilsChompTest.singleString                     CHAR0_LF     new  avgt    5      4.315 ?   0.230  ns/op

StringUtilsChompTest.singleString                  CHAR0_CR_LF     old  avgt    5     10.768 ?   0.643  ns/op

StringUtilsChompTest.singleString                  CHAR0_CR_LF     new  avgt    5      9.901 ?   0.136  ns/op

StringUtilsChompTest.singleString                        CHAR1     old  avgt    5      4.358 ?   0.257  ns/op

StringUtilsChompTest.singleString                        CHAR1     new  avgt    5      3.989 ?   0.117  ns/op

StringUtilsChompTest.singleString                     CHAR1_CR     old  avgt    5     19.181 ?   0.715  ns/op

StringUtilsChompTest.singleString                     CHAR1_CR     new  avgt    5     19.502 ?   1.590  ns/op

StringUtilsChompTest.singleString                     CHAR1_LF     old  avgt    5     20.168 ?   1.003  ns/op

StringUtilsChompTest.singleString                     CHAR1_LF     new  avgt    5     19.578 ?   0.575  ns/op

StringUtilsChompTest.singleString                  CHAR1_CR_LF     old  avgt    5     19.984 ?   1.332  ns/op

StringUtilsChompTest.singleString                  CHAR1_CR_LF     new  avgt    5     19.834 ?   1.017  ns/op

StringUtilsChompTest.singleString                        CHAR2     old  avgt    5      4.408 ?   0.098  ns/op

StringUtilsChompTest.singleString                        CHAR2     new  avgt    5      4.033 ?   0.231  ns/op

StringUtilsChompTest.singleString                     CHAR2_CR     old  avgt    5     19.056 ?   1.116  ns/op

StringUtilsChompTest.singleString                     CHAR2_CR     new  avgt    5     18.928 ?   0.903  ns/op

StringUtilsChompTest.singleString                     CHAR2_LF     old  avgt    5     21.658 ?   2.624  ns/op

StringUtilsChompTest.singleString                     CHAR2_LF     new  avgt    5     20.125 ?   1.185  ns/op

StringUtilsChompTest.singleString                  CHAR2_CR_LF     old  avgt    5     19.865 ?   0.865  ns/op

StringUtilsChompTest.singleString                  CHAR2_CR_LF     new  avgt    5     20.033 ?   1.695  ns/op

StringUtilsChompTest.singleString                     CHAR1024     old  avgt    5      4.469 ?   0.186  ns/op

StringUtilsChompTest.singleString                     CHAR1024     new  avgt    5      4.051 ?   0.146  ns/op

StringUtilsChompTest.singleString                  CHAR1024_CR     old  avgt    5    102.333 ?   1.823  ns/op

StringUtilsChompTest.singleString                  CHAR1024_CR     new  avgt    5    102.237 ?   0.913  ns/op

StringUtilsChompTest.singleString                  CHAR1024_LF     old  avgt    5    104.155 ?   2.604  ns/op

StringUtilsChompTest.singleString                  CHAR1024_LF     new  avgt    5    103.365 ?   2.392  ns/op

StringUtilsChompTest.singleString               CHAR1024_CR_LF     old  avgt    5    103.665 ?   4.241  ns/op

StringUtilsChompTest.singleString               CHAR1024_CR_LF     new  avgt    5    103.453 ?   1.482  ns/op

StringUtilsChompTest.singleString                    CHAR10240     old  avgt    5      4.772 ?   1.314  ns/op

StringUtilsChompTest.singleString                    CHAR10240     new  avgt    5      4.064 ?   0.085  ns/op

StringUtilsChompTest.singleString                 CHAR10240_CR     old  avgt    5    981.277 ?  15.790  ns/op

StringUtilsChompTest.singleString                 CHAR10240_CR     new  avgt    5    985.230 ?  13.679  ns/op

StringUtilsChompTest.singleString                 CHAR10240_LF     old  avgt    5    983.508 ?  15.570  ns/op

StringUtilsChompTest.singleString                 CHAR10240_LF     new  avgt    5   1000.211 ?  48.633  ns/op

StringUtilsChompTest.singleString              CHAR10240_CR_LF     old  avgt    5    990.806 ?   7.380  ns/op

StringUtilsChompTest.singleString              CHAR10240_CR_LF     new  avgt    5    990.001 ?   9.229  ns/op

StringUtilsChompTest.singleString                   CHAR102400     old  avgt    5      4.474 ?   0.142  ns/op

StringUtilsChompTest.singleString                   CHAR102400     new  avgt    5      4.154 ?   0.325  ns/op

StringUtilsChompTest.singleString                CHAR102400_CR     old  avgt    5  10077.261 ? 616.639  ns/op

StringUtilsChompTest.singleString                CHAR102400_CR     new  avgt    5  10002.896 ?  85.774  ns/op

StringUtilsChompTest.singleString                CHAR102400_LF     old  avgt    5  10081.275 ?  64.053  ns/op

StringUtilsChompTest.singleString                CHAR102400_LF     new  avgt    5  10085.129 ?  43.548  ns/op

StringUtilsChompTest.singleString             CHAR102400_CR_LF     old  avgt    5  10044.759 ?  68.778  ns/op

StringUtilsChompTest.singleString             CHAR102400_CR_LF     new  avgt    5  10097.769 ? 101.311  ns/op

StringUtilsChompTest.test_Random_Strings_New               N/A     N/A  avgt    5  76270.389 ?1885.347  ns/op

StringUtilsChompTest.test_Random_Strings_Old               N/A     N/A  avgt    5  82757.374 ?9869.516  ns/op

XenoAmess avatar Aug 17 '20 05:08 XenoAmess

@garydgregory rebased. please find some time to review. thanks.

XenoAmess avatar Feb 21 '21 16:02 XenoAmess