v icon indicating copy to clipboard operation
v copied to clipboard

builtin: Make all string.index methods return ?int (second try)

Open txgruppi opened this issue 3 years ago • 19 comments

The string type has come index methods like (index, index_after, etc) and some return int and others return ?int.

This MR makes all of them return ?int.

The tests (v test-all) have beem updated to work with the changes but there are some tests that were skipped in my machine.


All string index methods that end with _opt return ?int with none as the value when the substring is not found and the index methods that do not end in _opt return int with -1 as the returned value when the substring is not found


This PR is the second iteration on the idea. Origin PR here: https://github.com/vlang/v/pull/13693

It was easier to dump the last changes and redo them from the latest weekly tag.


VLS PR: https://github.com/vlang/vls/pull/329

txgruppi avatar Mar 17 '22 12:03 txgruppi

@JalonSolov @yuyi98 @spytheman

I think this one will make everyone happy.

txgruppi avatar Mar 17 '22 12:03 txgruppi

I for myself am really not a fan of duplicated methods. So having second variants (those ending with _opt) of certain methods sounds bad.

Also this PR doesn't endorse safe-by-default behavior - i.e. the syntactically shorter variant shall be the one returning int | none and the syntactically longer variant shall return int.

All in all what I'd be interested in the most are the (rudimentary) benchmarks @spytheman asked for in the previous PR discussion.

dumblob avatar Mar 17 '22 22:03 dumblob

The benchmarks.

Int version: Summary for done: 10000 passed, 10000 total. Runtime: 5459 ms. Opt version: Summary for done: 10000 passed, 10000 total. Runtime: 5130 ms.

Each code was built using v -prod <source> and executed ./<binary>.

To run this I generated an html file with lorem ipsum text where every word with 5 letters and no symbol I updated to <a>. The script's job was to extract all the url and text from the html file.

First I made them println('$url $text') and compared the output with a Node.js script I created to do the same job. All 3 outputs (v int, v opt, node) are identical.

For the benchmark I commented out the lines where I defined the variables url and text and the line where I printed the test string and added the loop for the benchmark.

The index_after and index_after_opt have the same implementation with the only differences:

  1. return type int and ?int
  2. return values for not found -1 and none

You can check the code I used in https://gist.github.com/txgruppi/31663efdb380ced7f8d31f6f5f4c3d98

txgruppi avatar Mar 19 '22 04:03 txgruppi

Thanks for the benchmarks.

spytheman avatar Mar 19 '22 07:03 spytheman

I've made some simplifications to the extract-int.v and extract-opt.v files, to reduce the additional operations, that the loops perform (s.substr especially since it does allocation, b.step, b.ok), and added a Makefile to compare different modes of compilation more easily. The modified version is in: https://gist.github.com/spytheman/ef8f298c48fbc7678c1ce281385f511c

Here is the result on my machine:

0[11:01:56]delian@nemesis: (master) /v/misc/2022_03_19__01/index_int_vs_index_opt $ make clean all compare
rm -rf extract-int-tcc extract-int-gcc extract-int-clang extract-opt-tcc extract-opt-gcc extract-opt-clang extract-int-tcc-prod extract-int-gcc-prod extract-int-clang-prod extract-opt-tcc-prod extract-opt-gcc-prod extract-opt-clang-prod
v -cc tcc -o extract-int-tcc extract-int.v
v -cc gcc -o extract-int-gcc extract-int.v
v -cc clang -o extract-int-clang extract-int.v
v -cc tcc -o extract-opt-tcc extract-opt.v
v -cc gcc -o extract-opt-gcc extract-opt.v
v -cc clang -o extract-opt-clang extract-opt.v
v -prod -cc tcc -o extract-int-tcc-prod extract-int.v
Note: tcc is not recommended for -prod builds
v -prod -cc gcc -o extract-int-gcc-prod extract-int.v
v -prod -cc clang -o extract-int-clang-prod extract-int.v
v -prod -cc tcc -o extract-opt-tcc-prod extract-opt.v
Note: tcc is not recommended for -prod builds
v -prod -cc gcc -o extract-opt-gcc-prod extract-opt.v
v -prod -cc clang -o extract-opt-clang-prod extract-opt.v
hyperfine './extract-int-tcc        1000'   './extract-opt-tcc        1000'
Benchmark #1: ./extract-int-tcc        1000
  Time (mean ± σ):      2.441 s ±  0.024 s    [User: 2.435 s, System: 0.005 s]
  Range (min … max):    2.404 s …  2.487 s    10 runs
 
Benchmark #2: ./extract-opt-tcc        1000
  Time (mean ± σ):      3.142 s ±  0.030 s    [User: 3.139 s, System: 0.001 s]
  Range (min … max):    3.089 s …  3.186 s    10 runs
 
Summary
  './extract-int-tcc        1000' ran
    1.29 ± 0.02 times faster than './extract-opt-tcc        1000'
hyperfine './extract-int-gcc        1000'   './extract-opt-gcc        1000'
Benchmark #1: ./extract-int-gcc        1000
  Time (mean ± σ):      1.772 s ±  0.040 s    [User: 1.769 s, System: 0.002 s]
  Range (min … max):    1.728 s …  1.850 s    10 runs
 
Benchmark #2: ./extract-opt-gcc        1000
  Time (mean ± σ):      2.190 s ±  0.023 s    [User: 2.186 s, System: 0.002 s]
  Range (min … max):    2.160 s …  2.233 s    10 runs
 
Summary
  './extract-int-gcc        1000' ran
    1.24 ± 0.03 times faster than './extract-opt-gcc        1000'
hyperfine './extract-int-clang      1000'   './extract-opt-clang      1000'
Benchmark #1: ./extract-int-clang      1000
  Time (mean ± σ):      1.858 s ±  0.027 s    [User: 1.857 s, System: 0.000 s]
  Range (min … max):    1.818 s …  1.902 s    10 runs
 
Benchmark #2: ./extract-opt-clang      1000
  Time (mean ± σ):      2.386 s ±  0.080 s    [User: 2.382 s, System: 0.002 s]
  Range (min … max):    2.265 s …  2.495 s    10 runs
 
Summary
  './extract-int-clang      1000' ran
    1.28 ± 0.05 times faster than './extract-opt-clang      1000'
hyperfine './extract-int-tcc-prod   1000'   './extract-opt-tcc-prod   1000'
Benchmark #1: ./extract-int-tcc-prod   1000
  Time (mean ± σ):      1.942 s ±  0.030 s    [User: 1.940 s, System: 0.001 s]
  Range (min … max):    1.904 s …  1.985 s    10 runs
 
Benchmark #2: ./extract-opt-tcc-prod   1000
  Time (mean ± σ):      2.702 s ±  0.045 s    [User: 2.698 s, System: 0.001 s]
  Range (min … max):    2.628 s …  2.798 s    10 runs
 
Summary
  './extract-int-tcc-prod   1000' ran
    1.39 ± 0.03 times faster than './extract-opt-tcc-prod   1000'
hyperfine './extract-int-gcc-prod   1000'   './extract-opt-gcc-prod   1000'
Benchmark #1: ./extract-int-gcc-prod   1000
  Time (mean ± σ):     495.2 ms ±   5.7 ms    [User: 493.4 ms, System: 1.5 ms]
  Range (min … max):   488.8 ms … 508.3 ms    10 runs
 
Benchmark #2: ./extract-opt-gcc-prod   1000
  Time (mean ± σ):     520.5 ms ±   5.0 ms    [User: 518.1 ms, System: 2.2 ms]
  Range (min … max):   514.8 ms … 528.1 ms    10 runs
 
Summary
  './extract-int-gcc-prod   1000' ran
    1.05 ± 0.02 times faster than './extract-opt-gcc-prod   1000'
hyperfine './extract-int-clang-prod 1000'   './extract-opt-clang-prod 1000'
Benchmark #1: ./extract-int-clang-prod 1000
  Time (mean ± σ):     539.9 ms ±  10.1 ms    [User: 536.4 ms, System: 3.0 ms]
  Range (min … max):   523.5 ms … 554.0 ms    10 runs
 
Benchmark #2: ./extract-opt-clang-prod 1000
  Time (mean ± σ):     532.8 ms ±   9.9 ms    [User: 530.6 ms, System: 1.5 ms]
  Range (min … max):   516.3 ms … 548.9 ms    10 runs
 
Summary
  './extract-opt-clang-prod 1000' ran
    1.01 ± 0.03 times faster than './extract-int-clang-prod 1000'

Except for the last case (./extract-opt-clang-prod vs extract-int-clang-prod), the version returning pure int was faster, varying between 5% to 39% faster, compared to the version that returns ?int.

When using -prod, both clang and gcc managed to optimize well the difference.

I have to note however, that when I compared the more common mixed usage (.index_after calls intermixed with other methods like .substr), the performance difference becomes insignificant, since other more important factors become more dominant (allocation mostly).

spytheman avatar Mar 19 '22 09:03 spytheman

Also this PR doesn't endorse safe-by-default behavior - i.e. the syntactically shorter variant shall be the one returning int | none and the syntactically longer variant shall return int.

Another important consideration is readability - the ?int using code, does look more readable (and more compact), compared to the int version, because of the syntax support that V and vfmt provide for that common pattern.

Compare:

        for i := 0; i < max; i++ {
                mut start := -1
                mut end := -1
                for {
                        start = raw.index_after(atag, end)
                        if start == -1 {
                                break
                        }
                        start = raw.index_after(href, start)
                        if start == -1 {
                                break
                        }
                        start += 6
                        end = start
                        start = raw.index_after(gt, end)
                        if start == -1 {
                                break
                        }
                        start += 1
                        end = raw.index_after(lt, start)
                        if end == -1 {
                                break
                        }
                }
                checksum+=u64(end)
        }

vs

        for i := 0; i < max; i++ {
                mut start := -1
                mut end := -1
                for {
                        start = raw.index_after_opt(atag, end) or { break }
                        start = raw.index_after_opt(href, start) or { break }
                        start += 6
                        end = start
                        start = raw.index_after_opt(gt, end) or { break }
                        start += 1
                        end = raw.index_after_opt(lt, start) or { break }
                }
                checksum+=u64(end)
        }

Given all of the above, I think the faster int version should stay, and the optional ?int versions should be just wrappers, that call the int versions, with an _opt suffix in their names, because using the shorter names for the ?int versions will be a breaking change.

In this way, users could call the more inconvenient, but faster versions where that matters, and the more convenient _opt variants, outside of hot loops, while code duplication will be minimized, and breaking changes will be avoided.

@medvednikov what do you think?

spytheman avatar Mar 19 '22 10:03 spytheman

using the shorter names for the ?int versions will be a breaking change

Actually, I am wrong about this - currently on master, .index() does return an ?int, so this PR will be a breaking change, even in case we decide to have _opt suffixes for the slower but more convenient ?int versions, and the shorter names for the int returning methods.

In that case, I do support using ?int for the shorter names and an _int suffix for the faster less convenient int returning versions, while the ?int methods should still be wrappers around the faster int versions.

spytheman avatar Mar 19 '22 10:03 spytheman

Any change we make will be a breaking change. In the current state we have some methods returning int and some returning ?int. My reason for this MR was to make them all follow the same standard, it was not an attempt to push ?int as the default.

Let me know what is the final decision after you're all in agreement so I can update the PR.

Cheers.

txgruppi avatar Mar 19 '22 21:03 txgruppi

@spytheman @dumblob

I think it is ready for you.

txgruppi avatar Apr 30 '22 00:04 txgruppi

I tried running v test-all in Windows and OSX but I'm not familiar with these and I'm not sure I did a good job on them.

txgruppi avatar Apr 30 '22 00:04 txgruppi

@txgruppi please fix VLS

server/diagnostics.v:12:24: error: index_after() returns an option, so it should have either an `or {}` block, or `?` at the end
   10 |     }
   11 |
   12 |     line_colon_idx := msg.index_after(':', 2) // deal with `d:/v/...:2:4: error: ...`
      |                           ~~~~~~~~~~~~~~~~~~~
   13 |     if line_colon_idx < 0 {
   14 |         return none
server/diagnostics.v:21:23: error: index_after() returns an option, so it should have either an `or {}` block, or `?` at the end
   19 |     }
   20 |
   21 |     col_colon_idx := msg.index_after(':', line_colon_idx + 1)
      |                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   22 |     colon_sep_idx := msg.index_after(':', col_colon_idx + 1)
   23 |     msg_type_colon_idx := msg.index_after(':', colon_sep_idx + 1)
server/diagnostics.v:22:23: error: index_after() returns an option, so it should have either an `or {}` block, or `?` at the end
   20 |
   21 |     col_colon_idx := msg.index_after(':', line_colon_idx + 1)
   22 |     colon_sep_idx := msg.index_after(':', col_colon_idx + 1)
      |                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   23 |     msg_type_colon_idx := msg.index_after(':', colon_sep_idx + 1)
   24 |     if msg_type_colon_idx == -1 || col_colon_idx == -1 || colon_sep_idx == -1 {
server/diagnostics.v:23:28: error: index_after() returns an option, so it should have either an `or {}` block, or `?` at the end
   21 |     col_colon_idx := msg.index_after(':', line_colon_idx + 1)
   22 |     colon_sep_idx := msg.index_after(':', col_colon_idx + 1)
   23 |     msg_type_colon_idx := msg.index_after(':', colon_sep_idx + 1)
      |                               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   24 |     if msg_type_colon_idx == -1 || col_colon_idx == -1 || colon_sep_idx == -1 {
   25 |         return error('idx is -1')

StunxFS avatar May 01 '22 00:05 StunxFS

@StunxFS https://github.com/vlang/vls/pull/329

txgruppi avatar May 02 '22 17:05 txgruppi

Any news here?

txgruppi avatar May 13 '22 04:05 txgruppi

What are the failing checks? I think we can re-run CI as the PR looks OK to me.

dumblob avatar May 13 '22 13:05 dumblob

@txgruppi the PR is fine, only a small fix was missing in ved, which I have already done: https://github.com/vlang/ved/pull/140.

StunxFS avatar May 19 '22 23:05 StunxFS

This PR is ready to be merged then, as the only failing test is due to https://github.com/vlang/ved/pull/140 not yet merged

Hunam6 avatar May 26 '22 18:05 Hunam6

I want to update you all.

I was adding more tests to cover the new methods but I got some urgent stuff to do at work and this is taking all the time I had to work on this MR.

As soon as things go back to normal at work I will add tests.

txgruppi avatar Jun 03 '22 15:06 txgruppi

(it was to resolve conflicts, not useless don't worry Delyan :))

Hunam6 avatar Jun 09 '22 16:06 Hunam6

This one has been out a while - any further thoughts?

JalonSolov avatar Dec 17 '22 01:12 JalonSolov

6 conflicting files need to be fixed, now.

JalonSolov avatar Aug 06 '23 15:08 JalonSolov