Make `list.unique` logarithmic instead of quadratic
Fixes https://github.com/gleam-lang/stdlib/issues/667. ~The new `list.unique` implementation uses `gleam/set` to detect whether an item has already been seen.~ The new `list.unique` implementation uses `gleam/dict` to detect whether an item has already been seen (it can't use `gleam/set` because that would create an import cycle).
But unfortunately there's an import cycle.
```
The import statements for these modules form a cycle:

    ┌─────┐
    │ gleam/result
    │ ↓
    │ gleam/set
    │ ↓
    │ gleam/list
    └─────┘
```
Looks like you have a compile error!
So I have to come up with a different solution.
This looks good to me! I've left a small note inline. Also could you update the changelog?
Thanks for the review. I have fixed the comment and added an entry to the changelog.
Minor suggestion: consider using a Set rather than a Dict. I know Set uses Dict under the covers, but conceptually you are using your Dict as a Set, so just use a Set.
It can't use a set since that would cause a dependency cycle and those are not allowed in Gleam. Using a Dict directly sounds like the best option here
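For readers unfamiliar with the trick: a `Dict(a, Nil)` can stand in for a set, using the keys for membership and `Nil` as a throwaway value. A minimal sketch:

```gleam
import gleam/dict

pub fn main() {
  // The keys are the "set" members; Nil is just a placeholder value.
  let seen = dict.new()
  let seen = dict.insert(seen, "a", Nil)
  let _ = dict.has_key(seen, "a") // True
  let _ = dict.has_key(seen, "b") // False
}
```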
Ah, thanks for the clarification. Perhaps adding a comment to the code about that point would prevent future readers from making the same comment/question.
Yeah good idea!
Thanks, comment added
Thank you. How does this version benchmark against the previous one? It would be fab to see the numbers for lists of different sizes.
Are you still working on this, @radekm? Thank you
@lpil I did the benchmarking using glychee and these are the results on a list of growing size of unique elements (the worst case for the algorithm as all elements are unique):
old `unique`

| size | ips | average | 99th % | memory |
| --- | --- | --- | --- | --- |
| 100 | 25700.00 | 38.91 μs | 42.96 μs | 155.23 KB |
| 1000 | 202.02 | 4.95 ms | 5.06 ms | 13.94 MB |
| 10_000 | 2.15 | 465.51 ms | 473.69 ms | 1.37 GB |
| 100_000 | 0.02 | 53.04 s | 53.04 s | 140.07 GB |
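For context, the old implementation was quadratic because each element is compared against everything that follows it. A sketch of that approach (not necessarily the exact old code, and the `unique_quadratic` name is just for illustration):

```gleam
import gleam/list

// For every element we filter the whole remaining list, so n elements
// cost O(n²) comparisons in total — hence the blow-up in the table above.
pub fn unique_quadratic(items: List(a)) -> List(a) {
  case items {
    [] -> []
    [first, ..rest] ->
      [first, ..unique_quadratic(list.filter(rest, fn(x) { x != first }))]
  }
}
```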
new `unique`

| size | ips | average | 99th % | memory |
| --- | --- | --- | --- | --- |
| 100 | 155200.00 | 6.44 μs | 8.96 μs | 28.88 KB |
| 1000 | 9470.00 | 105.63 μs | 198.97 μs | 342.01 KB |
| 10_000 | 558.32 | 1.79 ms | 2.25 ms | 4.24 MB |
| 100_000 | 36.29 | 27.56 ms | 36.97 ms | 53.56 MB |
This is a huge improvement!
Just for fun I tried implementing this without `fold`, using an explicit `_loop` function, to see if it would make any difference. These are the results: a tiny bit faster and a little less memory hungry, but nothing too crazy. Even though it's not a big improvement over the `fold` version, I'd still go for the `_loop` one.
new `unique` (`_loop` version)

| size | ips | average | 99th % | memory |
| --- | --- | --- | --- | --- |
| 100 | 162930.00 | 6.14 μs | 8.38 μs | 26.03 KB |
| 1000 | 10920.00 | 91.58 μs | 164.96 μs | 319.03 KB |
| 10_000 | 556.32 | 1.80 ms | 2.24 ms | 4.00 MB |
| 100_000 | 37.43 | 26.72 ms | 36.45 ms | 51.73 MB |
```gleam
import gleam/dict.{type Dict}

pub fn unique(list: List(a)) -> List(a) {
  unique_loop(list, dict.new(), [])
}

// A Dict is used here rather than a Set as gleam/set imports gleam/list,
// so using it would create an import cycle.
fn unique_loop(list: List(a), seen: Dict(a, Nil), acc: List(a)) -> List(a) {
  case list {
    [] -> reverse(acc)
    [first, ..rest] ->
      case dict.has_key(seen, first) {
        True -> unique_loop(rest, seen, acc)
        False ->
          unique_loop(rest, dict.insert(seen, first, Nil), [first, ..acc])
      }
  }
}
```
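Behaviour is unchanged from the caller's point of view: duplicates are dropped and the first occurrence of each element keeps its position, since new items are prepended to `acc` and the accumulator is reversed at the end. For example:

```gleam
import gleam/list

pub fn main() {
  // The second 4 and the second 1 are dropped; order of first
  // occurrences is preserved.
  let _ = list.unique([1, 4, 7, 4, 5, 1]) // -> [1, 4, 7, 5]
}
```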
Would anyone like to finish up this PR? 🙏
I can pick this up!
Sorry, at the moment I have a lot of other work.
Thank you both!!