stdlib icon indicating copy to clipboard operation
stdlib copied to clipboard

Make `list.unique` logarithmic instead of quadratic

Open radekm opened this issue 1 year ago • 3 comments
trafficstars

Fixes https://github.com/gleam-lang/stdlib/issues/667. ~New list.unique implementation uses gleam/set to detect whether item has already been seen.~ New list.unique implementation uses gleam/dict to detect whether item has already been seen (it can't use gleam/set because that would create import cycle).

radekm avatar Aug 14 '24 09:08 radekm

But unfortunately there's an import cycle.

The import statements for these modules form a cycle:

    ┌─────┐
    │     gleam/result
    │     ↓
    │     gleam/set
    │     ↓
    │     gleam/list
    └─────┘

radekm avatar Aug 14 '24 09:08 radekm

Looks like you have a compile error!

lpil avatar Aug 14 '24 09:08 lpil

So I have to come up with a different solution

radekm avatar Aug 14 '24 09:08 radekm

This looks good to me! I've left a small note inline. Also could you update the changelog?

giacomocavalieri avatar Nov 12 '24 22:11 giacomocavalieri

Thanks for the review. I have fixed the comment and added an entry to the changelog.

radekm avatar Nov 13 '24 10:11 radekm

Minor suggestion: consider using a Set rather than a Dict. I know Set uses Dict under the covers, but conceptually you are using your Dict as a Set, so just use a Set.

chuckwondo avatar Nov 13 '24 10:11 chuckwondo

It can't use a set since that would cause a dependency cycle and those are not allowed in Gleam. Using a Dict directly sounds like the best option here

giacomocavalieri avatar Nov 13 '24 11:11 giacomocavalieri

It can't use a set since that would cause a dependency cycle and those are not allowed in Gleam. Using a Dict directly sounds like the best option here

Ah, thanks for the clarification. Perhaps adding a comment to the code about that point would prevent future readers from making the same comment/question.

chuckwondo avatar Nov 13 '24 11:11 chuckwondo

Yeah good idea!

giacomocavalieri avatar Nov 13 '24 11:11 giacomocavalieri

Thanks, comment added

radekm avatar Nov 13 '24 11:11 radekm

Thank you. How does this version benchmark compared to the previous one? It would be fab to see what this tool returns for lists of different sizes.

lpil avatar Nov 14 '24 16:11 lpil

Are you still working on this @radekm ? Thank you

lpil avatar Dec 07 '24 21:12 lpil

@lpil I did the benchmarking using glychee and these are the results on a list of growing size of unique elements (the worst case for the algorithm as all elements are unique):

old unique
   size        ips     average     99th %     memory
    100   25700.00    38.91 μs   42.96 μs  155.23 KB
   1000     202.02     4.95 ms    5.06 ms   13.94 MB
 10_000       2.15   465.51 ms  473.69 ms    1.37 GB
100_000       0.02    53.04 s    53.04 s   140.07 GB

new unique
   size        ips     average     99th %     memory
    100  155200.00     6.44 μs    8.96 μs   28.88 KB
   1000    9470.00   105.63 μs  198.97 μs  342.01 KB
 10_000     558.32     1.79 ms    2.25 ms    4.24 MB
100_000      36.29    27.56 ms   36.97 ms   53.56 MB

This is a huge improvement! Just for fun I tried implementing this without using fold and with an explicit _loop function seeing if it would make any difference. These are the results, a tiny bit faster and less memory hungry but nothing too crazy. Even if it's not a crazy improvement over the fold-ing one I'd still go for this version with the _loop one.

new unique (_loop version)
   size        ips     average     99th %     memory
    100  162930.00     6.14 μs    8.38 μs   26.03 KB
   1000   10920.00    91.58 μs  164.96 μs  319.03 KB
 10_000     556.32     1.80 ms    2.24 ms    4.00 MB
100_000      37.43    26.72 ms   36.45 ms   51.73 MB
pub fn unique(list: List(a)) -> List(a) {
  unique_loop(list, dict.new(), [])
}

fn unique_loop(list: List(a), seen: Dict(a, Nil), acc: List(a)) -> List(a) {
  case list {
    [] -> reverse(acc)
    [first, ..rest] ->
      case dict.has_key(seen, first) {
        True -> unique_loop(rest, seen, acc)
        False ->
          unique_loop(rest, dict.insert(seen, first, Nil), [first, ..acc])
      }
  }
}

giacomocavalieri avatar Dec 18 '24 17:12 giacomocavalieri

Would anyone like to finish up this PR? 🙏

lpil avatar Jan 23 '25 18:01 lpil

I can pick this up!

giacomocavalieri avatar Jan 23 '25 18:01 giacomocavalieri

Sorry, at the moment I have a lot of other work.

radekm avatar Jan 24 '25 20:01 radekm

Thank you both!!

lpil avatar Jan 25 '25 19:01 lpil