core-libraries-committee icon indicating copy to clipboard operation
core-libraries-committee copied to clipboard

Add Data.List.compareLength

Open Bodigrim opened this issue 1 year ago • 18 comments

Let's add to Data.List a new function:

compareLength :: [a] -> Int -> Ordering
compareLength = foldr (\_ f m -> if 0 > m then GT else f (m - 1)) (compare 0)

such that

compareLength xs c = compare (length xs) c

This function gives the same answer as comparing against the result of length, but can short circuit if the count of elements is greater than the number, and hence be more efficient. It is also productive on infinite inputs, for which compare (length xs) c would loop infinitely:

> compareLength [] 0 
EQ 
> compareLength [] 1
LT
> compareLength ['a'] 1
EQ
> compareLength ['a', 'b'] 1
GT
> compareLength [0..] 100
GT

Seasoned Haskellers know that the same effect can be achieved with genericLength and Peano numbers (https://chrisdone.com/posts/lazy-list-length/), but let's not pretend this to be ergonomic or efficient.

Prior art:

Bodigrim avatar Feb 21 '24 22:02 Bodigrim

  1. Is the other argument order also seen in the wild? I would have expected the list to be last.
  2. An equally important function is the continuous extension of compare `on` length. I believe we should consider the names of both functions at the same time.

treeowl avatar Feb 21 '24 23:02 treeowl

Is the other argument order also seen in the wild? I would have expected the list to be last.

No, I cannot find any examples for Int -> [a] -> Ordering.

An equally important function is the continuous extension of compare `on` length. I believe we should consider the names of both functions at the same time.

Prior art in extra suggests comparingLength :: [a] -> [a] -> Ordering.

Bodigrim avatar Feb 21 '24 23:02 Bodigrim

Another function whose name I think we should have in the backs of our minds when considering this proposal is the one checking whether the length of a list is at least a particular number, and the two-list variants of that one (i.e., the extensions of (>=) `on` length and (>=) `on` length, extra lazy in the first and second argument, respectively). This zoo is out there in the wild (e.g., the GHC source code), but I don't know if anyone's come up with a nice, consistent, naming scheme.

Edit: similarly, less than and greater than.

treeowl avatar Feb 21 '24 23:02 treeowl

similar: https://hackage.haskell.org/package/utility-ht-0.0.17.1/docs/Data-List-Match.html#v:compareLength

A simple implementation would be:

compareLength :: [a] -> Int -> Ordering
compareLength xs n = compare (void xs) (replicate n ())

thielema avatar Feb 21 '24 23:02 thielema

I'm not sure I want to bikeshed naming of other similar functions right now. There is consistent prior art to use compareLength for [a] -> Int -> Ordering, and I'm happy to leave further developments to others. I do not think that such developments should steer us away from a well-attested name.

With regards to lazy analogs of length xs {<,<=,>,>=} c: while they could be marginally better than compareLength in certain scenarios, I'll leave it to others to convince CLC members that they are worth inclusion into base. IMO compareLength solves 95% of issues with performance and laziness.


compareLength xs n = compare (void xs) (replicate n ()) is neat, but foldr in the top post is much better in terms of performance. It compiles to a tight little loop without any extra allocations:

compareLength1 :: forall {a}. [a] -> Int# -> Ordering
compareLength1
  = \ (@a_s13T) (ds_s13L :: [a_s13T]) (ww_s13O :: Int#) ->
      case ds_s13L of {
        [] ->
          case <# 0# ww_s13O of {
            __DEFAULT ->
              case ww_s13O of {
                __DEFAULT -> GT;
                0# -> EQ
              };
            1# -> LT
          };
        : y_a13H ys_a13I ->
          case ># 0# ww_s13O of {
            __DEFAULT -> compareLength1 ys_a13I (-# ww_s13O 1#);
            1# -> GT
          }
      }

Bodigrim avatar Feb 21 '24 23:02 Bodigrim

This seems like a solid improvement and follows prior art. I'm in favor.

parsonsmatt avatar Feb 22 '24 00:02 parsonsmatt

What should be the result of compareLength undefined (-1)?

For Data.ByteString.Lazy.compareLength it is GT. For Data.Text.compareLength and Data.List.Extra.compareLength it is undefined.

In other words, should we be lazy in the list if the length is negative?

meooow25 avatar Feb 22 '24 01:02 meooow25

There seem to be three sensible functions with this name:

compareLength_L :: Int -> [a] -> Ordering
compareLength_R :: [a] -> Int -> Ordering
compareLength_C :: [a] -> [a] -> Ordering

and no clear winner. Of compareLength_L and compareLength_R I would prefer compareLength_L because most often the Int argument will be a constant, which allows for partial application like in map (compareLength 1).

I guess you want to save an import of a utility library. However the price to pay is to restrict your code to a recent version GHC. I would not add such a function to base.

thielema avatar Feb 22 '24 08:02 thielema

I'm finding it hard to understand where this would be used. Could someone point to examples in the wild?

tomjaguarpaw avatar Feb 22 '24 08:02 tomjaguarpaw

I'm finding it hard to understand where this would be used. Could someone point to examples in the wild?

I don't remember the names of the functions, or the module in which they're defined, but GHC uses several such functions. Vague recollection: try git grep lengthAtLeast.

treeowl avatar Feb 22 '24 08:02 treeowl

The text and bytestring versions of this function come with an extensive set of rewrite rules, should these be added here too?

And what about generalising to Foldable and/or Num?

sjoerdvisscher avatar Feb 22 '24 13:02 sjoerdvisscher

As @treeowl mentions, GHC uses:

lengthAtMost :: [a] -> Int -> Bool 
lengthIsNot :: [a] -> Int -> Bool 
lengthIs :: [a] -> Int -> Bool  
lengthAtLeast :: [a] -> Int -> Bool 
lengthExceeds :: [a] -> Int -> Bool 
lengthLessThan :: [a] -> Int -> Bool 

compareLength :: [a] -> [b] -> Ordering 
equalLength :: [a] -> [b] -> Bool 

These are used to avoid forcing lazy lists or to avoid traversing potentially long lists.

hsyl20 avatar Feb 22 '24 14:02 hsyl20

What should be the result of compareLength undefined (-1)? ... In other words, should we be lazy in the list if the length is negative?

@meooow25 A weak precedent is that take is lazy in the list if the length is non-positive:

> take 0 undefined
[]

Ideally I'd love to have compareLength :: [a] -> Word -> Ordering, which does not pose this kind of tricky questions, but it would be out of line with the rest of Data.List API.


I guess you want to save an import of a utility library. However the price to pay is to restrict your code to a recent version GHC. I would not add such a function to base.

@thielema this seems to be a general argument against adding anything ever to base, isn't it? Is this proposal somehow particularly aggravating?

I want base to exhibit best practices and help users to avoid common peformance / divergence pitfalls. Users can continue using utility libraries such as extra, which will conditionally re-export a definition from base, so there is no price to pay in terms of restricting to new GHCs only. We followed this pattern already for several functions, e. g., (!?) and unsnoc.


@tomjaguarpaw this proposals stems from our discussion with @Noughtmare at https://github.com/kowainik/stan/issues/550.

Bodigrim avatar Feb 24 '24 00:02 Bodigrim

@Bodigrim that makes sense. But the proposed implementation needs some changes if it is to be lazy.

meooow25 avatar Feb 24 '24 05:02 meooow25

Since GHC uses compareLength :: [a] -> [b] -> Ordering (since things with same type only are comparable) and lengthSomething :: [a] -> Int -> Bool , it is good to name them:

lengthGuess :: [a] -> Int -> Ordering
-- or
guessLength :: [a] -> Int -> Ordering


compareLength :: [a] -> [b] -> Ordering 

VitWW avatar Feb 26 '24 06:02 VitWW

I still like compareLength more because it is the concatenation of the two functions you would otherwise use: compare (length xs) n. I'd use comparingLength for the function that GHC calls compareLength, because that's again the concatenation of the two functions you'd use: comparing length xs ys

noughtmare avatar Feb 26 '24 12:02 noughtmare

@sjoerdvisscher excellent questions!

The text and bytestring versions of this function come with an extensive set of rewrite rules, should these be added here too?

Rules rewriting compare (length xs) n to compareLength xs n could change semantics, making diverging programs terminate. While I'd expect such change to be a desirable one in the majority of cases, I'd rather left it to a (potential) subsequent proposal.

In theory users might use length xs as a poor man's deepseq and expect that after compare (length xs) n the spine of xs has been forced. Is it a reasonable expectation?.. Could it be crucial property for certain programs not to explode?.. Let me leave it to someone else to explore.

And what about generalising to Foldable [...]?

While the definition above indeed works for any Foldable, it's hard to tell whether compareLength is better than (compare .) . length for a given Foldable or no. If length is O(1) then compare (length xs) n is faster than compareLength xs n which is O(n).

Note that introduction of a monomorphic Data.List.compareLength does not preclude anyone from raising a proposal to add Data.Foldable.compareLength or whatever they fancy to argue for.

I think a good addition to the proposal would be Data.List.NonEmpty.compareLength, because we know that length is inefficient for NonEmpty too.

And what about generalising to [...] Num?

It would be quite attractive for two reasons:

  • to use Word instead of Int so that negative numbers are not an option,
  • to replace compare (genericLength xs) (10 :: Int8) (which does not do what you think) with compareLength xs (10 :: Int8) (which does what you think).

Yet I imagine that in the majority of cases n in compareLength xs n is a literal constant, so the change would force everyone to write an explicit type as in compareLength xs (10 :: Int), which will be annoying soon. So I think I'd stick to monomorphic Int, continuing the notorious tradition of Data.List.

Bodigrim avatar Feb 29 '24 20:02 Bodigrim

@Bodigrim do you have an MR prepared for this function? More than once since you brought it up have I wished we already had it in base.

mixphix avatar Mar 22 '24 15:03 mixphix

I've updated the proposal and linked the MR.

Bodigrim avatar Jun 09 '24 16:06 Bodigrim

I suggest a couple of changes.

 compareLength :: [a] -> Int -> Ordering
 compareLength xs n
   | n < 0 = GT
-  | otherwise = foldr (\_ f m -> if 0 > m then GT else f (m - 1)) (compare 0) xs n
+  | otherwise = foldr (\_ f m -> if m > 0 then f (m - 1) else GT) (\m -> if m > 0 then LT else EQ) xs n
  • In the fold function, m being 0 is enough to tell us that the list is longer.
  • Given that m is never negative, the final check does not need to test for GT because it's impossible.

meooow25 avatar Jun 10 '24 13:06 meooow25

@meooow25 could you possibly elaborate what's improved by the suggested change?

Bodigrim avatar Jun 10 '24 20:06 Bodigrim

Sure,

  • If we stop at m == 0, we have to look at one less element to conclude with a result. This is observable with an example like
    λ> foldr (\_ f m -> if 0 > m then GT else f (m - 1)) (compare 0) (1:2:undefined) 1
    *** Exception: Prelude.undefined
    ...
    λ> foldr (\_ f m -> if m > 0 then f (m - 1) else GT) (compare 0) (1:2:undefined) 1
    GT
    
  • compare 0 is short for (an equivalent of) \m -> if 0 < m then LT else (if 0 == m then EQ else GT). With the above change, the last branch which requires m < 0 is impossible. So the 0 == m test will always be true and can be removed.

meooow25 avatar Jun 10 '24 21:06 meooow25

@meooow25 thanks, I've update the MR. Does it look good now?

Bodigrim avatar Jun 15 '24 21:06 Bodigrim

Looks good, thanks!

I am wondering if it would be useful to INLINE compareLength for list fusion, but I am not convinced. Usually you would want to do something with the list after checking the length, which would kill fusion. I leave this to someone with a stronger opinion.

meooow25 avatar Jun 15 '24 22:06 meooow25

My preference is to apply {-# INLINE #-} only if well justified, so I'd rather leave it to another proposal / proposer to investigate.

Bodigrim avatar Jun 15 '24 22:06 Bodigrim

I'll trigger a vote soon unless there are more comments / suggestions.

Bodigrim avatar Jun 18 '24 22:06 Bodigrim

Dear CLC members, let's vote on the proposal to add Data.List{,.NonEmpty}.compareLength :: [a] -> Int -> Ordering, which is a safer and faster alternative to compare (length xs) n. Please refer to the top post for more details, examples and prior art in existing libraries.

@hasufell @velveteer @mixphix @tomjaguarpaw @angerman @parsonsmatt


+1 from me, unsurprisingly.

Bodigrim avatar Jun 21 '24 19:06 Bodigrim

+1, anything that reduces length on lists is a win

parsonsmatt avatar Jun 21 '24 20:06 parsonsmatt

+1

velveteer avatar Jun 21 '24 21:06 velveteer

@mixphix @hasufell @tomjaguarpaw @angerman just a gentle reminder to vote.

Bodigrim avatar Jun 23 '24 18:06 Bodigrim