
Could statistical functions be added for use after group_by?

Open linearhinos opened this issue 7 years ago • 4 comments

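After a group_by on a key, I would like a convenient way to compute the mean, compute the variance, or sort and then iterate over the values. I hope built-in functions like these can be provided.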

linearhinos avatar Dec 22 '17 11:12 linearhinos

I can't understand what you mean.

yshysh avatar Dec 25 '17 08:12 yshysh

I mean: in addition to sum() and count(), could bigflow support mean(), variance(), and other popular statistical functions for PCollection?

linearhinos avatar Dec 25 '17 09:12 linearhinos

Actually, you can use:

def mean(p):
    return p.sum() / p.count()   
    # this is a sugar for p.sum().map(lambda s, c: s / c, p.count())

to implement mean in one line.

Then you can use it in apply_values, e.g.:

p.group_by_key()\
  .apply_values(mean)

If you want to apply it to a global PCollection instead, you can just use apply:

p.apply(mean) 

or just call it directly:

mean(p)

These functions are easy to implement, so we don't provide them as built-in methods.
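The same idea should extend to the variance asked about above. Here is a minimal plain-Python sketch (not Bigflow code) of population variance built from three aggregates: the count, the sum, and the sum of squares. The function name `variance_from_aggregates` is made up for illustration. In Bigflow the third aggregate would presumably be something like p.map(lambda x: x * x).sum(), but that is untested here.

```python
def variance_from_aggregates(values):
    """Population variance as E[x^2] - (E[x])^2, computed from three aggregates."""
    n = len(values)                          # plays the role of p.count()
    total = sum(values)                      # plays the role of p.sum()
    total_sq = sum(v * v for v in values)    # sum of squares
    mean = total / float(n)
    return total_sq / float(n) - mean * mean


print(variance_from_aggregates([1, 2, 3, 4]))
```

This form needs only aggregates that can be computed in one pass, but it can lose precision when the values are large relative to their spread.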

If you find it difficult to write these functions, you can always use transforms.make_tuple(pobject1, pobject2). For example, you can implement mean with transforms.make_tuple like this:

def mean(p):
    # Index into the (sum, count) tuple; tuple-unpacking lambdas are Python 2 only.
    return transforms.make_tuple(p.sum(), p.count()).map(lambda t: t[0] / t[1])

You can also write a function that returns both the sum and the mean, and use it in apply_values like this:

def sum_and_mean(p):
    return transforms.make_tuple(p.sum(), p.apply(mean))

p.group_by_key().apply_values(sum_and_mean)
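To make the intended result concrete, here is a plain-Python model of that pipeline. It is only a sketch of the semantics, assuming each key ends up associated with a (sum, mean) tuple. The helpers `local_group_by_key`, `local_apply_values`, and `local_sum_and_mean` are hypothetical stand-ins and not part of Bigflow.

```python
from collections import defaultdict


def local_group_by_key(pairs):
    """Collect (key, value) pairs into a dict of key -> list of values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def local_sum_and_mean(values):
    """Return (sum, mean) for one group's values."""
    total = sum(values)
    return total, total / float(len(values))


def local_apply_values(groups, fn):
    """Apply fn to each group's values independently."""
    return {key: fn(values) for key, values in groups.items()}


pairs = [("a", 1), ("a", 3), ("b", 10)]
result = local_apply_values(local_group_by_key(pairs), local_sum_and_mean)
print(result)
```

Any per-group statistic written as a function of one group's values can be passed in the same way.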

acmol avatar Dec 25 '17 09:12 acmol

I think there should be a module that provides commonly used functions like these.

chunyang-wen avatar Dec 31 '17 14:12 chunyang-wen