bigflow
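Could statistical functions be added for use after group_by?
After a group_by on the key, I would like an easy way to compute the mean or the variance, or to sort and then iterate over each group. Could built-in functions like these be provided?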
I don't understand what you mean.
I mean: in addition to sum() and count(), could bigflow support mean(), variance(), and other common statistical functions for PCollection?
Actually, you can implement mean in one line:
    def mean(p):
        # this is sugar for p.sum().map(lambda s, c: s / c, p.count())
        return p.sum() / p.count()
Then you can use it in apply_values, e.g.:
    p.group_by_key() \
        .apply_values(mean)
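To make the semantics concrete, here is a rough plain-Python sketch of what grouping by key and then applying mean to each group's values computes. It does not use bigflow at all: `group_and_apply` is a hypothetical helper, and the list-based `mean` is only a stand-in for the PCollection version above.

```python
from collections import defaultdict


def mean(values):
    # Plain-Python stand-in for p.sum() / p.count(); float() avoids
    # integer division when the inputs are ints.
    return sum(values) / float(len(values))


def group_and_apply(pairs, fn):
    """Hypothetical helper: group (key, value) pairs by key, then apply
    fn to the list of values in each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: fn(values) for key, values in groups.items()}


pairs = [("a", 1), ("a", 3), ("b", 10)]
result = group_and_apply(pairs, mean)
```

Here `result` maps each key to the mean of its values, which is the per-group result that the group_by_key / apply_values chain is meant to produce.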
Likewise, if you want to use it on a global PCollection, you can just use apply:
    p.apply(mean)
or just call it directly:
    mean(p)
Because these functions are easy to implement, we don't provide them as built-in methods.
If you find these functions difficult to write, you can always use transforms.make_tuple(pobject1, pobject2).
E.g., you can use transforms.make_tuple to implement mean like this:
    from bigflow import transforms

    def mean(p):
        # Python 2 syntax: the lambda unpacks the (sum, count) tuple
        return transforms.make_tuple(p.sum(), p.count()).map(lambda (s, c): s / c)
You can also implement a function that returns both the sum and the mean, and use it in apply_values like this:
    def sum_and_mean(p):
        return transforms.make_tuple(p.sum(), p.apply(mean))

    p.group_by_key().apply_values(sum_and_mean)
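The variance asked about earlier can be built from aggregates in the same spirit. Below is a minimal plain-Python sketch of the arithmetic only, assuming population variance computed as E[x^2] - (E[x])^2. It is not bigflow code, and the function name `variance` is just illustrative. In bigflow you would presumably combine a count, a sum, and a sum of squares in a similar way.

```python
def variance(values):
    # Population variance from three aggregates: count, sum, sum of squares.
    n = float(len(values))
    total = sum(values)
    total_sq = sum(x * x for x in values)
    m = total / n
    # E[x^2] - (E[x])^2
    return total_sq / n - m * m


v = variance([1, 2, 3, 4])
```

Note that this formula can lose precision when the values are large relative to their spread, so a two-pass computation (mean first, then mean of squared deviations) may be the safer choice for real data.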
I think there should be a module that provides commonly used functions like these.