sampleclean-async
Crowd Context
Include other cols in the task.
This is actually hard to do, since the current code applies a count distinct first and then runs attrdedup.
Hm. Could we rewrite the initial count distinct query as a group by?
e.g. SELECT name, first(col1), first(col2), ... FROM t GROUP BY name
This requires Spark SQL to have a first aggregate, or some other way of getting a value out of the group.
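A rough sketch of the GROUP BY idea, to check that the shape of the query works. This is not the sampleclean code and it does not run on Spark SQL: it uses Python's built-in sqlite3 so it can be run anywhere, and it substitutes MIN for the first aggregate as one possible "other way of getting a value out of the group". The table t, the columns col1 and col2, and the function distinct_with_context are made up for illustration.

```python
import sqlite3


def distinct_with_context(rows):
    """Group by name, keeping a count and one value per extra column.

    rows: list of (name, col1, col2) tuples (hypothetical schema).
    MIN is a stand-in for a `first` aggregate (assumption, not Spark SQL).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (name TEXT, col1 TEXT, col2 TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    query = """
        SELECT name, COUNT(*) AS cnt, MIN(col1) AS col1, MIN(col2) AS col2
        FROM t
        GROUP BY name
        ORDER BY name
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


rows = [
    ("acme", "NY", "10001"),
    ("acme", "NJ", "07001"),
    ("bolt", "CA", "94105"),
]
result = distinct_with_context(rows)
for row in result:
    print(row)
```

One caveat with this stand-in: MIN is evaluated per column, so the context values for a name can come from different source rows. A true first aggregate would not necessarily fix that either, which may or may not matter for what the crowd worker needs to see.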