
new tool: merge frequency counts

Open roycewilliams opened this issue 4 years ago • 0 comments

Expanding to the more general case mentioned in https://github.com/hops/pack2/issues/8#issuecomment-626259377, it would be very useful to have a tool optimized for merging frequency-count data.

The use case is merging frequency counts across large datasets, and incrementally adding new frequency counts over time as new data is discovered. Calculating a frequency count for a delta or a new data source, and then merging it with an existing frequency count, is significantly more efficient than recalculating the entire frequency count.
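To make the idea concrete, here is a minimal Python sketch of merging a delta into an existing count. It is only an illustration, not a proposal for how pack2 should implement it; the name `merge_counts` and the sample data are hypothetical.

```python
from collections import Counter


def merge_counts(existing, delta):
    """Return a new Counter holding the summed counts of both inputs."""
    merged = Counter(existing)
    # Counter.update() adds counts for matching keys instead of replacing them.
    merged.update(delta)
    return merged


# Hypothetical sample data: an existing frequency count and a newly computed delta.
existing = Counter({"password": 120, "123456": 300})
delta = Counter({"password": 5, "letmein": 2})

merged = merge_counts(existing, delta)
```

Only the delta has to be counted from raw data; the existing totals are reused as they are.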

The uniq -c format (integer frequency count, a space, and the item being counted) is the most obvious case, but other formats could be supported.
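As a rough sketch of what handling that format involves, the Python helpers below parse and emit one line. The names `parse_line` and `format_line` are hypothetical, and the 7-character right-aligned count is an assumption based on GNU-style `uniq -c` output; other implementations may pad differently.

```python
def parse_line(line):
    """Split one 'COUNT ITEM' line into (count, item).

    Leading padding before the count is ignored. Everything after the first
    space following the count is kept verbatim, so items may contain spaces.
    """
    stripped = line.rstrip("\n").lstrip(" ")
    count_text, sep, item = stripped.partition(" ")
    if not sep or not (count_text.isascii() and count_text.isdigit()):
        raise ValueError(f"not a 'COUNT ITEM' line: {line!r}")
    return int(count_text), item


def format_line(count, item):
    """Emit a line with the count right-aligned to 7 columns (assumed width)."""
    return f"{count:7d} {item}"
```

Splitting on the first space only, rather than on all whitespace, matters here because the counted items themselves can contain spaces.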

It would be nice to be able to assume that the list is sorted by the item being counted, but the implementation should assume that it's not. Alternatively, as with rli vs rli2, there could be two versions: one that does not assume sorted input but is memory-bound, and another that has no size limits but requires sorted input (or a single tool with a flag to switch between the two).
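The sorted-input variant could look something like the following Python sketch. It assumes each input is already a stream of `(item, count)` pairs sorted by item in the same order Python uses to compare strings (for example, produced with `LC_ALL=C sort`); `merge_sorted` and the sample data are hypothetical names for illustration.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_sorted(*streams):
    """Yield (item, total) pairs from streams of (item, count) sorted by item.

    Only one pending pair per stream is held at a time, so memory use does
    not grow with the size of the inputs.
    """
    # heapq.merge lazily interleaves the already-sorted streams.
    combined = heapq.merge(*streams, key=itemgetter(0))
    # Equal items are now adjacent, so groupby can sum their counts.
    for item, group in groupby(combined, key=itemgetter(0)):
        yield item, sum(count for _, count in group)


# Hypothetical sample inputs, each sorted by item.
a = [("apple", 3), ("cherry", 1)]
b = [("apple", 2), ("banana", 7)]

result = list(merge_sorted(a, b))
```

The unsorted variant would instead accumulate everything in a hash map, which handles any input order but needs memory proportional to the number of distinct items.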

Reference awk implementation is here.

roycewilliams, May 12 '20 15:05