stream-lib icon indicating copy to clipboard operation
stream-lib copied to clipboard

Recordinality implementation

Open positiveblue opened this issue 9 years ago • 11 comments

  • I have implemented Recordinality, a stream algorithm. Recordinality class extends from ICardinality and Serialize. You should check the serialize process and the merge (return an exeption because Recordinality doesn't support it).
  • I have implemented some tests for the Recordinality class.
  • ReadMe modified, now with Recordinality paper information.

positiveblue avatar Jan 22 '16 11:01 positiveblue

The algorithm and implementation look broadly correct, but there's some formatting and minor implementation details that would be nice to be addressed. eg. If merge isn't supported, it is fine to just throw an unsupported operation exception.

One note though: Based on the paper, it seems like recordinality is pretty straightforwardly inferior to the existing hllp algorithm. Maybe if your values are naturally uniformly random longs, you could skip hashing them and thereby preserve a sample set, but that seems a bit outlandish.

I'm not opposed to including even if only for completionist or academic reasons (there seems to be a variety of suggested areas of further research), however, it seems appropriate document the class accordingly.

tea-dragon avatar Jan 27 '16 01:01 tea-dragon

I don't mind handling such mundane things directly, so if you'd rather, I can just pull and then clean it up myself. Just lmk

tea-dragon avatar Jan 27 '16 01:01 tea-dragon

Hello Ian,

Thanks for taking a look at my code. I have written a basic version of Recordinality. The main reason has been that this pull request is to learn how the library work, which is the process to contribute, etc...

As you have said, it's possible to run the algorithm without hash the item before and it could be useful for strings for example. However, there are more Recordinality variations that could be interesting.For instance, Adaptive Recordinality (AR) has the feature that let them increase the memory usage if it's necessary. If you don't know which is te maximum cardinality you can run AR with only one position and let them increase if it's necesary.

I'm going to address the format and the merge 'errors' and after that will resubmit the pull request. After this, I will iterate more times adding some features (if it's right).

Thanks.

positiveblue avatar Jan 27 '16 13:01 positiveblue

Well,

I have reformated the code and added the UnsoprttedOperationException() to the merge function. However I have some questions about the format tools: do you use some tool for check the format? I am not a Java programmer, now I am using IntelliJ but I don't know the ecosystem.

If there are some problems with this implementation let me know. As I told, I will implement the adaptive version in a near future.

Thanks.

positiveblue avatar Feb 03 '16 09:02 positiveblue

We use intellij (long live it's glory!), so we mostly just have a formatting style we import/export to share. Deviating from it is fine, but it usually covers most minor issues in an unobjectionable manor. I'll make sure a recent version is available somewhere -- at that point you can just import settings from the jar and it should properly create an "AddThis" codestyle without interfering with other things.

tea-dragon avatar Feb 03 '16 15:02 tea-dragon

As a quick kinda hint, generally the start of a file goes "copywriter/license headers", "package/imports", "class javadoc". I always have intellij fold (aka kinda hide) those first two components, so it works out fine in practice too.

tea-dragon avatar Feb 03 '16 15:02 tea-dragon

(That said, today is kind of tightly scheduled -- think I'm already late for things -- so you might not see results from my promises until tomorrow)

tea-dragon avatar Feb 03 '16 15:02 tea-dragon

Looks like all tests are passing. Can you take a look at an example file for formatting and headers? https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/ConcurrentStreamSummary.java The copyright should be at the top of the doc (before package)... thanks!

abramsm avatar Mar 24 '17 12:03 abramsm

I will recheck the headers this evening (PTZ).

positiveblue avatar Mar 27 '17 20:03 positiveblue

Thanks, also suggest moving this to the same 'research' package mentioned for the HyperBitBit since it doesn't seem likely that this would be a preferable option to HLLP in production uses. Let me know if you feel differently.

abramsm avatar Mar 28 '17 11:03 abramsm

I agree with your suggestion.

Anyway, let me reimplement Recordinality and open some discussion on the mailing list because I think it has some "killer features" that could be useful.

positiveblue avatar Mar 28 '17 19:03 positiveblue