ngrammatic icon indicating copy to clipboard operation
ngrammatic copied to clipboard

Adding limit parameter + serde + updated readme + const arity

Open LucaCappelletti94 opened this issue 1 year ago • 12 comments

With this pull request:

  1. I take care of issue #9
  2. Added some more documentation
  3. Integrated GitHub action-based clippy and test suite
  4. Integrated badges and now README is part of documentation, so there cannot be examples in the README that are not working. Note that the Github action badge will only start to work once you merge this pull request, as it points to a currently not existing action.

Cheers!

LucaCappelletti94 avatar Mar 31 '24 17:03 LucaCappelletti94

Now I have also added support for serde.

LucaCappelletti94 avatar Mar 31 '24 20:03 LucaCappelletti94

And now added const generic, similarly to what you are doing in this pull request. Updated also all tests, which are all passing.

LucaCappelletti94 avatar Mar 31 '24 21:03 LucaCappelletti94

I'm pretty happy with the changes overall, just some minor fixes for outdated comments, and trying to close the (unlikely) gap of a 0-arity ngram.

compenguy avatar Apr 01 '24 17:04 compenguy

I am trying to understand which areas that require a non-trivial amount of memory may be trimmed a bit.

For instance, I believe there is some redundancy in the hashmap strings, and having both the text and padded text. Likely, using a trie data structure may significantly reduce the memory footprint of the hashmap.

First thing, I am setting up some benchmarks so that I understand better where I am starting with the optimization. Do you have any particular insight regarding what may be trimmed first?

7.875 GB 100.00% ⏺: ngrammatic::Corpus<ngrammatic::key_transformer::LinkedKeyTransformer<ngrammatic::key_transformer::IdentityKeyTransformer, ngrammatic::key_transformer::LowerKeyTransformer>, 2>
   24  B   0.00% ├╴pad_left: ngrammatic::Pad
                 │ ╰╴Variant: Auto
   24  B   0.00% ├╴pad_right: ngrammatic::Pad
                 │ ╰╴Variant: Auto
4.365 GB  55.43% ├╴ngrams: std::collections::hash::map::HashMap<alloc::string::String, ngrammatic::Ngram<2>>
3.510 GB  44.57% ├╴gram_to_words: std::collections::hash::map::HashMap<alloc::string::String, alloc::vec::Vec<alloc::string::String>>
    0  B   0.00% ╰╴key_transformer: ngrammatic::key_transformer::LinkedKeyTransformer<ngrammatic::key_transformer::IdentityKeyTransformer, ngrammatic::key_transformer::LowerKeyTransformer>
    0  B   0.00%   ├╴src: ngrammatic::key_transformer::IdentityKeyTransformer
    0  B   0.00%   ╰╴dst: ngrammatic::key_transformer::LowerKeyTransformer

LucaCappelletti94 avatar Apr 01 '24 17:04 LucaCappelletti94

@compenguy if you are curious, you can take a look at the ongoing benchmarks here: https://github.com/LucaCappelletti94/ngrammatic/tree/master/benchmarks

LucaCappelletti94 avatar Apr 02 '24 20:04 LucaCappelletti94

We are now down from the initial 7.875 GB to 439.6 MB! Still a lot of space at the bottom.

LucaCappelletti94 avatar Apr 05 '24 08:04 LucaCappelletti94

@compenguy if you are curious, you can take a look at the ongoing benchmarks here: https://github.com/LucaCappelletti94/ngrammatic/tree/master/benchmarks

I see benchmark times for building the corpus, but how does this affect lookups? I would generally expect corpus building to be a mostly one time during startup type of task where bandwidth matters, while lookups tend to be tied into user interaction, and would be more latency sensitive.

compenguy avatar Apr 05 '24 18:04 compenguy

I am first focusing on getting the file size down to something manageable. Also the queries are now much faster, but I still have to do a comprehensive benchmark for them. Right now I am experimenting with a webgraph data structure, we will see how far down we can get.

LucaCappelletti94 avatar Apr 05 '24 21:04 LucaCappelletti94

Hi @compenguy! The optimization is done, I think. Do check out the search time benchmarks here and the memory benchmarks here. I only need to rewrite the README describing a bit more of the new stack, and then it is ready.

LucaCappelletti94 avatar Apr 11 '24 13:04 LucaCappelletti94

README is now updated. Let me know what you think of the whole thing!

LucaCappelletti94 avatar Apr 11 '24 15:04 LucaCappelletti94

It looks like there are merge conflicts that need resolving. I haven't yet looked at the conflicts, buy the commit history shows the evolution of your thinking, with your mistakes and mistteps along the way, which tends to make resolving conflicts a lot more work.

I'm going to ask that you either rebase your work to resolve the conflicts, or rework your commit history so that each commit reflects a single logical change that as you mean for it to be in the final state. I would prefer the latter - it's much kinder not just on code reviewers, but on anyone who needs to read through the code history - but that's a lot of work and I didn't advise you of this preference earlier on.

compenguy avatar Apr 13 '24 22:04 compenguy

Apologies for the merge conflicts; I underestimated their likelihood with GitHub pull requests as it did not report conflicts there, but there were most definitely some as I started with the rebase.

The rebase is now complete. However, restructuring the commit history would be quite labour-intensive, so I've avoided it for now.

I've organized logical change commits for the initial phase. However, the optimization stage necessitated a more experimental approach. You can find detailed notes on my experiments in the benchmarks.

LucaCappelletti94 avatar Apr 14 '24 10:04 LucaCappelletti94

I merged this PR to make it easier for you to work on. There's still issues with it, including test failures:

failures:
    src/corpus_par_from.rs - corpus_par_from::Corpus<KS,NG,K,WeightedBitFieldBipartiteGraph>::par_from (line 111)
    src/lib.rs - (line 103)
    src/lib.rs - (line 158)
    src/lib.rs - (line 187)
    src/lib.rs - (line 217)
    src/lib.rs - (line 246)
    src/lib.rs - (line 280)
    src/lib.rs - (line 301)
    src/lib.rs - (line 343)
    src/lib.rs - (line 375)
    src/lib.rs - (line 471)
    src/lib.rs - (line 580)
    src/lib.rs - (line 79)

test result: FAILED. 119 passed; 13 failed; 0 ignored; 0 measured; 0 filtered out; finished in 11.39s

Once things are clean and looking good, I'll publish it to crates.io. If this goes silent for like 3 months, then I'll probably revert this.

compenguy avatar Jun 15 '24 17:06 compenguy

Ok! I will look into the issues mentioned.

LucaCappelletti94 avatar Jul 10 '24 08:07 LucaCappelletti94