pisa icon indicating copy to clipboard operation
pisa copied to clipboard

Query container

Open elshize opened this issue 5 years ago • 3 comments

This is still work in progress but you might have comments already.

Basically, I introduce QueryContainer which is data that can be read from input and written to output (more info later) and QueryRequest (names might differ at the end) that is the object that you'd pass to the processing algorithm.

The motivation is to keep track of queries and their related features. So for example, you can pass query and its threshold together, so that there's no accidental shift when passed two files (or even two files could be by accident completely unrelated, say, wrong set of thresholds, mixed from another experiment). This also allows to pipe stuff along.

The second major concern is having data from two different tasks that are not aligned. This might not be visible right away now but will become clearer when I introduce some additional algorithms and bigrams. Anyway, here's the gist: say, you have a query, a b c a. Clearly, you will not open four posting lists when you start the processing. So when you pass it to the query, it becomes a b c. But it's possible that you want to bring some additional term-related information from a different tool, be it within PISA or not. So I think it's important to have a deterministic way of, say, producing a list of term IDs. I figured it would be good to separate all query information that would be taken in and out, from the data that would be only used internally by the algorithms.

Anyway, this is my first take on it, let me know what you think, and I'll keep thinking about it as well. But this is really important for the algorithms I want to port from v1, and would also simplify and improve our command line greatly.

elshize avatar Apr 26 '20 23:04 elshize

Let me explain why we want to use a pointer in QueryContainer (the same goes for the term resolver, although to a lesser extend, as in it's not quite as important there as it is in this case). There are two major reasons, and let me tackle it one by one.

Move overhead.

Once you start putting many members in a structure, it gets quite big. Any time you move it, all of that data needs to be copied. I'm referring to the members, i.e., std::optional, not the vector data. So each time you move a query, you copy five fields. So imagine you have a vector of queries and you want to sort them. Each swap will have to move about 4 pointers, 4 sizes and one float.

And although I agree this on its own could be less of an issue considering that this is not the most important, it strengthens the following argument.

Compilation

Let me start by saying that I know that query container itself is a small object that doesn't require much time to compile. Although the argument that you need to compile it about two dozen times is still valid, this is not about that.

The real problem that we've struggled for a long time is that each small change causes recompilation of so much code that is not as fast to compile. Consider this: query is an essential part that will be used in a majority of compilation units. Each time you change something minor in the implementation, it will need to recompile most of the binaries. And yes, you still need to link it, but that's considerably faster than parsing and generating template code and everything else is so time consuming. And yes, the header still might change but it will change much less often than the implementation. So I say we should put to such hot header as query.hpp as little as we need to, which means also the definition of types we don't need to expose anywhere don't need to be in the header.

Downsides

All that said, let's talk about the actual downsides. Here's the additional boilerplate:

  • Header
    1. One line of forward declaration.
    2. std::unique_ptr field (but replacing a number of fields)
  • Cpp
    1. Additional struct QueryContainerInner { and } lines (all inside would be just in a different place).
    2. Three additional lines for copy constructor and copy assignment operator compared to = default.
    3. m_data indirection when accessing data in methods.

Note that outside of that one cpp file you never deal with the fact that there's any kind of pointer inside it. The class still has normal value semantics, move is faster, copies have exactly the same semantics, there's literally nothing tricky about using this class.

The only real downside is m_data indirection and it is in one small file only. It doesn't affect any other piece of code.

As far as "additional complexity" argument goes, I don't buy it because it just makes certain fields private. And if one wants to explore the implementation, it just goes one level down, there's no tricks here, it's just a pointer, it's not rocket science. Especially if you compare to some stuff we do with templates, this is just so damn uncomplicated.

elshize avatar May 01 '20 12:05 elshize

I thought about having threshold, and I agree that it's not ideal just having a threshold. I think we should have thresholds:

thresholds: [
    {"k": 10, "score": 5.3},
    {"k": 100, "score": 3.3},
    {"k": 1000, "score": 1.3},
]

This way, if we ask to run an algorithm that depends on a threshold for k = 20 and we don't have these thresholds, then we know to fail and what to print out as an error.

Note that just like we can pass term lexicon and parser and stemmer to produce IDs, we should be eventually be able to ask to estimate (one way or another) the thresholds if they are missing or if we want to force re-computation.

elshize avatar May 01 '20 13:05 elshize

Codecov Report

Merging #373 into master will increase coverage by 0.00%. The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #373   +/-   ##
=======================================
  Coverage   92.42%   92.42%           
=======================================
  Files          91       93    +2     
  Lines        4921     4923    +2     
=======================================
+ Hits         4548     4550    +2     
  Misses        373      373           
Impacted Files Coverage Δ
include/pisa/query.hpp 100.00% <100.00%> (ø)
include/pisa/query/query_parser.hpp 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update dde6095...8e6da84. Read the comment docs.

codecov[bot] avatar Jun 15 '20 18:06 codecov[bot]