
Fix filtering and uneven parameter distribution

Open pavelicii opened this issue 2 years ago • 11 comments

Current problems:

  1. Filtering removes valid pairs. More info in this issue.
  2. Uneven distribution of parameters. More info in this issue.
  3. When n > 2, resulting combinations are not complete.

I think I was able to fix them by tuning the scanning algorithm a bit. I can see that it could probably be improved even further, so feel free to ask any questions or request any additional changes.

I will write detailed comments in the Changes dialog.

pavelicii avatar Dec 10 '21 19:12 pavelicii

Quick update: I found one more issue. I'm fixing it in my free time and plan to commit it in the next 1-2 weeks together with requested changes.

pavelicii avatar Jan 21 '22 09:01 pavelicii

So I swapped the last 2 weights for better results.

--

Unfortunately, I wasn't able to find a way to fix an issue I found after applying all my changes. Let me try to explain the issue.

As I said in this comment, in order for the algorithm to correctly cover all pairs when filtering is applied, we need to go back to scanning if the currently chosen combination doesn't contain any new pairs. This way we either get new pairs, or brute-force all items and raise StopIteration() once we get back to i == 0.
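The backtracking idea can be sketched roughly like this (a simplified, hypothetical illustration of the control flow; all names are made up and this is not the actual allpairspy internals):

```python
def scan(items, covers_new_pairs, is_valid):
    """Depth-first search for a valid combination that covers new pairs."""
    def recurse(i, chosen):
        if i == len(items):
            # Accept the full combination only if it contributes new pairs
            return list(chosen) if covers_new_pairs(chosen) else None
        for value in items[i]:
            chosen.append(value)
            if is_valid(chosen):
                result = recurse(i + 1, chosen)
                if result is not None:
                    return result
            chosen.pop()  # backtrack and try the next value at level i
        return None  # all values at level i exhausted: go back to i - 1

    result = recurse(0, [])
    if result is None:
        # the brute force got all the way back to i == 0 with nothing left
        raise StopIteration()
    return result
```

With a filter that forbids many combinations, the `recurse(0, [])` call can end up exploring a huge portion of the search tree before giving up, which is the slowdown described below.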

The problem arises when there are too many parameters. In my testing, starting from 8 parameters with 8 values each (or more) AND a filter that restricts 1 pair, the brute force takes minutes to finish. I never actually waited for it to complete, but it is certainly more than 10 minutes, which I don't think is acceptable. For example:

parameters = [
    ("1", ["1-1", "1-2", "1-3", "1-4", "1-5", "1-6", "1-7", "1-8"]),
    ("2", ["2-1", "2-2", "2-3", "2-4", "2-5", "2-6", "2-7", "2-8"]),
    ("3", ["3-1", "3-2", "3-3", "3-4", "3-5", "3-6", "3-7", "3-8"]),
    ("4", ["4-1", "4-2", "4-3", "4-4", "4-5", "4-6", "4-7", "4-8"]),
    ("5", ["5-1", "5-2", "5-3", "5-4", "5-5", "5-6", "5-7", "5-8"]),
    ("6", ["6-1", "6-2", "6-3", "6-4", "6-5", "6-6", "6-7", "6-8"]),
    ("7", ["7-1", "7-2", "7-3", "7-4", "7-5", "7-6", "7-7", "7-8"]),
    ("8", ["8-1", "8-2", "8-3", "8-4", "8-5", "8-6", "8-7", "8-8"]),
]

rules = [
    lambda d: "1-1" == d["1"] and "2-1" == d["2"],
]

Run this with only 6 parameters and it will finish in less than a second.
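To quantify the growth: for n parameters with v values each, the number of unique pairs to cover is C(n, 2) * v^2, so the search space grows quadratically in the parameter count. A quick check in plain Python (independent of allpairspy):

```python
from math import comb

def unique_pair_count(n_params: int, n_values: int) -> int:
    # each unordered pair of parameters contributes n_values ** 2 value pairs
    return comb(n_params, 2) * n_values ** 2

print(unique_pair_count(6, 8))  # 960
print(unique_pair_count(8, 8))  # 1792
```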

--

I still believe the fix to the current AllPairs algorithm is needed, since filtering currently works completely incorrectly (it removes valid pairs from the resulting set). This PR fixes that, but introduces the issue of the algorithm running for too long on large input data. So you can either ship it with this flaw, or wait until someone finds a way to make the described case faster.

If anything is unclear, feel free to ask! I'm still very open to finishing this PR, especially if there are any ideas on how to fix the issue.

pavelicii avatar Feb 15 '22 15:02 pavelicii

Hi @thombashi, @pavelicii,

I recently stumbled across the same issue. I can confirm this patch fixes it.

For my matrix (5x26x4) it still shows overrepresentation of some values, but at least all the pairs are now covered.

Thank you very much for providing the fix. Any idea by when you are able to update PyPI?

Cheers, Jonatan

JonatanAntoni avatar Feb 22 '22 16:02 JonatanAntoni

@pavelicii Thank you for your investigation and explanations.

One possible approach is to add an option to the AllPairs class that chooses an algorithm. That way, users can select the algorithm suitable for their needs, e.g. fast (but with uneven parameter distribution) or precise (but potentially slow for large inputs).
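Such an option might look something like this (a purely hypothetical API sketch; `Mode`, `choose_scanner`, and the `mode=` parameter are illustrative names, not an existing allpairspy API):

```python
from enum import Enum

class Mode(Enum):
    FAST = "fast"        # current behaviour: quick, but uneven distribution
    PRECISE = "precise"  # fixed scanning: full pair coverage, may be slow

def choose_scanner(mode: Mode) -> str:
    # dispatch to the appropriate scanning strategy by name
    return "legacy_scan" if mode is Mode.FAST else "exhaustive_scan"

# Hypothetical usage: AllPairs(parameters, mode=Mode.PRECISE)
```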

thombashi avatar Feb 27 '22 04:02 thombashi

@thombashi Solution with fast and precise is OK, I could implement it.

However, I also came up with another solution which could potentially be better, but it would require a major version change since it would introduce breaking changes. Let me explain.

I will refer to the problematic input which I provided earlier:

parameters = [
    ("1", ["1-1", "1-2", "1-3", "1-4", "1-5", "1-6", "1-7", "1-8"]),
    ("2", ["2-1", "2-2", "2-3", "2-4", "2-5", "2-6", "2-7", "2-8"]),
    ("3", ["3-1", "3-2", "3-3", "3-4", "3-5", "3-6", "3-7", "3-8"]),
    ("4", ["4-1", "4-2", "4-3", "4-4", "4-5", "4-6", "4-7", "4-8"]),
    ("5", ["5-1", "5-2", "5-3", "5-4", "5-5", "5-6", "5-7", "5-8"]),
    ("6", ["6-1", "6-2", "6-3", "6-4", "6-5", "6-6", "6-7", "6-8"]),
    ("7", ["7-1", "7-2", "7-3", "7-4", "7-5", "7-6", "7-7", "7-8"]),
    ("8", ["8-1", "8-2", "8-3", "8-4", "8-5", "8-6", "8-7", "8-8"]),
]

rules = [
    lambda d: "1-1" == d["1"] and "2-1" == d["2"],
]

It has 1792 unique pairs. This number is stored in the max_unique_pairs_expected variable, and the algorithm constantly checks whether we have reached it. We filter out 1 pair using rules, so once the algorithm has found 1791 unique pairs, it hangs brute-forcing for the last pair. But in fact we know there is no need to brute-force for the pair we excluded.

Here's what we could do to tell the algorithm not to do so. Before generating cases, we can create a list of dictionaries of all possible pairs. This is easy to do in Python; check this SO answer. Then we can run our rules against this list to filter out unwanted pairs. Once it is filtered, we update max_unique_pairs_expected. In this case it becomes 1791, so the algorithm no longer hangs.
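The pre-filtering step described above can be sketched in plain Python. A hedged sketch: `max_unique_pairs_expected` refers to the internal counter mentioned above, and the rules here use `d.get()` because each pair dict only contains two of the parameter keys:

```python
from itertools import combinations, product

# Same shape as the problematic input: 8 named parameters, 8 values each
parameters = {str(i): [f"{i}-{j}" for j in range(1, 9)] for i in range(1, 9)}

rules = [
    lambda d: d.get("1") == "1-1" and d.get("2") == "2-1",
]

# Every possible pair, represented as a two-key dict
all_pairs = [
    dict(zip(pair, values))
    for pair in combinations(parameters, 2)
    for values in product(parameters[pair[0]], parameters[pair[1]])
]

# Drop the pairs forbidden by any rule, then update the expected count
allowed = [p for p in all_pairs if not any(rule(p) for rule in rules)]
max_unique_pairs_expected = len(allowed)

print(len(all_pairs), max_unique_pairs_expected)  # 1792 1791
```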

Why would this be a breaking change? To be able to run rules against the list of all possible pairs, parameters must be named, so they should be created as dictionaries. Currently it's possible to provide parameters without names. If we apply what I suggest, this would no longer be possible (to be honest, I don't think it should be possible at all, because it is very inconvenient to describe rules for unnamed parameters).
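To illustrate why unnamed parameters make rules inconvenient (a toy example, not the library API):

```python
# With unnamed parameters, a rule must rely on positional indices,
# which silently breaks if the parameter order changes:
rule_by_index = lambda row: row[0] == "1-1" and row[1] == "2-1"

# With named parameters, the rule reads keys and stays order-independent:
rule_by_name = lambda d: d["1"] == "1-1" and d["2"] == "2-1"

print(rule_by_index(["1-1", "2-1"]))           # True
print(rule_by_name({"1": "1-1", "2": "2-1"}))  # True
```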

Let me know what you think.

pavelicii avatar Mar 12 '22 17:03 pavelicii

@pavelicii Thank you for your ideas.

In my opinion, breaking changes are acceptable if they are needed. My concern is that applying a filter to all possible combinations might require a long processing time, especially for large inputs (like 1 billion possible combinations).

thombashi avatar Mar 20 '22 10:03 thombashi

It is a good fix and works well. When can it be merged in?

Huang-Jack avatar Nov 01 '23 08:11 Huang-Jack

@pavelicii Are there any updates on this?

thombashi avatar Nov 03 '23 02:11 thombashi

Coverage Status

coverage: 93.814% (+2.6%) from 91.192% when pulling 4a2bdfdc811593e1233a320bee57fca2fd999204 on pavelicii:fix/filtering-and-distribution into 415a2d2c17196e3ace30f4eced8d3bf69b093c7e on thombashi:master.

coveralls avatar Nov 03 '23 02:11 coveralls

@thombashi Yes, I stumbled upon this once again, and I plan to implement what I suggested in this comment, hopefully in the next 1-2 weeks. I implemented this solution in my Java port more than a year ago and it works fine, with a few remarks (I will try to cover everything in the readme). I will also try to avoid a major version change, but we'll see; I don't remember why I assumed it would be needed.

About your comment on '1 billion possible combinations': here by 'combinations' we mean n-wise possible combinations, for example possible pairs (not to be confused with the Cartesian product of all parameters). Such an amount is very hard to hit, and I don't think the algorithm was ready for it before anyway, so I believe it shouldn't be a blocker.
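The difference in scale is easy to see for the 8x8 example from this thread (plain Python):

```python
from math import comb

n_params, n_values = 8, 8

# n-wise combinations for n = 2: unordered parameter pairs times value pairs
pairwise = comb(n_params, 2) * n_values ** 2

# Cartesian product of all parameters: every full combination
cartesian = n_values ** n_params

print(pairwise)   # 1792
print(cartesian)  # 16777216
```

Reaching a billion pairwise combinations would take an input far larger than anything the current algorithm handles in reasonable time anyway.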

UPD: Sorry, I didn't have much time recently, but it is still in my TODOs.

pavelicii avatar Nov 03 '23 09:11 pavelicii

@pavelicii Thank you for your answer. I'm looking forward to your update.

@JonatanAntoni @Huang-Jack Please wait for a little while longer.

thombashi avatar Nov 05 '23 11:11 thombashi