
Matching runs forever / Webserver unresponsive

moritzj29 opened this issue 2 years ago · 21 comments

Hi,

I'm trying to import a Paypal transaction, but it reliably makes the web server unresponsive. It is a transaction containing quite a number of items. If I reduce the number of items, the web server remains responsive. Increasing the number of items included in the JSON increases the processing time enormously; it seems to scale in a highly non-linear way.

I attached the problematic JSON, personal information was removed: Paypal_test_json.txt

I tried tracking down the time-consuming method in the code and found that _get_valid_posting_matches is called over and over again. I printed the length of the matches it returns, and it fluctuates heavily, increasing up to 6000?! My beancount file only has about 200 transactions at the moment...

https://github.com/jbms/beancount-import/blob/dca340694ac60dd8ee7529cc4740b09fe729dbc2/beancount_import/matching.py#L1438

Let's see how long the processing will take, I'm afraid I will have to leave it running overnight...

I would really appreciate some hints on debugging! Unfortunately, I don't understand the code well enough yet.

As a workaround I can probably just remove the items from the JSON by hand?

moritzj29 avatar Dec 14 '21 20:12 moritzj29

Got next candidates in 3453.0893 seconds, so it took about an hour for the transaction, and the web server became responsive again afterwards. But is this normal behavior?

moritzj29 avatar Dec 15 '21 05:12 moritzj29

Some portions of the matching algorithm are unfortunately exponential in the number of postings in the transactions being considered for a given match. Due to the nature of the problem I don't think we can avoid that. However, there are certain limits in the code intended to bound the total time spent, and it sounds like some additional limits need to be added.

Can you create a test case for matching_test.py to demonstrate the problem? Ideally you could create a test case that is a bit smaller than your actual problematic case, so that matching takes e.g. 10 seconds but not 3000 seconds.

jbms avatar Dec 15 '21 05:12 jbms

Thanks for your fast reply! I will try creating a test case which takes around 10-15 s!

moritzj29 avatar Dec 15 '21 06:12 moritzj29

I'm also experiencing this issue, although with a different source.

Zburatorul avatar Jan 13 '22 19:01 Zburatorul

I tried creating a suitable test case but had no success... I still have the original Paypal transaction file, but apparently I should have also kept a copy of my beancount journal file from back then. With an empty/generic journal I cannot reproduce this.

If I stumble upon this anytime in the future, I will try again to create a robust test case. For the time being, I'm fine with closing this. Any objections?

moritzj29 avatar Mar 22 '22 20:03 moritzj29

I think this is a regression that has been introduced in the last year.

I go through all my transactions once a year and had no problems last year. This year, I opened beancount-import and even without changing the data, it freezes up as soon as I try to add a transaction.

bobobo1618 avatar Apr 01 '22 09:04 bobobo1618

These symptoms do not necessarily point to a regression, because they depend on the state of your journal and your source files.

If you rolled back your journal and sources by a year and then still experienced the unresponsiveness, that would confirm a regression.

Either way, what we need is a sample file that demonstrates the issue.

Zburatorul avatar Apr 01 '22 17:04 Zburatorul

I'm stumbling upon this issue again... I have a Paypal transaction with 42 items, and processing once again takes very long (it hasn't finished yet...). Since the problem appears to depend not only on the imported transaction but also on the complete beancount journal, I think it is quite difficult to share: it contains a lot of sensitive data, and changing it too much would probably change the behaviour...

But anyway, if someone has an idea of what to try or change in the code, I will happily help with debugging!

moritzj29 avatar May 29 '22 15:05 moritzj29

I've done a little bit of analysis on this, and it seems that new transactions with a large number of items run into a search problem across the posting_db.

Right now each match looks at each individual posting, extracts all appropriate transactions from the posting_db and creates a list of matches. Those are then yielded. This is an O(n*m) search where n is the number of postings to be matched and m is the number of transactions for the day.

I started sketching out an algorithm that would be O(m+n): build a dictionary keyed on the search parameters (mostly account and date), loop once over all transactions across all posting days, and find all matching postings in a single pass.

I just need to unravel how postings are matched to replicate a similar set of responses with way fewer loops. It will be limited by the number of transactions per day, but that's way more manageable.
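Roughly, the idea looks something like this. This is only a sketch of the one-pass lookup; the names and fields (`journal_postings`, `new_postings`, `posting.date`, `posting.account`, `posting.units.currency`) are illustrative, not the actual beancount-import data structures:

```python
from collections import defaultdict

def build_index(journal_postings):
    # Index existing journal postings once, keyed on the fields that matter
    # for matching (hypothetical key here: date, account, currency).
    index = defaultdict(list)
    for posting in journal_postings:
        key = (posting.date, posting.account, posting.units.currency)
        index[key].append(posting)
    return index

def find_candidates(new_postings, index):
    # One dictionary lookup per new posting instead of rescanning all
    # journal postings for each one: O(n + m) instead of O(n * m).
    for posting in new_postings:
        key = (posting.date, posting.account, posting.units.currency)
        yield posting, index.get(key, [])
```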

xentac avatar Aug 23 '22 21:08 xentac

Thanks! One thing that would be very helpful is if you can first create a test case to be added to matching_test.py that reproduces the problem. That will make it much easier to find and evaluate potential solutions; for example, we can use the profiler to determine where the time is being spent.

jbms avatar Aug 24 '22 00:08 jbms

I noticed that the matching_test.py fixtures don't create the same kind of posting_db as a full reconciliation. Specifically, I believe that https://github.com/jbms/beancount-import/blob/master/beancount_import/reconcile.py#L679 adds imported files to the posting_db, which could also contribute to the explosion of the search space.

I looked into the reconcile test, and it seems a lot of those files were created from the testdata/source directories, but I can't find any reference to how I would add new ones and generate new reconcile files. Suggestions on how to approach that are welcome.

xentac avatar Aug 28 '22 23:08 xentac

I was able to create a minimal repro with a single Amazon invoice (I added others for later improvement and then deleted them in the next commit): https://github.com/xentac/beancount-import/tree/repro

We try to match 12956 matchable postings and the search algorithm is slow.

xentac avatar Aug 29 '22 05:08 xentac

Thanks, that is very helpful. Note that in matching_test.py the tests do add entries to the posting_db; that is what the journal parameter is for.

jbms avatar Aug 29 '22 05:08 jbms

The vast majority of the processing time is spent within _get_matches, searching for matching amounts. I wrote a quick and dirty amount-bucketed posting search (https://github.com/xentac/beancount-import/commit/701cde5a89a977cf0e0a5f4e941045e39ab56dde). This reduced the single-source processing time from 4:40 down to 15 seconds.

I'm going to run the larger repro test and see if there are other quick ways to improve posting amount searching.
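For reference, the bucketing idea is roughly the following; this is only an illustration, not the code in the linked commit, and the bucket width and helper names are made up:

```python
from collections import defaultdict

BUCKET_SIZE = 1  # one bucket per whole currency unit (illustrative choice)

def bucket_key(amount):
    # Bucket by currency plus the integer bucket the amount falls into.
    return (amount.currency, int(amount.number // BUCKET_SIZE))

def build_amount_buckets(postings):
    buckets = defaultdict(list)
    for posting in postings:
        buckets[bucket_key(posting.units)].append(posting)
    return buckets

def candidates_near(amount, buckets):
    # A fuzzy match within +/- BUCKET_SIZE can only live in the bucket for
    # the amount itself or one of its two neighbouring buckets.
    currency, idx = bucket_key(amount)
    for i in (idx - 1, idx, idx + 1):
        yield from buckets.get((currency, i), [])
```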

xentac avatar Aug 29 '22 17:08 xentac

That's great. I think an alternative to bucketing by amount is to bucket by currency, and within each currency bucket, sort the list by amount. That way a binary search can be used to find all postings within the fuzzy match amount threshold.
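As a sketch of that alternative, under the assumption that amounts are Decimal numbers and with a placeholder threshold (`FUZZY_MATCH_AMOUNT` here is not the real constant in the code):

```python
import bisect
from collections import defaultdict
from decimal import Decimal

FUZZY_MATCH_AMOUNT = Decimal("0.01")  # placeholder threshold

def build_sorted_index(postings):
    # Group by currency, then sort each group by amount number so that a
    # range query becomes two binary searches.
    by_currency = defaultdict(list)
    for posting in postings:
        by_currency[posting.units.currency].append((posting.units.number, posting))
    index = {}
    for currency, entries in by_currency.items():
        entries.sort(key=lambda pair: pair[0])
        index[currency] = ([number for number, _ in entries], entries)
    return index

def matches_within_threshold(amount, index):
    numbers, entries = index.get(amount.currency, ([], []))
    lo = bisect.bisect_left(numbers, amount.number - FUZZY_MATCH_AMOUNT)
    hi = bisect.bisect_right(numbers, amount.number + FUZZY_MATCH_AMOUNT)
    return [posting for _, posting in entries[lo:hi]]
```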

jbms avatar Aug 29 '22 17:08 jbms

Even with a bucketed search, it still takes too long to iterate. I think any solution we come up with will have to visit each matchable posting only once, and preferably visit each posting_db entry only once as well.

My current implementation plan is to separate postings by date and currency, order them by amount number, and keep a buffer on each side containing all amounts within the fuzzy_match_amount range. Then we need two iterators, one over matchable_postings and one over the postings per day, each touching every item only once.

I'll have to solve problems around memory allocation, dynamic ranges, and when to push and pop items from buffers.

It would be nice if we could come up with an algorithm with a smaller search space, but I don't think I understand the code well enough to think that through.
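A very rough sketch of the single-pass idea: both sides sorted by amount, two indices that only move forward, and a sliding window of journal postings whose amounts fall within the fuzzy-match range of the current matchable posting. The names and threshold are illustrative only:

```python
from decimal import Decimal

FUZZY_MATCH_AMOUNT = Decimal("0.01")  # placeholder

def sliding_window_matches(matchable, journal):
    # Both inputs: lists of (amount_number, posting), pre-sorted by amount
    # and already restricted to a single date and currency.
    lo = hi = 0
    for amount, posting in matchable:
        # Drop journal postings that are now too small to ever match again.
        while lo < len(journal) and journal[lo][0] < amount - FUZZY_MATCH_AMOUNT:
            lo += 1
        if hi < lo:
            hi = lo
        # Pull in journal postings that are within range of this amount.
        while hi < len(journal) and journal[hi][0] <= amount + FUZZY_MATCH_AMOUNT:
            hi += 1
        yield posting, [p for _, p in journal[lo:hi]]
```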

xentac avatar Aug 29 '22 19:08 xentac

I now have a test that implements the iterator method. It brings the processing time of a single medium-sized invoice (12,000+ MatchablePostings) from 4:40 down to 2 seconds.

I can pretty much guarantee my test has an off-by-one error in the iterator code that also needs to be figured out.

Unfortunately, I'm running into a problem with a larger invoice (635,440 MatchablePostings) during initialization. My theory is that bisect.insort is too slow on that many elements, but I still have to verify that.
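For context, here is why insort can dominate at that scale (a generic illustration, not the actual initialization code): each insort shifts list elements, so N inserts cost O(N^2) overall, whereas appending everything and sorting once is O(N log N).

```python
import bisect

def build_with_insort(numbers):
    out = []
    for n in numbers:
        bisect.insort(out, n)  # O(len(out)) element shifts per insert
    return out

def build_then_sort(numbers):
    out = list(numbers)
    out.sort()  # single O(N log N) sort
    return out
```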

xentac avatar Aug 29 '22 23:08 xentac

The large number of matchable postings comes from the fact that we don't just match individual postings; we also match various subsets of postings, because, particularly with Amazon invoices, the invoice may show multiple credit card transactions while your bank shows them as one transaction. Perhaps for large transactions, though, we should better limit the number of subsets considered for matching.

jbms avatar Aug 31 '22 17:08 jbms

That's probably a good idea, though I don't have a good sense of how many subsets would be an appropriate number to search through. Are you suggesting using fewer than 4 postings per matchable posting for larger invoices, or some other method?

I was also considering adding an upper limit to transaction matching. I would rather have a posting in my journal that I can't process than have the UI freeze and not be able to move forward in processing invoices.

xentac avatar Aug 31 '22 17:08 xentac

I think it would make sense to start with subsets of 1, then add in subsets of 2, etc. until some (possibly configurable) limit on the number of MatchablePostings is reached, e.g. 10000 or 50000.
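A minimal sketch of that capping strategy, assuming a hypothetical helper name and an example limit taken from the numbers above (the maximum subset size of 4 is only an assumption based on the earlier question):

```python
import itertools

MAX_MATCHABLE_POSTINGS = 10000  # example limit from the discussion

def limited_posting_subsets(postings, max_subsets=MAX_MATCHABLE_POSTINGS, max_size=4):
    # Generate subsets of size 1 first, then size 2, and so on,
    # stopping as soon as the configured limit is reached.
    generated = []
    for size in range(1, max_size + 1):
        for subset in itertools.combinations(postings, size):
            if len(generated) >= max_subsets:
                return generated
            generated.append(subset)
    return generated
```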

jbms avatar Aug 31 '22 19:08 jbms

Related PR: #174. Thanks for looking into that and coming up with a solution, @xentac!

moritzj29 avatar Jan 08 '23 06:01 moritzj29