actual icon indicating copy to clipboard operation
actual copied to clipboard

[WIP] fix(#2562): Prevent transaction deduplication for imported transactions

Open ttlgeek opened this issue 10 months ago • 7 comments

Hello,

After looking into the issue, it seems that the transaction deduplication is due to the two transactions having the same amount and account and the second transaction happened within 7 days of the first one.

After reading the import transaction code, the fuzzy matching matching uses the 7 days method to try to match similar transactions and merge them.

Proposed solution: prevent the fuzzy matching from matching imported transactions (where import_id is not null) to prevent this behavior.

@MatissJanis what do you think ?

ttlgeek avatar Apr 16 '24 23:04 ttlgeek

Deploy Preview for actualbudget ready!

Name Link
Latest commit b54fab0f365b31865a71917aca38ccb4685e09b3
Latest deploy log https://app.netlify.com/sites/actualbudget/deploys/663b56fa5374fd000809d798
Deploy Preview https://deploy-preview-2618.demo.actualbudget.org
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

netlify[bot] avatar Apr 16 '24 23:04 netlify[bot]

Bundle Stats — desktop-client

Hey there, this message comes from a GitHub action that helps you and reviewers to understand how these changes affect the size of this project's bundle.

As this PR is updated, I'll keep you updated on how the bundle size is impacted.

Total

Files count Total bundle size % Changed
9 4.71 MB 0%

Changeset

No files were changed

View detailed bundle breakdown

Added

No assets were added

Removed

No assets were removed

Bigger

No assets were bigger

Smaller

No assets were smaller

Unchanged

Asset File Size % Changed
static/js/indexeddb-main-thread-worker-e59fee74.js 13.5 kB 0%
static/js/resize-observer.js 18.37 kB 0%
static/js/usePreviewTransactions.js 790 B 0%
static/js/BackgroundImage.js 122.29 kB 0%
static/js/AppliedFilters.js 21.16 kB 0%
static/js/narrow.js 59.81 kB 0%
static/js/wide.js 262.38 kB 0%
static/js/ReportRouter.js 1.23 MB 0%
static/js/index.js 3 MB 0%

github-actions[bot] avatar Apr 16 '24 23:04 github-actions[bot]

Bundle Stats — loot-core

Hey there, this message comes from a GitHub action that helps you and reviewers to understand how these changes affect the size of this project's bundle.

As this PR is updated, I'll keep you updated on how the bundle size is impacted.

Total

Files count Total bundle size % Changed
1 1.2 MB → 1.2 MB (+24 B) +0.00%
Changeset
File Δ Size
packages/loot-core/src/server/accounts/sync.ts 📈 +24 B (+0.11%) 22.13 kB → 22.16 kB
View detailed bundle breakdown

Added

No assets were added

Removed

No assets were removed

Bigger

Asset File Size % Changed
kcab.worker.js 1.2 MB → 1.2 MB (+24 B) +0.00%

Smaller

No assets were smaller

Unchanged

No assets were unchanged

github-actions[bot] avatar Apr 16 '24 23:04 github-actions[bot]

Hi, I took a look at this PR (context: I'm the author of #2562). I understand MatissJanis' point, and I believe the issue could be approached in a different way to avoid breaking existing use cases.

Do not completely exclude transactions that have a imported_id != NULL from the deduplication logic. Instead, skip the deduplication of two transactions only if both transactions have an imported_id set and the IDs are different.

(On a side note, I believe reconciled transactions should be excluded from deduplication altogether, instead.)

I think this would both solve the issue I reported and avoid breaking the other use case. Hopefully it makes sense.

jacopo-j avatar Apr 20 '24 13:04 jacopo-j

Do not completely exclude transactions that have a imported_id != NULL from the deduplication logic. Instead, skip the deduplication of two transactions only if both transactions have an imported_id set and the IDs are different.

I think this would work. Can't come up with any counter arguments for this at the moment.

MatissJanis avatar Apr 20 '24 17:04 MatissJanis

Instead, skip the deduplication of two transactions only if both transactions have an imported_id set and the IDs are different.

This sounds good to me. When I import from an OFX file, what I'm expecting is an option to disable deduplication heuristics, not to disable deduplication entirely.

strazto avatar May 07 '24 07:05 strazto

@ttlgeek Just wondering if you were going to continue this work as suggested by the comments above. If not I can give it a go. It would be my first contribution so would probably take longer, but got to start somewhere. Let me know what your plans are.

pmoon00 avatar May 10 '24 04:05 pmoon00

@pmoon00

@ttlgeek Just wondering if you were going to continue this work as suggested by the comments above. If not I can give it a go.

Looks like for whatever reason he's not coming back to it anytime soon. Possibly he's deployed his forked branch to his own instance and is happy with its' behaviour.

It would be my first contribution so would probably take longer, but got to start somewhere.

This PR / branch is probably still good basis for your work btw - Take the work here and then apply @MatissJanis + @jacopo-j 's comments (Should mainly be a change to the SQL query that Matiss commented on) and you shouldn't have to do too much to get approved.

strazto avatar May 16 '24 03:05 strazto

In fact, I may also have a go at this over my lunch break now

strazto avatar May 16 '24 03:05 strazto

@jacopo-j re-reading your comment

Do not completely exclude transactions that have a imported_id != NULL from the deduplication logic.

Instead, skip the deduplication of two transactions only if both transactions have an imported_id set and the IDs are different.

I think @ttlgeek 's PR already does what you're suggesting, it's just not obvious from the default diff. Expanding the context of the diff yields the following:

https://github.com/actualbudget/actual/blob/e5933ad8cde43802160f9f35237a4898b172610c/packages/loot-core/src/server/accounts/sync.ts#L380-L402

To rephrase your specification:

Do not completely exclude transactions that have a imported_id != NULL from the deduplication logic. Instead, skip the deduplication of two transactions only if both transactions have an imported_id set and the IDs are different.

if tx_in has imported_id:
  if tx_in.imported_id corresponds to existing_tx:
    dedup using IDs
    exit
    
# In the above, both transactions have imported ID, but the IDs match, so no skip

# The following logic must hold for the the rest of this flow:
# tx_in has no imported_id
# OR
# tx_in's imported_id does not match (ids are different)

if an existing_tx has no imported_id:
  dedup using heuristics
  exit

# (
# tx_in has no imported_id
# OR
# tx_in.imported_id does not match (ids are different)
#  )
# AND existing_tx has no imported_id 

This behaviour is consistent with @ttlgeek 's PR.

strazto avatar May 16 '24 04:05 strazto

I was trying to poke holes in the logic, but couldn't find any issues. Thanks for correcting me!

This morning I thought about it and had a feeling I may be overlooking something. It's definitely tricky inverting the logic between what the SQL statements are doing and the idea of "when to skip".

I think possibly there's an issue with this being the skip case:

# (
# tx_in has no imported_id
# OR
# tx_in.imported_id does not match (ids are different)
#  )
# AND existing_tx has no imported_id 

Because that includes tx_in.imported_id == NULL && existing_tx.imported_id == NULL which is NOT a skip case. Your instinct to double check is valid.

I think the change is small, however, we can just add another condition to ttlgeek's original SQL modification.

In terms of style, I think we should have linebreaks that delineate the logic more clearly in the SQL statements, and comments,

`SELECT id, is_parent, date, imported_id, payee, category, notes, reconciled FROM v_transactions
   WHERE date >= ? AND date <= ? AND amount = ? AND account = ? 
   -- Only perform it if either ID is null
   AND ( imported_id IS NULL OR ? IS NULL)`, //  tx_in.imported_id should be last parameter

strazto avatar May 17 '24 07:05 strazto

I was trying to poke holes in the logic, but couldn't find any issues. Thanks for correcting me!

This morning I thought about it and had a feeling I may be overlooking something.

It's definitely tricky inverting the logic between what the SQL statements are doing and the idea of "when to skip".

I think possibly there's an issue with this being the skip case:


# (

# tx_in has no imported_id

# OR

# tx_in.imported_id does not match (ids are different)

#  )

# AND existing_tx has no imported_id 

Because that includes tx_in.imported_id == NULL && existing_tx.imported_id == NULL which is NOT a skip case.

Your instinct to double check is valid.

I think the change is small, however, we can just add another condition to ttlgeek's original SQL modification.

In terms of style, I think we should have linebreaks that delineate the logic more clearly in the SQL statements, and comments,


`SELECT id, is_parent, date, imported_id, payee, category, notes, reconciled FROM v_transactions

   WHERE date >= ? AND date <= ? AND amount = ? AND account = ? 

   -- Only perform it if either ID is null

   AND ( imported_id IS NULL OR ? IS NULL)`, //  tx_in.imported_id should be last parameter

Would it be clearer if the AND imported_id IS NULL predicate is only included if the tx_in.imported_id is set/exists?

I'm not sure what the conventions around dynamic SQL predicates are in this project though.

pmoon00 avatar May 17 '24 08:05 pmoon00

Would it be clearer if the AND imported_id IS NULL predicate is only included if the tx_in.imported_id is set/exists?

I'm not sure what the conventions around dynamic SQL predicates are in this project though.

Different branching logic overall might be preferable and more readable, however my assumption is that dynamic SQL should be avoided. It's more likely to lead to bugs and could potentially become a vector for injection vulns (not very likely in this case).

It's true that this statement slightly violates DRY with the code branch that checks that tx_in.imported_id is null, but I don't think it's a big problem.

AND ( imported_id IS NULL OR ? IS NULL)

strazto avatar May 17 '24 08:05 strazto

Hi @MatissJanis @pmoon00 I've taken a stab at this in my own PR #2770 which bases off of @ttlgeek 's branch (though without all the master -> feature merge commits because I think those are ugly).

The only difference is the logic of the SQL statement, which now allows the dedup to occur if an existing transaction OR the input transaction are null. There's also a formatting difference in the SQL statement (I've re-ordered it and added linebreaks + comments to logically separate the fuzzy logic from the null conditioning )

strazto avatar May 17 '24 10:05 strazto

Closing in favour of https://github.com/actualbudget/actual/pull/2770

MatissJanis avatar May 29 '24 19:05 MatissJanis