Reporting blanks

Open jkmaxwell opened this issue 8 months ago • 10 comments

I believe there is an error happening but I cannot confirm it and am therefore considering it a possible bug.

Summary: I am getting empty notifications. I believe I should not be, given my parameters.

urlwatch.yaml

Relevant sections

display:
  empty-diff: false
  error: true
  new: true
  unchanged: false

Reporting via email and pushover

urls.yaml

name: "Dahlias Maks Georgia"
navigate: "https://lobaughsdahlias.com/product/maks-georgia-2020-introduction/"
browser: chromium
wait_until: load
filter:
  - css:
      selector: div.column.purchase-options-container.span-6
  - html2text
  - strip
diff_filter:
  - grep: -v 'Out of stock'
diff_tool: "diff --ignore-all-space --unified"

(I already tried with url instead of Playwright, but I'm trying to outrun the dynamic content and it seems impossible.)

Logs

test-diff-filter

xxx@xxx ~ % urlwatch --test-diff-filter 2
=== Filtered diff between state 0 and state 1 ===

=== Filtered diff between state 1 and state 2 ===

=== Filtered diff between state 2 and state 3 ===

=== Filtered diff between state 3 and state 4 ===

=== Filtered diff between state 4 and state 5 ===

=== Filtered diff between state 5 and state 6 ===

test-filter

xxx@xxx ~ % urlwatch --test-filter 2     
Maks Georgia
$16.00
Out of stock
		SKU: 107
	Category: Dahlia

--verbose

xxx@xxx ~ % urlwatch --verbose           
2023-12-02 21:18:05,236 cli INFO: turning on verbose logging mode
2023-12-02 21:18:05,243 minidb DEBUG: PRAGMA table_info(CacheEntry)
2023-12-02 21:18:05,243 main INFO: Using /Users/xxx/Library/Application Support/urlwatch/urls.yaml as URLs file
2023-12-02 21:18:05,243 main INFO: Using /Users/xxx/Library/Application Support/urlwatch/hooks.py for hooks
2023-12-02 21:18:05,243 main INFO: Using /Users/xxx/Library/Caches/urlwatch/cache.db as cache database
2023-12-02 21:18:05,246 main INFO: Found 3 jobs
2023-12-02 21:18:05,246 worker DEBUG: Processing 3 jobs (out of 3)
2023-12-02 21:18:05,247 handler INFO: Processing: <url url='https://www.mayberryblooms.com/product/mayberry-s-pink-champagne-limit-one-per-household/291?cs=true&cst=custom' name='Dahlias Mayberry' filter=[{'css': {'selector': 'div.product-inventory__wrapper'}}] max_tries=5>
2023-12-02 21:18:05,247 handler INFO: Processing: <browser navigate='https://lobaughsdahlias.com/product/maks-georgia-2020-introduction/' wait_until='load' browser='chromium' name='Dahlias Maks Georgia' filter=[{'css': {'selector': 'div.column.purchase-options-container.span-6'}}, 'html2text', 'strip'] diff_tool='diff --ignore-all-space --unified' diff_filter=[{'grep': "-v 'Out of stock'"}]>
2023-12-02 21:18:05,247 minidb DEBUG: SELECT data, timestamp, tries, etag FROM CacheEntry WHERE guid = ? ORDER BY timestamp DESC, tries DESC LIMIT ? ['67733641008777777d1b0700d290e41ff938b9f4', 1]
2023-12-02 21:18:05,247 handler INFO: Processing: <browser navigate='https://lobaughsdahlias.com/product/newaukum-honey/' wait_until='load' browser='chromium' name='Dahlias Honey' filter=[{'css': {'selector': 'div.column.purchase-options-container.span-6'}}, 'html2text', 'strip'] diff_tool='diff --ignore-all-space --unified' diff_filter=[{'grep': "-v 'Out of stock'"}]>
2023-12-02 21:18:05,249 minidb DEBUG: SELECT data, timestamp, tries, etag FROM CacheEntry WHERE guid = ? ORDER BY timestamp DESC, tries DESC LIMIT ? ['56cc3f3526d3800b442c5dc93dfd8779f1356c2a', 1]
2023-12-02 21:18:05,251 minidb DEBUG: SELECT data, timestamp, tries, etag FROM CacheEntry WHERE guid = ? ORDER BY timestamp DESC, tries DESC LIMIT ? ['5cf8173e2609fe5b1bc182edae8db9031eb440fb', 1]
2023-12-02 21:18:05,253 connectionpool DEBUG: Starting new HTTPS connection (1): www.mayberryblooms.com:443
2023-12-02 21:18:05,291 selector_events DEBUG: Using selector: KqueueSelector
2023-12-02 21:18:05,291 selector_events DEBUG: Using selector: KqueueSelector
2023-12-02 21:18:05,559 connectionpool DEBUG: https://www.mayberryblooms.com:443 "GET /product/mayberry-s-pink-champagne-limit-one-per-household/291?cs=true&cst=custom HTTP/1.1" 200 None
2023-12-02 21:18:05,572 filters INFO: Applying filter 'css', subfilter {'selector': 'div.product-inventory__wrapper'} to https://www.mayberryblooms.com/product/mayberry-s-pink-champagne-limit-one-per-household/291?cs=true&cst=custom
2023-12-02 21:18:05,574 worker DEBUG: Job finished: <url url='https://www.mayberryblooms.com/product/mayberry-s-pink-champagne-limit-one-per-household/291?cs=true&cst=custom' method='GET' name='Dahlias Mayberry' filter=[{'css': {'selector': 'div.product-inventory__wrapper'}}] max_tries=5>
2023-12-02 21:18:05,574 worker DEBUG: Using max_tries of 5 for <url url='https://www.mayberryblooms.com/product/mayberry-s-pink-champagne-limit-one-per-household/291?cs=true&cst=custom' method='GET' name='Dahlias Mayberry' filter=[{'css': {'selector': 'div.product-inventory__wrapper'}}] max_tries=5>
2023-12-02 21:18:07,182 filters INFO: Applying filter 'css', subfilter {'selector': 'div.column.purchase-options-container.span-6'} to https://lobaughsdahlias.com/product/newaukum-honey/
2023-12-02 21:18:07,185 filters INFO: Applying filter 'html2text', subfilter {} to https://lobaughsdahlias.com/product/newaukum-honey/
2023-12-02 21:18:07,185 filters INFO: Applying filter 'strip', subfilter {} to https://lobaughsdahlias.com/product/newaukum-honey/
2023-12-02 21:18:07,185 worker DEBUG: Job finished: <browser navigate='https://lobaughsdahlias.com/product/newaukum-honey/' wait_until='load' browser='chromium' name='Dahlias Honey' filter=[{'css': {'selector': 'div.column.purchase-options-container.span-6'}}, 'html2text', 'strip'] diff_tool='diff --ignore-all-space --unified' diff_filter=[{'grep': "-v 'Out of stock'"}]>
2023-12-02 21:18:07,185 worker DEBUG: Using max_tries of 0 for <browser navigate='https://lobaughsdahlias.com/product/newaukum-honey/' wait_until='load' browser='chromium' name='Dahlias Honey' filter=[{'css': {'selector': 'div.column.purchase-options-container.span-6'}}, 'html2text', 'strip'] diff_tool='diff --ignore-all-space --unified' diff_filter=[{'grep': "-v 'Out of stock'"}]>
2023-12-02 21:18:07,374 filters INFO: Applying filter 'css', subfilter {'selector': 'div.column.purchase-options-container.span-6'} to https://lobaughsdahlias.com/product/maks-georgia-2020-introduction/
2023-12-02 21:18:07,377 filters INFO: Applying filter 'html2text', subfilter {} to https://lobaughsdahlias.com/product/maks-georgia-2020-introduction/
2023-12-02 21:18:07,377 filters INFO: Applying filter 'strip', subfilter {} to https://lobaughsdahlias.com/product/maks-georgia-2020-introduction/
2023-12-02 21:18:07,377 worker DEBUG: Job finished: <browser navigate='https://lobaughsdahlias.com/product/maks-georgia-2020-introduction/' wait_until='load' browser='chromium' name='Dahlias Maks Georgia' filter=[{'css': {'selector': 'div.column.purchase-options-container.span-6'}}, 'html2text', 'strip'] diff_tool='diff --ignore-all-space --unified' diff_filter=[{'grep': "-v 'Out of stock'"}]>
2023-12-02 21:18:07,377 worker DEBUG: Using max_tries of 0 for <browser navigate='https://lobaughsdahlias.com/product/maks-georgia-2020-introduction/' wait_until='load' browser='chromium' name='Dahlias Maks Georgia' filter=[{'css': {'selector': 'div.column.purchase-options-container.span-6'}}, 'html2text', 'strip'] diff_tool='diff --ignore-all-space --unified' diff_filter=[{'grep': "-v 'Out of stock'"}]>
2023-12-02 21:18:07,377 reporters INFO: Submitting with stdout (<class 'urlwatch.reporters.StdoutReporter'>)
2023-12-02 21:18:07,377 reporters INFO: Submitting with email (<class 'urlwatch.reporters.EMailReporter'>)
2023-12-02 21:18:07,378 reporters DEBUG: Not sending e-mail (no changes)
2023-12-02 21:18:07,378 reporters INFO: Submitting with ifttt (<class 'urlwatch.reporters.IFTTTReport'>)
2023-12-02 21:18:07,378 reporters INFO: Submitting with pushover (<class 'urlwatch.reporters.PushoverReport'>)
2023-12-02 21:18:07,378 reporters DEBUG: Not sending pushover (no changes)

Problem

Despite the "no changes" messages above, I continually get positive hits, and these generate reporter actions. Why is it reporting them as new?

===========================================================================
01. NEW: Dahlias Maks Georgia
02. NEW: Dahlias Honey
===========================================================================

---------------------------------------------------------------------------
NEW: Dahlias Maks Georgia ( https://lobaughsdahlias.com/product/maks-georgia-2020-introduction/ )
---------------------------------------------------------------------------


---------------------------------------------------------------------------
NEW: Dahlias Honey ( https://lobaughsdahlias.com/product/newaukum-honey/ )
---------------------------------------------------------------------------


-- 
urlwatch 2.28, Copyright 2008-2023 Thomas Perl
Website: https://thp.io/2008/urlwatch/
Support urlwatch development: https://github.com/sponsors/thp
watched 3 URLs in 2 seconds

Thank you!

jkmaxwell avatar Dec 03 '23 05:12 jkmaxwell

Maybe it is a cache problem? There are two things confusing me:

  1. why are the reports calling these new every time when they are not new?
  2. why is the report reporting on blank information when my criteria should be preventing that?

Thanks.

jkmaxwell avatar Dec 03 '23 17:12 jkmaxwell

urlwatch stores its history in cache-related directories (see e.g. #416). If you have a program on your machine that regularly cleans cache directories, it will currently clean out the history as well.

"NEW" in that report means that there is no previous entry for it. Otherwise the report would mention "CHANGED".

However, it's weird that --test-diff-filter does show history entries being available.

After you see the "NEW" report, does --test-diff-filter still work or does it complain that there are no historic entries?
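
For context, the decision is roughly the following (a minimal sketch with illustrative names, not urlwatch's actual code):

def report_kind(old_data, new_data):
    # No previous entry in the cache for this job's guid -> report NEW
    if old_data is None:
        return 'NEW'
    # A previous entry exists and the filtered data differs -> CHANGED
    if new_data != old_data:
        return 'CHANGED'
    # Identical data is only reported if display.unchanged is enabled
    return 'UNCHANGED'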

thp avatar Dec 03 '23 20:12 thp

Thanks @thp! I don't have anything running that I know of that is cleaning the cache (no CleanMyMac). --test-diff-filter works every time, which is why this is so weird. It has never complained about any lack of historic entries. I don't understand why it both registers the job as new every time and reports historic entries.

FWIW, it seems to happen only with these https://lobaughsdahlias.com/ e-commerce URLs. I've been using urlwatch successfully on a few other URLs for a month or so, and it works as expected.

jkmaxwell avatar Dec 03 '23 23:12 jkmaxwell

@jkmaxwell how did you install urlwatch? I've been having sporadic alarms about NEW items which are not new (and there is no cache autoclean; if there were, all jobs would alert when one job is new, but in this case only a subset of the jobs alert as new).

I installed urlwatch using rye. At first I thought it was urlwatch 2.28, but I downgraded to 2.25 and the issue persists, so it must be rye's fault, as 2.25 works fine on other machines.

gaia avatar Dec 04 '23 22:12 gaia

Between 2.27 and 2.28, the backend for "navigate" jobs changed from Pyppeteer to Playwright.

In the logs you posted above, it says:

2023-12-02 21:18:07,377 reporters INFO: Submitting with stdout (<class 'urlwatch.reporters.StdoutReporter'>)
2023-12-02 21:18:07,377 reporters INFO: Submitting with email (<class 'urlwatch.reporters.EMailReporter'>)
2023-12-02 21:18:07,378 reporters DEBUG: Not sending e-mail (no changes)
2023-12-02 21:18:07,378 reporters INFO: Submitting with ifttt (<class 'urlwatch.reporters.IFTTTReport'>)
2023-12-02 21:18:07,378 reporters INFO: Submitting with pushover (<class 'urlwatch.reporters.PushoverReport'>)
2023-12-02 21:18:07,378 reporters DEBUG: Not sending pushover (no changes)

It seems like you have multiple reporters configured (stdout + ifttt appear in addition to e-mail and pushover?).

Also, according to the logs, this specific run didn't cause any e-mail or pushover notification to be sent?

Can you post (with private information replaced with "XXX"):

  1. The full urlwatch.yaml config file
  2. The full urls.yaml (you might have already posted it above?)

thp avatar Dec 09 '23 07:12 thp

@thp the issue is with rye.

Since installing it with pipx, it's been working fine.

I am new to rye, though I've been using urlwatch for years and find it extremely helpful. rye looks very promising, so maybe you could take a look over there and give Armin a hand (if you find the time)?

gaia avatar Dec 10 '23 00:12 gaia

Not sure if rye can cause run-time issues like this; maybe some direct or indirect dependencies of urlwatch are installed in the wrong version?

@jkmaxwell did you use rye to install or is your issue unrelated to rye?

thp avatar Dec 10 '23 10:12 thp

I'm encountering the same issue, where the reporter is frequently reporting blanks.

It started several days ago, when urlwatch was apparently updated from 2.28 to 2.28_4 (2.28_4 was installed through Homebrew while other dependencies were updated).

Similar to the above, --test-diff-filter shows the past several months of update history, and --test-filter shows the current state. Inspecting the urlwatch cache shows that the history is being stored without issues. The job's cache guid has also not changed and is consistent. But every few hours the reporter will fire with NEW (not CHANGES), and the report shows no additions or changes (it is blank, identical to the above).
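
One way to confirm that directly is to query the cache database with sqlite3; a small sketch, with the path taken from the verbose logs earlier in this thread and the CacheEntry columns from the minidb debug lines:

import sqlite3

# Path as shown in the verbose logs above; adjust for your machine.
conn = sqlite3.connect("/Users/xxx/Library/Caches/urlwatch/cache.db")
for guid, timestamp in conn.execute(
        "SELECT guid, timestamp FROM CacheEntry ORDER BY timestamp DESC LIMIT 10"):
    print(guid, timestamp)
conn.close()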

Inspection with --verbose looks similar, where the filter is applied, reports are made with stdout and telegram, and the debug states (no changes):

2024-01-13 15:14:58,353 reporters INFO: Submitting with stdout (<class 'urlwatch.reporters.StdoutReporter'>)
2024-01-13 15:14:58,353 reporters INFO: Submitting with telegram (<class 'urlwatch.reporters.TelegramReporter'>)
2024-01-13 15:14:58,353 reporters DEBUG: Not calling telegram API (no changes)

Uninstalling and reinstalling urlwatch hasn't made any difference, as the blank reports are still surfacing. rye is not installed on the machine.

bb6 avatar Jan 13 '24 21:01 bb6

Installed via brew here as well, not rye.

jkmaxwell avatar Jan 18 '24 06:01 jkmaxwell

So far, I have run into a single ERROR report, aside from the blank "NEW" reports. The error seems identical to the ERROR that @gaia encountered in the rye issues thread, except with brew paths instead of rye's.

Traceback (most recent call last):
  File "/usr/local/Cellar/urlwatch/2.28_4/libexec/lib/python3.12/site-packages/urlwatch/handler.py", line 106, in process
    self.load()
  File "/usr/local/Cellar/urlwatch/2.28_4/libexec/lib/python3.12/site-packages/urlwatch/handler.py", line 85, in load
    self.old_data, self.timestamp, self.tries, self.etag = self.cache_storage.load(self.job, guid)
                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Cellar/urlwatch/2.28_4/libexec/lib/python3.12/site-packages/urlwatch/storage.py", line 571, in load
    for data, timestamp, tries, etag in CacheEntry.query(self.db,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 4, got 0)
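
For what it's worth, that message means the query yielded a row with zero values to unpack into the four loop variables; a two-line illustration (not urlwatch code):

row = ()  # a malformed, empty row such as the racy query apparently yields
data, timestamp, tries, etag = row  # ValueError: not enough values to unpack (expected 4, got 0)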

The "NEW" blank reports only occur on a subset of the jobs.

After receiving the blank "NEW" reports for about a week, it seems they only occur when the computer is actively in use. If the computer is asleep while urlwatch is still running from a scheduled cron job, the blank "NEW" reports do not appear to occur.

Edit: March 4, 2024: This error is still occurring. For example, my reporter fired approximately 40 times with "CHANGES" from July 2023 to January 2024 (about seven months). From January to March, the reporter has fired approximately 585 times with blank "NEW" reports, and only 5 of those reports have actually contained changes.

bb6 avatar Jan 19 '24 04:01 bb6

Adding my sad face emoji because this also happens to me: I either get the same Python error as in bb6's comment above, or a blank NEW entry appears.

mhalano avatar Mar 04 '24 23:03 mhalano

@mhalano can you help me with a reproduction? If you can help me trigger the issue, I can try to work out why it's happening. Are you also installing with rye?

Jamstah avatar Mar 05 '24 12:03 Jamstah

No. I generated a wheel file from source and installed it with pip. I think the problem is related to Python 3.12; version 3.11 didn't have any problems. I can install using pip or the Ubuntu package manager. The complication is that I'm using the development version, which uses Python 3.12. But sure, I will help you.

mhalano avatar Mar 05 '24 12:03 mhalano

Can you share the commands you used to build and install the wheel, and the exact python version?

Jamstah avatar Mar 05 '24 12:03 Jamstah

A pip list --format freeze would be useful too.

Jamstah avatar Mar 05 '24 12:03 Jamstah

I built from master; the most recent commit was e7a4d004425bdb275201ec5af45b24ac8c8d1946. The commands I used:

python3 setup.py bdist_wheel
pip3 install --force-reinstall --user dist/urlwatch-2.28-py3-none-any.whl

My Python version is 3.12.2

And pip list --format freeze, as requested:

alabaster==0.7.12
ansible==9.3.0
ansible-core==2.16.4
anyio==4.2.0
apparmor==4.0.0a4
appdirs==1.4.4
argcomplete==3.1.4
attrs==23.2.0
Babel==2.10.3
bcrypt==3.2.2
beautifulsoup4==4.12.3
blessed==1.20.0
blinker==1.7.0
BlinkStick==1.2.0
Brlapi==0.8.5
Brotli==1.1.0
certifi==2024.2.2
cffi==1.16.0
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.6
cloud-init==24.1
cloudscraper==1.2.71
colorama==0.4.6
command-not-found==0.3
configobj==5.0.8
construct==2.10.68
construct-legacy==2.5.3
cryptography==42.0.5
cssselect==1.2.0
cupshelpers==1.0
dbus-python==1.3.2
decorator==5.1.1
defer==1.0.6
defusedxml==0.7.1
Deprecated==1.2.14
devscripts==2.23.7
diceware==0.10
distlib==0.3.8
distro==1.9.0
distro-info==1.7
docker==5.0.3
docutils==0.20.1
dput==1.39
duplicity==2.1.4
enlighten==1.12.4
evdev==1.7.0
exceptiongroup==1.2.0
fasteners==0.18
filelock==3.13.1
freetype-py==2.4.0
future==0.18.2
gpg==1.18.0
h11==0.14.0
html5lib==1.1
httpcore==1.0.2
httpie==3.2.2
httplib2==0.20.4
httpx==0.26.0
idna==3.6
imagesize==1.4.1
img2pdf==0.5.1
importlib-metadata==4.12.0
iotop==0.6
jaraco.classes==3.3.1
jeepney==0.8.0
Jinja2==3.1.2
jmespath==1.0.1
jq==1.6.0
jsonpatch==1.32
jsonpointer==2.0
jsonschema==4.10.3
keyring==24.3.0
language-selector==0.1
launchpadlib==1.11.0
lazr.restfulclient==0.14.6
lazr.uri==1.0.6
LibAppArmor==4.0.0a4
libevdev==0.5
libvirt-python==10.0.0
lockfile==0.12.2
louis==3.28.0
lxml==5.1.0
Mako==1.3.2.dev0
Markdown==3.5.2
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
meson==1.3.2
minidb==2.0.7
monotonic==1.6
more-itertools==10.2.0
multidict==6.0.4
mutagen==1.46.0
nala==0.15.1
netifaces==0.11.0
nsz==4.6.1
ntlm-auth==1.5.0
oauthlib==3.2.2
olefile==0.46
packaging==23.2
paramiko==2.12.0
pbr==5.11.1
pefile==2023.2.7
pexpect==4.9.0
pikepdf==8.7.1
pillow==10.2.0
pip==24.0
platformdirs==4.2.0
prefixed==0.7.0
protobuf==4.21.12
psutil==5.9.8
ptyprocess==0.7.0
pycairo==1.25.1
pycparser==2.21
pycryptodome==3.20.0
pycryptodomex==3.20.0
pycups==2.0.1
Pygments==2.17.2
PyGObject==3.47.0.dev0
PyJWT==2.7.0
pykerberos==1.1.14
PyNaCl==1.5.0
pyparsing==3.1.1
pyrsistent==0.20.0
pyserial==3.5
PySocks==1.7.1
pyte==0.8.0
python-apt==2.7.6
python-dateutil==2.8.2
python-debian==0.1.49+ubuntu2
python-magic==0.4.27
python-xlib==0.33
pytz==2024.1
pyudev==0.24.0
pyusb==1.2.1.post2
pywinrm==0.4.3
pywinusb==0.4.2
pyxdg==0.28
PyYAML==6.0.1
reportlab==4.1.0
requests==2.31.0
requests-ntlm==1.1.0
requests-toolbelt==1.0.0
resolvelib==1.0.1
rich==13.7.0
rlPyCairo==0.3.0
rofication==1.2.2
roman==3.3
screen-resolution-extra==0.0.0
SecretStorage==3.3.3
setproctitle==1.3.3
setuptools==68.1.2
six==1.16.0
smartypants==2.0.0
sniffio==1.3.0
snowballstemmer==2.2.0
socksio==1.0.0
soupsieve==2.5
Sphinx==7.2.6
sphinx-rtd-theme==2.0.0
sphinxcontrib-jquery==4.1
ssh-import-id==5.11
stevedore==5.1.0
systemd-python==235
terminator==2.1.3
toml==0.10.2
tomli==2.0.1
typer==0.9.0
typing_extensions==4.9.0
typogrify==2.0.7
ubuntu-drivers-common==0.0.0
ubuntu-pro-client==8001
ufw==0.36.2
unattended-upgrades==0.1
unidiff==0.7.3
urllib3==2.2.1
urlwatch==2.28
usb-creator==0.3.16
virtualenv==20.25.0+ds
virtualenv-clone==0.5.7
virtualenvwrapper==4.8.4
wadllib==1.3.6
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.7.0
websockets==10.4
wheel==0.42.0
wrapt==1.15.0
xdg==5
xkit==0.0.0
xmltodict==0.13.0
yt-dlp==2023.12.30
zipp==1.0.0
zstandard==0.22.0

mhalano avatar Mar 05 '24 12:03 mhalano

Thanks. I managed to make it fail and have been having a play.

I expect this is a race in the cache. If I switch MAX_WORKERS in worker.py to 1, I don't get the errors any more. It would be good to hear if the same applies for you, but I'm assuming there's a DB lock somewhere that's causing this.
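
For anyone who wants to try the same experiment, the change is a one-line edit to urlwatch's worker.py (a diagnostic workaround, not a fix; the default value of 10 is per the report further down):

MAX_WORKERS = 1  # default is 10; serializing jobs sidesteps the suspected cache race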

Jamstah avatar Mar 05 '24 13:03 Jamstah

@thp I'd be interested in your thoughts on this one. I'm not sure where to start looking, especially as minidb seems to be using locks on both read and write already. I might try moving the job_state load call out of the parallelism.

I'm fairly sure it's the minidb cache, because I also tried switching to the old file-based caching mechanism, and that didn't exhibit the same problems. That one does have a different relationship with history, though.

Jamstah avatar Mar 05 '24 14:03 Jamstah

Yes, moving the load to not attempt to be parallel fixes the bug for me: https://github.com/thp/urlwatch/compare/master...Jamstah:threading?expand=1

It's a workaround though; minidb should be doing the locking already, unless RLock is doing something odd with the threading around Python vs. system threads.

My local system python is currently 3.12.1 and it doesn't exhibit the symptoms. To show the bug I ran in a container with 3.12.2. I'll update my system to 3.12.2 and see if that's the underlying change that exposes the bug too.

Jamstah avatar Mar 05 '24 15:03 Jamstah

Yes, I upgraded my system to Python 3.12.2 and the bug appeared, so it's definitely a change in 3.12.2 that exposes the bug.

I don't see anything obvious in the changelog, though: https://docs.python.org/3/whatsnew/changelog.html#python-3-12-2-final

Jamstah avatar Mar 05 '24 15:03 Jamstah

Added a test that runs static command jobs 100 times, which flags the error reliably on Python 3.12.2, so we can add it along with whatever solution we decide to go with: https://github.com/thp/urlwatch/compare/master...Jamstah:threading?expand=1
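
For anyone who wants to reproduce this without checking out that branch, a rough standalone sketch in the same spirit (it shells out to the urlwatch CLI instead of using the internal test machinery; the job list and paths are made up):

import pathlib
import subprocess
import tempfile

def main():
    tmp = pathlib.Path(tempfile.mkdtemp())
    # Ten static shell-command jobs so the default worker pool runs them in parallel.
    (tmp / "urls.yaml").write_text(
        "\n---\n".join(f'name: "job{i}"\ncommand: "echo static"' for i in range(10)))
    for run in range(100):
        out = subprocess.run(
            ["urlwatch", "--urls", str(tmp / "urls.yaml"),
             "--cache", str(tmp / "cache.db")],
            capture_output=True, text=True).stdout
        # After the first run nothing changes, so any NEW or ERROR is the bug.
        if run > 0 and ("NEW:" in out or "ERROR:" in out):
            raise AssertionError(f"run {run} produced an unexpected report:\n{out}")

if __name__ == "__main__":
    main()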

No PR yet because I'm not happy with the solution.

Jamstah avatar Mar 06 '24 10:03 Jamstah

24 hours after switching MAX_WORKERS in worker.py from 10 to 1, there has not been a "blank NEW" report, nor have there been any error messages.

bb6 avatar Mar 06 '24 13:03 bb6

Couldn't this approach negatively affect execution speed? I think 10 workers running in parallel is better than serializing everything to a single worker, but this is an educated guess; I don't know if this is how urlwatch works.

mhalano avatar Mar 06 '24 19:03 mhalano

Yes, that will slow down execution. That change was only to check whether it was a race condition; it isn't how I would solve the problem.

One option is here, which just moves loading the cache to be serialized (which it effectively already was, because the database access is serialized), then checks the URLs in parallel: https://github.com/thp/urlwatch/compare/master...Jamstah:threading?expand=1
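
Very roughly, the shape of that option (a sketch of the direction, not the linked diff; JobState is the handler.py class visible in the traceback above, and its constructor signature is assumed):

from concurrent.futures import ThreadPoolExecutor

from urlwatch.handler import JobState

def run_jobs(jobs, cache_storage, max_workers=10):
    # Load every job's previous state serially, before any worker threads
    # start, so the minidb cache sees only one reader during this phase.
    job_states = [JobState(cache_storage, job) for job in jobs]
    for job_state in job_states:
        job_state.load()
    # Only the network-bound retrieval and filtering run in the pool.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(lambda state: state.process(), job_states))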

I'd like to wait for @thp to take a look though, because he wrote both urlwatch and the database layer, so may have some more thoughts.

Jamstah avatar Mar 06 '24 21:03 Jamstah

Yes, loading the cache doesn't need to be parallel, and loading it upfront seems like a great idea.

thp avatar Mar 12 '24 04:03 thp

Thanks, have opened the PR.

Jamstah avatar Mar 12 '24 10:03 Jamstah

Also, @Jamstah thanks for the test case for Python 3.12, I think this fixes it in minidb: https://github.com/thp/minidb/commit/433ae34280e38986078062b83f8dedefb6cf6a36
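
Without digging into that commit: the usual shape of such a fix is to materialize the query results while holding the lock, instead of lazily yielding from a cursor that a concurrent writer can disturb (a guess at the pattern, not the actual minidb change):

import threading

_lock = threading.RLock()

def query(db, sql, params=()):
    # Fetch all rows before releasing the lock, so another thread's write
    # can't leave the shared cursor mid-iteration yielding malformed rows.
    with _lock:
        rows = db.execute(sql, params).fetchall()
    return iter(rows)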

thp avatar Apr 24 '24 18:04 thp

minidb 2.0.8 should contain the fix.

thp avatar Apr 24 '24 19:04 thp