Scrapyd creating new egg files in .cache every time a job is scheduled.
Hi,
I'm honestly unsure if this is expected behavior or not, but it has caused me storage problems: I literally ran out of inodes when /root/.cache/Python-Eggs had over five million files.
I finally tracked the phenomenon down, deleted everything in the directory, and then watched as another 40 files were created within the hour.
So I tried manually scheduling a job (instead of using the crontab that normally schedules my jobs) and saw an extra egg-tmp directory created when I ran:
curl http://localhost:6800/schedule.json -d project=newscrawler -d spider=abc
{"node_name": "homebase", "status": "ok", "jobid": "5852092af3aa11ebae0560a44c5fd074"}
That curl is how I schedule all my jobs, about 40 different spiders every hour, so the numbers add up, but why is a new egg being created for every job?
Which operating system are you using? Which Python version? There is an os.remove() call here; I wonder why it is not working for you @mas-4
https://github.com/scrapy/scrapyd/blob/3ed9b89e3e0a9bdaf34487e7fc3da7cf463e7250/scrapyd/runner.py#L31
@pawelmhm I'm using Arch Linux and the latest version of Python available in core (3.9.7-2): https://archlinux.org/packages/core/x86_64/python/
I haven't messed with this issue much since opening the ticket because I scheduled a cronjob to just delete the eggs every hour. I suppose I can try debugging this specific line. Blame says it was added 11 years ago, so I suppose it's not a scrapyd version issue.
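For reference, the cronjob just wipes that directory every hour; roughly this, with the path hard-coded from my setup:
import pathlib
import shutil

# delete everything under the egg cache directory (path from this setup)
cache = pathlib.Path('/root/.cache/Python-Eggs')
if cache.is_dir():
    for entry in cache.iterdir():
        if entry.is_dir():
            shutil.rmtree(entry, ignore_errors=True)
        else:
            entry.unlink(missing_ok=True)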
@mas-4 don't assume this; it might be a bug too, and maybe something bad happens here that can be fixed. I wonder what happens when you schedule a spider and something is interrupted, and whether everything is really deleted properly in all possible cases, e.g. if scrapyd crashes, the spider crashes, or something else happens.
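For example, a finally-based cleanup like the one in runner.py runs on normal exit and on exceptions, but not if the process is killed outright. A minimal standalone sketch (not scrapyd code) showing how the temp file could leak:
import os
import signal
import tempfile

# create a temp egg the same way runner.py does
fd, eggpath = tempfile.mkstemp(prefix='demo-', suffix='.egg')
os.close(fd)

try:
    # simulate a hard crash mid-job (e.g. OOM kill, power loss)
    os.kill(os.getpid(), signal.SIGKILL)
finally:
    # never reached: SIGKILL skips finally blocks, so the file stays in /tmp
    os.remove(eggpath)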
About your test:
So I tried manually scheduling a job (instead of using the crontab that normally schedules my jobs) and saw an extra egg-tmp directory created when I ran
So the directory was created, but it should only exist for the duration of the scrapy command; the "execute" there in the file is basically Scrapy's execute, and the intention is to delete the temp egg after execute is done. We need to verify whether that really happens.
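One quick way to check, assuming the paths from this thread, is to snapshot both directories before and after a job runs:
import os

def snapshot(path):
    # list a directory, tolerating it not existing yet
    return set(os.listdir(path)) if os.path.isdir(path) else set()

cache = os.path.expanduser('~/.cache/Python-Eggs')
before_tmp, before_cache = snapshot('/tmp'), snapshot(cache)

input('Schedule a job, wait for it to finish, then press Enter...')

print('left over in /tmp:', snapshot('/tmp') - before_tmp)
print('left over in cache:', snapshot(cache) - before_cache)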
@pawelmhm
So there's always the possibility that I have something misconfigured, but it seems the primary issue is that the path constructed for the egg file is wrong. I did some old-fashioned print debugging:
import os
import shutil
import sys
import tempfile
from contextlib import contextmanager

from scrapyd import get_application
from scrapyd.eggutils import activate_egg
from scrapyd.interfaces import IEggStorage

@contextmanager
def project_environment(project):
    app = get_application()
    eggstorage = app.getComponent(IEggStorage)
    eggversion = os.environ.get('SCRAPY_EGG_VERSION', None)
    version, eggfile = eggstorage.get(project, eggversion)
    print("**** Initial Egg File", eggfile, "****")
    if eggfile:
        prefix = '%s-%s-' % (project, version)
        fd, eggpath = tempfile.mkstemp(prefix=prefix, suffix='.egg')
        lf = os.fdopen(fd, 'wb')
        shutil.copyfileobj(eggfile, lf)
        lf.close()
        activate_egg(eggpath)
    else:
        eggpath = None
    try:
        assert 'scrapy.conf' not in sys.modules, "Scrapy settings already loaded"
        yield
    finally:
        print("**** Final Egg Path", eggpath, "****")
        if eggpath:
            os.remove(eggpath)
The result was this:
Nov 20 10:27:43 homebase scrapyd[1509]: 2021-11-20T10:27:43-0500 [Launcher,2540/stdout] **** Initial Egg File <_io.BufferedReader name='eggs/newscrawler/1636489985.egg'> ****
Nov 20 10:27:43 homebase scrapyd[1509]: <redacted>
Nov 20 10:27:43 homebase scrapyd[1509]: **** Final Egg Path /tmp/newscrawler-1636489985-26mq5umj.egg ****
It seems there are egg files being created in /tmp/, and it looks like they are being successfully deleted. But there are egg-tmp files being created in /root/.cache which are not. These are the primary issue, and that os.remove call doesn't affect them:
[root@homebase Python-Eggs]# ls
newscrawler-1636489985-15vrmnip.egg-tmp newscrawler-1636489985-agkffdwr.egg-tmp newscrawler-1636489985-mweuo66w.egg-tmp
newscrawler-1636489985-26mq5umj.egg-tmp newscrawler-1636489985-bpieg17l.egg-tmp newscrawler-1636489985-oir0nd4v.egg-tmp
newscrawler-1636489985-3p_8jn26.egg-tmp newscrawler-1636489985-ewbuc4vg.egg-tmp newscrawler-1636489985-sjbwp61o.egg-tmp
newscrawler-1636489985-3p9wnqy4.egg-tmp newscrawler-1636489985-g89_ad6y.egg-tmp newscrawler-1636489985-u8w8jfus.egg-tmp
newscrawler-1636489985-4ok8zye7.egg-tmp newscrawler-1636489985-gi8btc2m.egg-tmp newscrawler-1636489985-udgq5sum.egg-tmp
newscrawler-1636489985-5vn_p1sp.egg-tmp newscrawler-1636489985-jcqramkj.egg-tmp newscrawler-1636489985-vj9ktkfr.egg-tmp
newscrawler-1636489985-6n0iig6b.egg-tmp newscrawler-1636489985-jlq4f0q8.egg-tmp newscrawler-1636489985-w8pddttk.egg-tmp
newscrawler-1636489985-6p13ekig.egg-tmp newscrawler-1636489985-le4ujoyk.egg-tmp newscrawler-1636489985-xh7wkg2l.egg-tmp
newscrawler-1636489985-6xyj0gm5.egg-tmp newscrawler-1636489985-lgejr0l6.egg-tmp newscrawler-1636489985-yp7er7s4.egg-tmp
newscrawler-1636489985-8yrors7e.egg-tmp newscrawler-1636489985-mj4h8qd2.egg-tmp newscrawler-1636489985-yummnq50.egg-tmp
newscrawler-1636489985-9gjeceb1.egg-tmp newscrawler-1636489985-mw72rr9m.egg-tmp newscrawler-1636489985-zc4wpmh4.egg-tmp
[root@homebase Python-Eggs]# ls /tmp
newscrawler-1636489985-6n0iig6b.egg ssh-XXXXXXxm4aTQ tmux-1000
nvimeNJJPJ
So I think I misrepresented the problem in my first post (having misunderstood the nature of the files).
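For what it's worth, the .egg-tmp names above match the temporary eggs in /tmp one for one (compare newscrawler-1636489985-26mq5umj in both listings), which makes me suspect the pkg_resources extraction cache: when resources inside a zipped egg are used, pkg_resources unpacks them into <cache>/<egg basename>-tmp/, and since runner.py copies the egg to a freshly named tempfile on every job, every job would get its own cache entry that nothing ever deletes. A quick sketch of the mapping, assuming setuptools' pkg_resources:
import pkg_resources

# default extraction cache; /root/.cache/Python-Eggs on this box, and it can
# be redirected with the PYTHON_EGG_CACHE environment variable
print(pkg_resources.get_default_cache())

# extraction paths are keyed on the egg's basename, so each randomly named
# mkstemp copy of the same egg maps to a distinct '<name>.egg-tmp' entry
# (note: this call may create the cache directory as a side effect)
rm = pkg_resources.ResourceManager()
print(rm.get_cache_path('newscrawler-1636489985-26mq5umj.egg'))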