
Scrapyd creating new egg files in .cache every time a job is scheduled.

Open mas-4 opened this issue 3 years ago • 4 comments

Hi,

I'm honestly unsure whether this is expected behavior or not, but it's caused me storage problems: I literally ran out of inodes when /root/.cache/Python-Eggs had over five million files.

I finally tracked this phenomenon down, deleted everything in the directory, and then watched as another 40 files were created in the last hour.

So I tried manually scheduling a job (instead of using crontab as my jobs are normally scheduled) and saw an extra egg-tmp directory created when I ran

curl http://localhost:6800/schedule.json -d project=newscrawler -d spider=abc
{"node_name": "homebase", "status": "ok", "jobid": "5852092af3aa11ebae0560a44c5fd074"}

That curl is the way I schedule all my jobs, about 40 different spiders every hour, so all of this is making sense, but why are eggs being created on every job?
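For reference, the same schedule.json POST can be sketched with Python's standard library (the host, project, and spider values are just the ones from the curl above; `schedule` is an illustrative helper name, not scrapyd's API):

```python
import urllib.parse
import urllib.request

# A stdlib sketch of the schedule.json call used above; the endpoint
# expects a form-encoded POST with "project" and "spider" fields.
def schedule(project, spider, host="http://localhost:6800"):
    data = urllib.parse.urlencode({"project": project, "spider": spider}).encode()
    req = urllib.request.Request(f"{host}/schedule.json", data=data)  # data= makes it a POST
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # JSON body with node_name, status, jobid

# schedule("newscrawler", "abc")
```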

mas-4 avatar Aug 02 '21 16:08 mas-4

Which operating system are you using? Which Python version? There is an os.remove() call here; I wonder why it is not working for you @mas-4

https://github.com/scrapy/scrapyd/blob/3ed9b89e3e0a9bdaf34487e7fc3da7cf463e7250/scrapyd/runner.py#L31

pawelmhm avatar Nov 11 '21 12:11 pawelmhm

@pawelmhm I'm using Arch Linux and the latest version of Python available in core: (3.9.7-2), https://archlinux.org/packages/core/x86_64/python/

I haven't messed with this issue much since opening the ticket, because I scheduled a cron job to just delete the eggs every hour. I suppose I can try debugging this specific line. Blame says it was added 11 years ago, so I suppose it's not a scrapyd version issue.
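For anyone in the same situation, that hourly cleanup can be sketched like this (the cache path and one-hour threshold are assumptions; point it at wherever the `.egg-tmp` entries pile up, e.g. /root/.cache/Python-Eggs):

```python
import os
import shutil
import time

# Hedged sketch of an hourly cleanup pass over the egg cache.
# cache_dir and max_age are assumptions, not scrapyd settings.
def clean_egg_cache(cache_dir, max_age=3600):
    """Remove *.egg-tmp files/directories older than max_age seconds."""
    removed = 0
    now = time.time()
    for name in os.listdir(cache_dir):
        if not name.endswith(".egg-tmp"):
            continue
        path = os.path.join(cache_dir, name)
        try:
            if now - os.path.getmtime(path) <= max_age:
                continue  # still fresh; may belong to a running job
            if os.path.isdir(path):
                shutil.rmtree(path)  # extraction dirs can hold many small files
            else:
                os.remove(path)
            removed += 1
        except OSError:
            pass  # entry disappeared or is unreadable; skip it
    return removed
```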

mas-4 avatar Nov 17 '21 18:11 mas-4

Blame says it was added 11 years ago, so I suppose it's not a scrapyd version issue.

@mas-4 don't assume this; it might be a bug too. Maybe something bad happens here and it can be fixed. I wonder what happens when you schedule a spider and something is interrupted, and whether everything will really be deleted properly in all possible cases, e.g. if scrapyd crashes, the spider crashes, or something else happens.

About your test:

So I tried manually scheduling a job (instead of using crontab as my jobs are normally scheduled) and saw an extra egg-tmp directory created when I ran

So the directory was created, but it should exist only for the duration of the scrapy command. The "execute" there in the file is basically Scrapy's execute; the intention is to delete the egg after execute is done. We need to verify that this really happens.
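The intended lifecycle can be sketched like this (`run_with_temp_egg` and `command` are illustrative names, not scrapyd's API):

```python
import os
import tempfile

# Illustrative sketch of the pattern in runner.py: the temp egg exists
# only while the wrapped command runs, and the finally clause removes it
# even if the command raises. The finally clause will NOT run if the
# process is killed outright, which is one way orphaned files could survive.
def run_with_temp_egg(egg_bytes, command):
    fd, eggpath = tempfile.mkstemp(suffix=".egg")
    with os.fdopen(fd, "wb") as f:
        f.write(egg_bytes)
    try:
        return command(eggpath)  # stands in for Scrapy's execute()
    finally:
        os.remove(eggpath)
```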

pawelmhm avatar Nov 17 '21 18:11 pawelmhm

@pawelmhm

So there's always the possibility that I have something misconfigured, but it seems the primary issue is that the path constructed for the egg file is wrong. I did some old-fashioned print debugging:

import os
import shutil
import sys
import tempfile
from contextlib import contextmanager

from scrapyd import get_application
from scrapyd.eggutils import activate_egg
from scrapyd.interfaces import IEggStorage

@contextmanager
def project_environment(project):
    app = get_application()
    eggstorage = app.getComponent(IEggStorage)
    eggversion = os.environ.get('SCRAPY_EGG_VERSION', None)
    version, eggfile = eggstorage.get(project, eggversion)
    print("**** Initial Egg File", eggfile, "****")
    if eggfile:
        prefix = '%s-%s-' % (project, version)
        fd, eggpath = tempfile.mkstemp(prefix=prefix, suffix='.egg')
        lf = os.fdopen(fd, 'wb')
        shutil.copyfileobj(eggfile, lf)
        lf.close()
        activate_egg(eggpath)
    else:
        eggpath = None
    try:
        assert 'scrapy.conf' not in sys.modules, "Scrapy settings already loaded"
        yield
    finally:
        print("**** Final Egg Path", eggpath, "****")
        if eggpath:
            os.remove(eggpath)

The result was this:

Nov 20 10:27:43 homebase scrapyd[1509]: 2021-11-20T10:27:43-0500 [Launcher,2540/stdout] **** Initial Egg File <_io.BufferedReader name='eggs/newscrawler/1636489985.egg'> ****
Nov 20 10:27:43 homebase scrapyd[1509]:         <redacted>
Nov 20 10:27:43 homebase scrapyd[1509]:         **** Final Egg Path /tmp/newscrawler-1636489985-26mq5umj.egg ****

It seems egg files are being created in /tmp/, and it looks like they are being successfully deleted. But there are also egg-tmp files being created in /root/.cache which are not. Those are the primary issue, and that os.remove call doesn't affect them:

[root@homebase Python-Eggs]# ls
newscrawler-1636489985-15vrmnip.egg-tmp  newscrawler-1636489985-agkffdwr.egg-tmp  newscrawler-1636489985-mweuo66w.egg-tmp
newscrawler-1636489985-26mq5umj.egg-tmp  newscrawler-1636489985-bpieg17l.egg-tmp  newscrawler-1636489985-oir0nd4v.egg-tmp
newscrawler-1636489985-3p_8jn26.egg-tmp  newscrawler-1636489985-ewbuc4vg.egg-tmp  newscrawler-1636489985-sjbwp61o.egg-tmp
newscrawler-1636489985-3p9wnqy4.egg-tmp  newscrawler-1636489985-g89_ad6y.egg-tmp  newscrawler-1636489985-u8w8jfus.egg-tmp
newscrawler-1636489985-4ok8zye7.egg-tmp  newscrawler-1636489985-gi8btc2m.egg-tmp  newscrawler-1636489985-udgq5sum.egg-tmp
newscrawler-1636489985-5vn_p1sp.egg-tmp  newscrawler-1636489985-jcqramkj.egg-tmp  newscrawler-1636489985-vj9ktkfr.egg-tmp
newscrawler-1636489985-6n0iig6b.egg-tmp  newscrawler-1636489985-jlq4f0q8.egg-tmp  newscrawler-1636489985-w8pddttk.egg-tmp
newscrawler-1636489985-6p13ekig.egg-tmp  newscrawler-1636489985-le4ujoyk.egg-tmp  newscrawler-1636489985-xh7wkg2l.egg-tmp
newscrawler-1636489985-6xyj0gm5.egg-tmp  newscrawler-1636489985-lgejr0l6.egg-tmp  newscrawler-1636489985-yp7er7s4.egg-tmp
newscrawler-1636489985-8yrors7e.egg-tmp  newscrawler-1636489985-mj4h8qd2.egg-tmp  newscrawler-1636489985-yummnq50.egg-tmp
newscrawler-1636489985-9gjeceb1.egg-tmp  newscrawler-1636489985-mw72rr9m.egg-tmp  newscrawler-1636489985-zc4wpmh4.egg-tmp
[root@homebase Python-Eggs]# ls /tmp
newscrawler-1636489985-6n0iig6b.egg  ssh-XXXXXXxm4aTQ                                                                   tmux-1000
nvimeNJJPJ                        

So I think I misrepresented the problem in my first post (having misunderstood the nature of the files).
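A likely lead (my aside as editor, not something established in this thread): the `.egg-tmp` entries look like setuptools' pkg_resources extraction cache, which is separate from the /tmp egg that runner.py removes. Its location can be inspected, and overridden with the PYTHON_EGG_CACHE environment variable:

```python
import pkg_resources  # provided by setuptools

# pkg_resources extracts resources from zipped eggs into its own cache
# directory; get_default_cache() reports where that is, and the
# PYTHON_EGG_CACHE environment variable overrides it.
print(pkg_resources.get_default_cache())
# prints something like /root/.cache/Python-Eggs when run as root
```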

mas-4 avatar Nov 20 '21 15:11 mas-4