Spurious failure when dumping a chroot
This seems very likely to be related to 2ffb09f5cc30639c0667337315667a03919c779e. Here's the trace of what we're seeing on the CI machine: https://gist.github.com/patricklaw/64ff73a57cdc9cb1b7c9
This is our internal process for running pylint over our Python codebase. It aggregates PylintSpec targets, injects the dependency spec for pylint (pylint>=1.1.0), then dumps that chroot. Because we keep our egg download cache project-local rather than user-local, there should never be two instances of pants accessing this particular file concurrently. Therefore it seems like something within pants itself might be accessing it concurrently.
Ok - so there is code that dumps a chroot, and then there is the code at the top of the trace:
File "/export/hdb3/data/appdata/jenkins/workspace/foursquare.hadoop.test/src/python/foursquare/pants/python_linter.py", line 47, in execute
self._run_pylint(target)
...
Is this code all in the same process tree? Because if so, quick searches say this may be near the root of the problem: http://trac.edgewall.org/ticket/7014
This code is all in the same process.
Ok - then that ticket's info may be apropos. Can you try exec'ing the rest after the chroot dump? I realize this may not be a one-liner, but a dirty hack for the data point would be good if it's not too hard.
This error message is indicative of importing from a zipped egg and having that egg rewritten at some point in the lifecycle of the running process. Zipimport caches zip headers, closes the filehandle and doesn't invalidate them if the underlying zip stat() changes on next reopen.
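To make that concrete, here's a minimal sketch of the behavior in question (it pokes at CPython's private zipimport._zip_directory_cache, and the egg path is hypothetical, so treat it as illustrative only):

    import zipimport

    # zipimport keeps a module-level cache of parsed zip directory headers,
    # keyed by archive path. It is filled the first time the process imports
    # from an egg/zip and is NOT invalidated if that file is later rewritten
    # on disk.
    importer = zipimport.zipimporter('/path/to/some.egg')  # hypothetical egg path

    # The cached header entry lives here for the life of the process:
    print(zipimport._zip_directory_cache.keys())

    # If the egg is rewritten underneath us, subsequent imports read offsets
    # from the stale header and blow up. The only in-process workaround is to
    # clear the private cache by hand:
    zipimport._zip_directory_cache.clear()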
Yeah - that's what the trac ticket above talks about.
John, I don't understand what you're asking me to do there. After the chroot? It's the chroot that fails. Or do you want me to run the call to chroot in another process?
My understanding is you 1.) create the chroot and then 2.) run something in it. The idea would be to do 1.) as-is, but then exec to do 2.), which will wipe out the in-memory zipimport caches @wickman and the trac bug point out as problematic when the underlying zip file changes. Since we have no repro case, I'm guessing from the description here; in the absence of a repro we can run, I'm asking you to get your hands dirty.
Okay, I'm not super familiar with chroot, so let me double check what you're asking. You want us to first create the chroot, which in our code looks like chroot = PythonChroot(target, root_dir, *args, **kwargs), and then you want us to use, e.g., multiprocessing to make the later builder = chroot.dump() call?
Something like that - yes. Again, if it's complicated to hack that test change in, it may not be worth it. But if it's 5 lines, it would be great to have a confirm.
I slapped something together, I'm going to put it up on our CI shortly to see if it reduces the frequency of failures. Can you double check my logic here so we're on the same page about what information this is providing?
import subprocess
from multiprocessing import Process, Queue
# (PEX comes from the commons python library; its import was elided in my paste.)

def build_and_run_pex(self, chroot, entry_point, args, stdout=None, stderr=None):
  q = Queue()

  def dump_chroot():
    # Dump and freeze the chroot in a child process so any zipimport caches
    # it builds up die with it, then hand the resulting cmdline back.
    builder = chroot.dump()
    builder.info.entry_point = entry_point
    builder.freeze()
    pex = PEX(builder.path())
    cmdline = pex.cmdline(args)
    q.put(cmdline)

  chroot_process = Process(target=dump_chroot)
  chroot_process.start()
  cmdline = q.get()
  chroot_process.join()

  # Launch the built pex from the parent, whose import state was never
  # touched by the dump above.
  process = subprocess.Popen(cmdline, stdout=stdout, stderr=stderr)
  return process
Seems like this did in fact either eliminate the problem or reduce its frequency. The build has been green for 4 straight runs since the change went in.
Sorry for the response delay, and thanks for experimenting. We could clearly work around this with anywhere from elaborate to insane amounts of re-implementation, but if you can unblock via simpler means and instead file a targeted issue to divorce commons/python from the bad ZipImporter behavior, that would be preferable.
Okay, I'm afraid I need to retract my statement! After going to bed, the build failed several more times with this error. Looks like it just had a lucky streak after my change.
What's the timeline on getting a reasonable solution out for this? It seems to be within @wickman's general realm of expertise. We're now at the point where rolling all the way back to a several-weeks-old pants would be difficult for us internally (we had to change a bunch of our custom plugins to account for new pants changes), but this spurious failure is hammering a couple of our builds.
I would consider getting out a solution to this very high priority. Even a clear explanation of what's going on and how I could tackle it myself would be very useful. I'm confused about why something that's already been linked into the install cache is somehow mutable. Once it's there, isn't it there permanently and immutably, regardless of whether or not an interpreter has loaded it?
Patrick and I worked offline here a bit, and the running theory is the 1-hour TTLs in the pants PythonChroot.dump call paths. There are two: one for dist crawling, one for non-exact reqs (to trigger a search for the latest publications). Patrick is going to try adding a ttl= kwarg to dump, setting it to 1 day to see if these issues go away for a day, and report back.
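Roughly the call-site change being tried; note the ttl= kwarg is the thing being added here, not an existing dump() parameter, and the exact plumbing through to the resolver is TBD:

    ONE_DAY = 24 * 60 * 60  # seconds

    # Hypothetical: thread a longer ttl through PythonChroot.dump so cached
    # dists and non-exact requirement lookups stay "fresh" for a day instead
    # of an hour, and eggs under an in-flight build don't get re-fetched and
    # rewritten mid-run.
    builder = chroot.dump(ttl=ONE_DAY)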
Evidence so far seems to indicate that this was probably the issue. We'll be much more sure in a few hours after some runs have finished. John and I also discussed a potential solution to this problem:
Early on in the bootstrap process of a given pex run, when the resolver finds a bunch of cached eggs it wants to use, it should make a temp dir and hard link all of those eggs into it. Then if they get removed or fiddled with (possibly by pants itself) it doesn't matter, since the original file will stick around until the process dies and cleans itself up. Worst case, a process shuts down uncleanly and doesn't get cleaned up. You can also label the view directories by pid and have a little hook at startup that cleans up garbage from pids that aren't running anymore.
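A rough sketch of that idea (all names and the view-root location here are hypothetical; the real version would need to hook into the resolver to get the cached egg paths):

    import errno
    import os
    import shutil

    def create_egg_view(cached_eggs, view_root='.pants.d/egg-views'):
      """Hard-link resolved eggs into a per-pid view dir so later cache churn
      cannot yank files out from under this process."""
      view_dir = os.path.join(view_root, str(os.getpid()))
      os.makedirs(view_dir)  # assumes no prior dir exists for this pid
      views = []
      for egg in cached_eggs:
        dest = os.path.join(view_dir, os.path.basename(egg))
        # A hard link keeps the original bytes alive even if the cache entry
        # is deleted or replaced (assuming replacement happens via rename
        # rather than an in-place rewrite of the same inode).
        os.link(egg, dest)
        views.append(dest)
      return views

    def reap_stale_views(view_root='.pants.d/egg-views'):
      """Startup hook: remove view dirs left behind by pids that are gone."""
      if not os.path.isdir(view_root):
        return
      for name in os.listdir(view_root):
        try:
          os.kill(int(name), 0)  # signal 0: existence check, sends nothing
        except ValueError:
          continue  # not a pid-named dir; leave it alone
        except OSError as e:
          if e.errno == errno.ESRCH:  # no such process: safe to reap
            shutil.rmtree(os.path.join(view_root, name), ignore_errors=True)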
@wickman What do you think of this solution? Is it something you'd be able to bang out pretty quickly, or otherwise advise me on how to get started?