Spurious failure when dumping a chroot
This seems very likely to be related to 2ffb09f5cc30639c0667337315667a03919c779e. Here's the trace of what we're seeing on the CI machine: https://gist.github.com/patricklaw/64ff73a57cdc9cb1b7c9
This is our internal process for running pylint over our Python codebase. It aggregates PylintSpec targets, injects the dependency spec for pylint (pylint>=1.1.0), then dumps that chroot. Because we keep our egg download cache project-local rather than user-local, there should never be two instances of pants accessing this particular file concurrently. Therefore it seems like something within pants itself might be accessing it concurrently.
Ok - so there is code that dumps a chroot, and then there is the code at the top of the trace:
File "/export/hdb3/data/appdata/jenkins/workspace/foursquare.hadoop.test/src/python/foursquare/pants/python_linter.py", line 47, in execute
self._run_pylint(target)
...
Is this code all in the same process tree? Because if so, quick searches say this may be near the root of the problem: http://trac.edgewall.org/ticket/7014
This code is all in the same process.
Ok - then that ticket's info may be apropos. Can you try exec'ing the rest after the chroot dump? I realize this may not be a one-liner, but a dirty hack for the data point would be good if it's not too hard.
This error message is indicative of importing from a zipped egg and having that egg rewritten at some point in the lifecycle of the running process. Zipimport caches zip headers, closes the filehandle and doesn't invalidate them if the underlying zip stat() changes on next reopen.
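To make that concrete, here's a minimal sketch of the behavior in question (it pokes at CPython's private zipimport._zip_directory_cache, and the egg path is hypothetical, so treat it as illustrative only):

    import zipimport

    # zipimport keeps a module-level cache of parsed zip directory headers,
    # keyed by archive path. It is filled the first time the process imports
    # from an egg/zip and is NOT invalidated if that file is later rewritten
    # on disk.
    importer = zipimport.zipimporter('/path/to/some.egg')  # hypothetical egg path

    # The cached header entry lives here for the life of the process:
    print(zipimport._zip_directory_cache.keys())

    # If the egg is rewritten underneath us, subsequent imports read offsets
    # from the stale header and blow up. The only in-process workaround is to
    # clear the private cache by hand:
    zipimport._zip_directory_cache.clear()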
Yeah - that's what the trac ticket above talks about.
John, I don't understand what you're asking me to do there. After the chroot? It's the chroot that fails. Or do you want me to run the call to chroot in another process?
My understanding is you 1.) create the chroot and then 2.) run something in it. The idea would be to do 1.) as-is, but then exec to do 2.), which will wipe out the in-memory zipimport caches @wickman and the trac bug point out as problematic when the underlying zip file changes. Since we have no repro case, I'm guessing from the description here; in the absence of a repro we can run, I'm asking you to get your hands dirty.
Okay, I'm not super familiar with chroot, so let me double check what you're asking. You want us to first create the chroot, which in our code looks like chroot = PythonChroot(target, root_dir, *args, **kwargs), and then you want us to use, e.g., multiprocessing to make the later builder = chroot.dump() call?
Something like that - yes. Again, if it's complicated to hack that test change in, it may not be worth it. But if it's 5 lines, it would be great to have a confirm.
I slapped something together, I'm going to put it up on our CI shortly to see if it reduces the frequency of failures. Can you double check my logic here so we're on the same page about what information this is providing?
import subprocess
from multiprocessing import Process, Queue
# (PEX comes from the commons python library; its import was elided in my paste.)

def build_and_run_pex(self, chroot, entry_point, args, stdout=None, stderr=None):
  q = Queue()

  def dump_chroot():
    # Dump and freeze the chroot in a child process so any zipimport caches
    # it builds up die with it, then hand the resulting cmdline back.
    builder = chroot.dump()
    builder.info.entry_point = entry_point
    builder.freeze()
    pex = PEX(builder.path())
    cmdline = pex.cmdline(args)
    q.put(cmdline)

  chroot_process = Process(target=dump_chroot)
  chroot_process.start()
  cmdline = q.get()
  chroot_process.join()

  # Launch the built pex from the parent, whose import state was never
  # touched by the dump above.
  process = subprocess.Popen(cmdline, stdout=stdout, stderr=stderr)
  return process
Seems like this did in fact either eliminate the problem or reduce its frequency. The build has been green for 4 straight runs since the change went in.
Sorry for the response delay, and thanks for experimenting. We could clearly work around this with anywhere from elaborate to insane amounts of re-implementation, but if you can unblock via simpler means and instead file a targeted issue to divorce commons/python from the bad ZipImporter behavior, that would be preferable.
Okay, I'm afraid I need to retract my statement! After going to bed, the build failed several more times with this error. Looks like it just had a lucky streak after my change.
What's the timeline on getting a reasonable solution out for this? It seems to be within @wickman's general realm of expertise. We're now at the point where rolling all the way back to a several-weeks-old pants would be difficult for us internally (we had to change a bunch of our custom plugins to account for new pants changes), but this spurious failure is hammering a couple of our builds.
I would consider getting out a solution to this very high priority. Even a clear explanation of what's going on and how I could tackle it myself would be very useful. I'm confused about why something that's already been linked into the install cache is somehow mutable. Once it's there, isn't it there permanently and immutably, regardless of whether or not an interpreter has loaded it?
Patrick and I worked offline here a bit, and the running theory is the 1-hour TTLs in the pants PythonChroot.dump call paths. There are two: one for dist crawling, one for non-exact reqs (to trigger a search for the latest publications). Patrick is going to try adding a ttl= kwarg to dump, setting it to 1 day to see if these issues go away for a day, and report back.
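Roughly the call-site change being tried; note the ttl= kwarg is the thing being added here, not an existing dump() parameter, and the exact plumbing through to the resolver is TBD:

    ONE_DAY = 24 * 60 * 60  # seconds

    # Hypothetical: thread a longer ttl through PythonChroot.dump so cached
    # dists and non-exact requirement lookups stay "fresh" for a day instead
    # of an hour, and eggs under an in-flight build don't get re-fetched and
    # rewritten mid-run.
    builder = chroot.dump(ttl=ONE_DAY)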
Evidence so far seems to indicate that this was probably the issue. We'll be much more sure in a few hours after some runs have finished. John and I also discussed a potential solution to this problem:
Early on in the bootstrap process of a given pex run, when the resolver finds a bunch of cached eggs it wants to use, it should make a temp dir and hard link all of those eggs into it. Then if they get removed or fiddled with (possibly by pants itself) it doesn't matter, since the original file will stick around until the process dies and cleans itself up. Worst case, a process shuts down uncleanly and doesn't get cleaned up. You can also label the view directories by pid and have a little hook at startup that cleans up garbage from pids that aren't running anymore.
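A rough sketch of that idea (all names and the view-root location here are hypothetical; the real version would need to hook into the resolver to get the cached egg paths):

    import errno
    import os
    import shutil

    def create_egg_view(cached_eggs, view_root='.pants.d/egg-views'):
      """Hard-link resolved eggs into a per-pid view dir so later cache churn
      cannot yank files out from under this process."""
      view_dir = os.path.join(view_root, str(os.getpid()))
      os.makedirs(view_dir)  # assumes no prior dir exists for this pid
      views = []
      for egg in cached_eggs:
        dest = os.path.join(view_dir, os.path.basename(egg))
        # A hard link keeps the original bytes alive even if the cache entry
        # is deleted or replaced (assuming replacement happens via rename
        # rather than an in-place rewrite of the same inode).
        os.link(egg, dest)
        views.append(dest)
      return views

    def reap_stale_views(view_root='.pants.d/egg-views'):
      """Startup hook: remove view dirs left behind by pids that are gone."""
      if not os.path.isdir(view_root):
        return
      for name in os.listdir(view_root):
        try:
          os.kill(int(name), 0)  # signal 0: existence check, sends nothing
        except ValueError:
          continue  # not a pid-named dir; leave it alone
        except OSError as e:
          if e.errno == errno.ESRCH:  # no such process: safe to reap
            shutil.rmtree(os.path.join(view_root, name), ignore_errors=True)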
@wickman What do you think of this solution? Is it something you'd be able to bang out pretty quickly, or otherwise advise me on how to get started?