pyramid_swagger

Error caused due to thread safety issues

achuth-amplify opened this issue 8 years ago • 9 comments

This error is caused by thread safety issues when the deref method is used to validate the swagger spec.

[Wed Feb 08 10:09:26 2017] [error] Traceback (most recent call last):
[Wed Feb 08 10:09:26 2017] [error]   File "/opt/wgen/popeye/lib/python/pyramid-1.6.5-py2.7.egg/pyramid/tweens.py", line 20, in excview_tween
[Wed Feb 08 10:09:26 2017] [error]     response = handler(request)
[Wed Feb 08 10:09:26 2017] [error]   File "/opt/wgen/popeye/lib/python/pyramid_swagger-2.3.0-py2.7.egg/pyramid_swagger/tween.py", line 184, in validator_tween
[Wed Feb 08 10:09:26 2017] [error]     swagger_handler.handle_response(response, op_or_validators_map)
[Wed Feb 08 10:09:26 2017] [error]   File "/opt/wgen/popeye/lib/python/pyramid_swagger-2.3.0-py2.7.egg/pyramid_swagger/tween.py", line 459, in _validate
[Wed Feb 08 10:09:26 2017] [error]     return f(*args, **kwargs)
[Wed Feb 08 10:09:26 2017] [error]   File "/opt/wgen/popeye/lib/python/pyramid_swagger-2.3.0-py2.7.egg/pyramid_swagger/tween.py", line 584, in swaggerize_response
[Wed Feb 08 10:09:26 2017] [error]     response_spec, op, PyramidSwaggerResponse(response))
[Wed Feb 08 10:09:26 2017] [error]   File "/opt/wgen/popeye/lib/python/bravado_core-4.6.0-py2.7.egg/bravado_core/response.py", line 159, in validate_response
[Wed Feb 08 10:09:26 2017] [error]     validate_response_body(op, response_spec, response)
[Wed Feb 08 10:09:26 2017] [error]   File "/opt/wgen/popeye/lib/python/bravado_core-4.6.0-py2.7.egg/bravado_core/response.py", line 176, in validate_response_body
[Wed Feb 08 10:09:26 2017] [error]     response_body_spec = deref(response_spec.get('schema'))
[Wed Feb 08 10:09:26 2017] [error]   File "/opt/wgen/popeye/lib/python/bravado_core-4.6.0-py2.7.egg/bravado_core/spec.py", line 202, in deref
[Wed Feb 08 10:09:26 2017] [error]     _, target = self.resolver.resolve(ref_dict['$ref'])
[Wed Feb 08 10:09:26 2017] [error]   File "/opt/wgen/popeye/lib/python/jsonschema-2.5.1-py2.7.egg/jsonschema/validators.py", line 335, in resolve
[Wed Feb 08 10:09:26 2017] [error]     url = self._urljoin_cache(self.resolution_scope, ref)
[Wed Feb 08 10:09:26 2017] [error]   File "/opt/wgen/popeye/lib/python/jsonschema-2.5.1-py2.7.egg/jsonschema/validators.py", line 302, in resolution_scope
[Wed Feb 08 10:09:26 2017] [error]     return self._scopes_stack[-1]
[Wed Feb 08 10:09:26 2017] [error] IndexError: list index out of range

achuth-amplify avatar Feb 13 '17 20:02 achuth-amplify

That seems...unpleasant. When you say "thread safety", do you mean we're trying to access a URL-based ref before we've actually fetched it? Is it a racy issue or consistently reproducible? I feel like I don't quite follow the whole issue as well as you do.

striglia avatar Feb 13 '17 21:02 striglia

It is a race issue.

I'm pretty sure the issue is caused by thread-unsafe code in bravado-core - it seems to do some fancy things with resolution scopes in its use of jsonschema (https://github.com/Julian/jsonschema/blob/master/jsonschema/validators.py#L367 - note the use of generators). I wasn't able to find specific documentation on concurrency support in bravado or jsonschema to confirm, so I'm thinking they aren't thread-safe :/.

achuth-amplify avatar Feb 13 '17 21:02 achuth-amplify
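To make the hazard described above concrete, here is a small deterministic toy; the class and names are invented for illustration and are not jsonschema's actual code. It shows why one shared scope stack combined with push-on-entry/pop-in-`finally` context managers is fragile under threads: when two callers interleave on the same object, one caller's pop removes the other caller's frame, so later refs resolve against the wrong scope.

```python
import contextlib


class SharedScopeStack(object):
    """Toy stand-in for a resolver that keeps one scope stack per instance."""

    def __init__(self, base_uri):
        self._scopes_stack = [base_uri]

    @property
    def resolution_scope(self):
        # Mirrors the property that raised IndexError in the traceback above.
        return self._scopes_stack[-1]

    @contextlib.contextmanager
    def resolving(self, scope):
        # Push on entry, pop in a finally block -- the generator-based pattern
        # being discussed.
        self._scopes_stack.append(scope)
        try:
            yield
        finally:
            self._scopes_stack.pop()


stack = SharedScopeStack("file:///swagger.json")

# Simulate two request threads interleaving on the same shared object.
a = stack.resolving("http://host/a.json"); a.__enter__()
b = stack.resolving("http://host/b.json"); b.__enter__()
a.__exit__(None, None, None)            # pops b's frame, not a's
print(stack.resolution_scope)           # caller B now resolves against a.json
b.__exit__(None, None, None)            # pops a's leftover frame
```

This particular interleaving only corrupts the scope rather than emptying the stack; the IndexError in the traceback additionally requires one pop too many, which is what the comments below speculate about.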

To flesh out the background a little more: this is an issue we never encounter in local development, but it shows up in deployment after a certain amount of traffic at a certain level. We thought it was caused by some code I wrote that dug information out of the spec at unexpected times, and when we took that out we thought it was fixed (the errors didn't show up within the expected time period), but it turned out we had only extended the time it took for the error to appear. This supports the notion that it's a somewhat hard-to-reach race condition (though not that hard to reach, since this service isn't very high traffic in the grand scheme of things).

After we first see this error, it's unrecoverable without restarting the server (apache 2.2/mod_wsgi), which supports the notion that there's a shared object that gets into a bad state and stays that way more or less permanently.

bwarfield-amplify avatar Mar 10 '17 22:03 bwarfield-amplify

Curious... the mere use of the resolving generator looks reasonable to me, and I don't see how we could end up with nothing on the _scopes_stack variable. Perhaps pushing a scope can somehow fail and get caught, resulting in a later pop_scope that fails with this IndexError?

Can you describe where these errors occur in the lifecycle of a single server? You describe an "expected time period" -- are you only seeing errors ~immediately after starting the server? Or does the issue eventually reproduce due to regular recycling of apache workers? And does it happen with various different specs, or one in particular?

A minimal repro setup would be amazing, as that feels required to tell how we end up with an empty stack while we're trying to resolve something. The init and other code looks pretty clear about always ensuring a stack size of at least 1, but I'm no jsonschema expert.

striglia avatar Mar 10 '17 23:03 striglia

Ah! This seems likely to be quite relevant: https://github.com/Julian/jsonschema/issues/223. Quoted:

It's not explicitly a goal to be threadsafe, generally I'd recommend just creating a different validator for each thread, but having _scope_stack not be scoped to each call might be a thing worth fixing for cleanliness reasons if we can come up with a decent way to do so.

I'm not sure I quite mentally see the end-to-end repro here yet, but between your bug report and this issue, it certainly seems likely this is related. Seems like one way or another, we're managing to pop the stack once too far, and the next attempt to resolve a ref will hit this error as it accesses self.resolution_scope.

striglia avatar Mar 10 '17 23:03 striglia
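For what it's worth, the "pop the stack once too far" hypothesis is easy to demonstrate directly against jsonschema. The sketch below assumes a release in the 2.5 line, where pop_scope simply pops the underlying list (later releases raise RefResolutionError when popping an empty stack, but the resolution_scope property behaves the same). A single unbalanced pop on a shared resolver discards the base scope, and every subsequent resolve then fails with exactly the IndexError from the traceback until the process is restarted, matching the "unrecoverable without restarting the server" observation above.

```python
from jsonschema import RefResolver

resolver = RefResolver.from_schema({"definitions": {"Pet": {"type": "object"}}})

# Balanced push/pop is fine: we end up back at the base scope.
resolver.push_scope("http://example.com/spec.json")
resolver.pop_scope()
print(resolver.resolution_scope)

# One extra pop (e.g. a finally block firing after another thread has already
# popped "its" frame) discards the base scope for good...
resolver.pop_scope()

# ...and from now on every resolution attempt dies the way the tween did.
resolver.resolve("#/definitions/Pet")   # IndexError: list index out of range
```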

So... does this mean no one is using this library on a production system?! Or is there a workaround to allow it to run on a multithreaded uwsgi server? After implementing this we did a little bit of testing and found that it was unusable for anything except sequential calls to the API. The only other option we have found is running uwsgi as a single worker/thread.

p4ul avatar Oct 02 '17 03:10 p4ul
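The jsonschema maintainer's recommendation quoted above ("creating a different validator for each thread") amounts to keeping any resolver-backed object thread-local instead of shared. pyramid_swagger does not expose a hook for that in its tween, so the snippet below only illustrates the general idea; get_validator is a hypothetical helper, not part of any of these libraries.

```python
import threading

from jsonschema import Draft4Validator

_local = threading.local()


def get_validator(schema):
    # Hypothetical helper: lazily build one validator (and therefore one
    # RefResolver and one scope stack) per thread, so nothing is shared.
    if not hasattr(_local, "validator"):
        _local.validator = Draft4Validator(schema)
    return _local.validator


# Usage from any request thread:
#   get_validator(swagger_spec_dict).validate(response_body)
```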

@p4ul People are using it on production systems - just probably not using multiple threads. Typically people use multiple processes to get around the CPython GIL.

sjaensch avatar Oct 02 '17 09:10 sjaensch
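Concretely, the multi-process setup described here usually comes down to uwsgi configuration along these lines; this is only a sketch, and the module path and worker count are placeholders:

```ini
[uwsgi]
# One spec/resolver per worker process; no threads sharing it.
module = myapp.wsgi:application
master = true
processes = 4
threads = 1
```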

Yeah, we've been using it at decent scale (100+ qps) at Yelp and I'm not aware of any issues. We use uwsgi across different processes, so we aren't exposed to any thread-safety issues around jsonschema + web workers.

I'm not super sure how to properly instrument a pyramid tween to only run post-fork, which seems like the way you'd generally want to solve this. If you're using uwsgi, there's the postfork decorator for running arbitrary code post-fork, but I don't immediately know of any fix we can do after this shared object is initialized. I suppose we could rebuild it from scratch, though that seems a touch wasteful...

striglia avatar Oct 02 '17 17:10 striglia
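For reference, the uwsgi hook mentioned above is used like this; only the postfork decorator is real uwsgi API, while rebuild_swagger_state is a hypothetical placeholder, since (as noted) it is not obvious what per-process rebuilding would actually have to do for pyramid_swagger's already-initialized tween state.

```python
# Runs once in every uwsgi worker process, right after fork.
from uwsgidecorators import postfork


@postfork
def rebuild_swagger_state():
    # Hypothetical: re-create any state you do not want shared pre-fork,
    # e.g. re-ingest the spec so each process gets its own resolver.
    pass
```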

@sjaensch @striglia Thanks! We will try running it using different processes.

p4ul avatar Oct 02 '17 19:10 p4ul