service-fabric-cli icon indicating copy to clipboard operation
service-fabric-cli copied to clipboard

Timeout on provision

Open jfalameda opened this issue 6 years ago • 8 comments

I am deploying a very large application. I get a timeout on provision, any tips?

OS: Mac OXS. Service fabric on docker, latest version.

`Command arguments: ['application', 'provision', '--application-type-build-path', 'DHCloud', '--timeout', '1000000', '--debug'] Event: Cli.PreExecute [] Event: CommandParser.OnGlobalArgumentsCreate [<function CLILogging.on_global_arguments at 0x1014632f0>, <function OutputProducer.on_global_arguments at 0x1014c96a8>, <function CLIQuery.on_global_arguments at 0x1014f8950>] Event: CommandInvoker.OnPreCommandTableCreate [] Event: CommandLoader.OnLoadArguments [] Event: CommandInvoker.OnPostCommandTableCreate [] Event: CommandInvoker.OnCommandTableLoaded [] Event: CommandInvoker.OnPreParseArgs [] Event: CommandInvoker.OnPostParseArgs [<function OutputProducer.handle_output_argument at 0x1014c9730>, <function CLIQuery.handle_query_parameter at 0x1014f89d8>] msrest.universal_http.requests : Configuring retry: max_retries=3, backoff_factor=0.8, max_backoff=90 msrest.service_client : Accept header absent and forced to application/json msrest.universal_http : Configuring redirects: allow=True, max=30 msrest.universal_http : Configuring request: timeout=100, verify=True, cert=None msrest.universal_http : Configuring proxies: '' msrest.universal_http : Evaluate proxies against ENV settings: True urllib3.connectionpool : Starting new HTTP connection (1): localhost msrest.exceptions : Error occurred in request., ReadTimeout: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100) Error occurred in request., ReadTimeout: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100) Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request six.raise_from(e, None) File "", line 2, in raise_from File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 383, in _make_request httplib_response = conn.getresponse() File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1321, in getresponse response.begin() File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 296, in begin version, status, reason = self._read_status() File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 257, in _read_status line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1") File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/socket.py", line 589, in readinto return self._sock.recv_into(b) socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 440, in send timeout=timeout File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 639, in urlopen _stacktrace=sys.exc_info()[2]) File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 333, in increment raise six.reraise(type(error), error, _stacktrace) File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 686, in reraise raise value File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 601, in urlopen chunked=chunked) File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 389, in _make_request self._raise_timeout(err=e, url=url, timeout_value=read_timeout) File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 309, in _raise_timeout raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value) urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100)

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/msrest/universal_http/requests.py", line 137, in send **kwargs) File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 508, in request resp = self.send(prep, **send_kwargs) File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 618, in send r = adapter.send(request, **kwargs) File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 521, in send raise ReadTimeout(e, request=request) requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100)

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/knack/cli.py", line 206, in invoke cmd_result = self.invocation.execute(args) File "/usr/local/lib/python3.7/site-packages/knack/invocation.py", line 188, in execute cmd_result = parsed_args.func(params) File "/usr/local/lib/python3.7/site-packages/knack/commands.py", line 105, in call return self.handler(*args, **kwargs) File "/usr/local/lib/python3.7/site-packages/knack/commands.py", line 212, in _command_handler result = op(client, **command_args) if client else op(**command_args) File "/usr/local/lib/python3.7/site-packages/sfctl/custom_app_type.py", line 117, in provision_application_type request, header_parameters, body_content_sorted) File "/usr/local/lib/python3.7/site-packages/msrest/service_client.py", line 336, in send pipeline_response = self.config.pipeline.run(request, **kwargs) File "/usr/local/lib/python3.7/site-packages/msrest/pipeline/init.py", line 197, in run return first_node.send(pipeline_request, **kwargs) # type: ignore File "/usr/local/lib/python3.7/site-packages/msrest/pipeline/init.py", line 150, in send response = self.next.send(request, **kwargs) File "/usr/local/lib/python3.7/site-packages/msrest/pipeline/requests.py", line 72, in send return self.next.send(request, **kwargs) File "/usr/local/lib/python3.7/site-packages/msrest/pipeline/requests.py", line 137, in send return self.next.send(request, **kwargs) File "/usr/local/lib/python3.7/site-packages/msrest/pipeline/init.py", line 150, in send response = self.next.send(request, **kwargs) File "/usr/local/lib/python3.7/site-packages/msrest/pipeline/requests.py", line 193, in send self.driver.send(request.http_request, **kwargs) File "/usr/local/lib/python3.7/site-packages/msrest/universal_http/requests.py", line 328, in send return super(RequestsHTTPSender, self).send(request, **requests_kwargs) File "/usr/local/lib/python3.7/site-packages/msrest/universal_http/requests.py", line 140, in send raise_with_traceback(ClientRequestError, msg, err) File "/usr/local/lib/python3.7/site-packages/msrest/exceptions.py", line 51, in raise_with_traceback raise error.with_traceback(exc_traceback) File "/usr/local/lib/python3.7/site-packages/msrest/universal_http/requests.py", line 137, in send **kwargs) File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 508, in request resp = self.send(prep, **send_kwargs) File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 618, in send r = adapter.send(request, **kwargs) File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 521, in send raise ReadTimeout(e, request=request) msrest.exceptions.ClientRequestError: Error occurred in request., ReadTimeout: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100) msrest.universal_http.requests : Configuring retry: max_retries=3, backoff_factor=0.8, max_backoff=90 Performing cluster version check msrest.universal_http : Configuring redirects: allow=True, max=30 msrest.universal_http : Configuring request: timeout=100, verify=True, cert=None msrest.universal_http : Configuring proxies: '' msrest.universal_http : Evaluate proxies against ENV settings: True urllib3.connectionpool : Starting new HTTP connection (1): localhost urllib3.connectionpool : http://localhost:19080 "GET /$/GetClusterVersion?api-version=6.4&timeout=60 HTTP/1.1" 200 23`

jfalameda avatar Jan 25 '19 13:01 jfalameda

Hi @jfalameda, sorry to hear that you're running into this issue. To help us debug, can you share the output from running this command with the --debug option? Could you also share the cluster traces and the approximate time at which you ran the provision command? The traces would be in /etc/servicefabric/FabricDataRoot. Thanks!

Christina-Kang avatar Jan 28 '19 20:01 Christina-Kang

Hi @Christina-Kang,

It seems that the provision is taking few minutes and the timeout for the call to provision is 100.

You can download the tracing from here:

https://drive.google.com/file/d/17cAPVzd6v3v7PULhgxyN2_h4SSGRIMmL/view?usp=sharing

Kind Regards, / José.

jfalameda avatar Feb 06 '19 16:02 jfalameda

Hi @jfalameda,

Thanks for your patience! It looks like the actual timeout duration is not the issue here, but rather, an issue with permissions, which holds up the operation until it eventually times out.

Issue is with certain folders not being ACLed correctly. From your traces, it looks like the source is this: /home/ClusterDeployer/ClusterData/Data/N0030/Fabric/work/Applications/__FabricSystem_App4294967295/work/Store and the destination is this: /home/ClusterDeployer/ClusterData/Data/N0030/Fabric/work/ImageBuilderProxy/

There are also paths under different "nodes", represented in this path by "N0010" instead of the "N0030" in the previous path: /home/ClusterDeployer/ClusterData/Data/N0010/Fabric/work/Applications/__FabricSystem_App4294967295/work/Store. The same is for the destination path.

Can you check all these paths (N0010, N0020, N0030) to make sure that users have the correct permissions, and if not, chmod everything under those paths to give user permissions to rwx?

You should be able to retry upload after this is done. You should not need to, but if the above doesn't work, try restarting fabric and seeing if that plus the permission changes allows provision to continue. Please let me know if this mitigates the issue for you.

This issue happens rarely, so if you were to have a cluster in azure, or on a different dev machine, this issue most likely won't reoccur.

Hope this helps! Christina

Christina-Kang avatar Feb 07 '19 20:02 Christina-Kang

Hi @Christina-Kang ,

I can see the application provisioning now. I still get the timeout error as the application takes a lot of time to provision.

It is weird that I need to fix permission, as I am using the docker based service fabric.

Thanks.

Kind Regards, / José.

jfalameda avatar Feb 07 '19 23:02 jfalameda

Hi @jfalameda,

Is it timing out with the same error / stack as posted above still? Where the timeout is 100 for read timeouts, rather than the passed in time out.

Thanks!

Christina-Kang avatar Feb 08 '19 01:02 Christina-Kang

Hi @Christina-Kang ,

Yes, the application times out the same way. I can see on SF that the application is still provisioning. The application disappears when the provisioning is done without any error, therefore I cannot publish it.

PS: After checking the checksums it shows "unprovisioning." then it disappears from the applications list.

Kind Regards, / José.

jfalameda avatar Feb 14 '19 15:02 jfalameda

Hi @Christina-Kang

Unfortunately 7.1.0 didn't resolve this issue. The timeout is not about uploading the application, but to provision it. The application takes few minutes to provision, while sfctl timeouts before it has completed. By the time it is 100% provisioned, the application unprovision itself automatically.

This is the error I get on the client, I can see the applicaiton is still provisioning on the service fabric explorer.

Error occurred in request., ReadTimeout: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100)
Traceback (most recent call last):
  File "/home/jose/.local/lib/python2.7/site-packages/knack/cli.py", line 206, in invoke
    cmd_result = self.invocation.execute(args)
  File "/home/jose/.local/lib/python2.7/site-packages/sfctl/entry.py", line 81, in execute
    return super(SFInvoker, self).execute(args)
  File "/home/jose/.local/lib/python2.7/site-packages/knack/invocation.py", line 188, in execute
    cmd_result = parsed_args.func(params)
  File "/home/jose/.local/lib/python2.7/site-packages/knack/commands.py", line 105, in __call__
    return self.handler(*args, **kwargs)
  File "/home/jose/.local/lib/python2.7/site-packages/knack/commands.py", line 212, in _command_handler
    result = op(client, **command_args) if client else op(**command_args)
  File "/home/jose/.local/lib/python2.7/site-packages/sfctl/custom_app_type.py", line 117, in provision_application_type
    request, header_parameters, body_content_sorted)
  File "/home/jose/.local/lib/python2.7/site-packages/msrest/service_client.py", line 336, in send
    pipeline_response = self.config.pipeline.run(request, **kwargs)
  File "/home/jose/.local/lib/python2.7/site-packages/msrest/pipeline/__init__.py", line 197, in run
    return first_node.send(pipeline_request, **kwargs)  # type: ignore
  File "/home/jose/.local/lib/python2.7/site-packages/msrest/pipeline/__init__.py", line 150, in send
    response = self.next.send(request, **kwargs)
  File "/home/jose/.local/lib/python2.7/site-packages/msrest/pipeline/requests.py", line 72, in send
    return self.next.send(request, **kwargs)

Kind Regards, / Jose.

jfalameda avatar Feb 22 '19 08:02 jfalameda

@jfalameda While the timeout may occur on the client side (this is a bug on our part, that the timeout can occur before the cluster times out), the cluster should continue its work with provisioning. There should be no unprovision happening, even on error, from the cluster side. If you are running the upload / provision / create process as part of a script, this may be causing the unprovisioning to happen. If this is not the case, are you available for a call with us so we can debug this faster? You can contact me at [email protected]

Thanks!

Christina-Kang avatar Feb 22 '19 22:02 Christina-Kang