service-fabric-cli
service-fabric-cli copied to clipboard
Timeout on provision
I am deploying a very large application. I get a timeout on provision, any tips?
OS: Mac OXS. Service fabric on docker, latest version.
`Command arguments: ['application', 'provision', '--application-type-build-path', 'DHCloud', '--timeout', '1000000', '--debug']
Event: Cli.PreExecute []
Event: CommandParser.OnGlobalArgumentsCreate [<function CLILogging.on_global_arguments at 0x1014632f0>, <function OutputProducer.on_global_arguments at 0x1014c96a8>, <function CLIQuery.on_global_arguments at 0x1014f8950>]
Event: CommandInvoker.OnPreCommandTableCreate []
Event: CommandLoader.OnLoadArguments []
Event: CommandInvoker.OnPostCommandTableCreate []
Event: CommandInvoker.OnCommandTableLoaded []
Event: CommandInvoker.OnPreParseArgs []
Event: CommandInvoker.OnPostParseArgs [<function OutputProducer.handle_output_argument at 0x1014c9730>, <function CLIQuery.handle_query_parameter at 0x1014f89d8>]
msrest.universal_http.requests : Configuring retry: max_retries=3, backoff_factor=0.8, max_backoff=90
msrest.service_client : Accept header absent and forced to application/json
msrest.universal_http : Configuring redirects: allow=True, max=30
msrest.universal_http : Configuring request: timeout=100, verify=True, cert=None
msrest.universal_http : Configuring proxies: ''
msrest.universal_http : Evaluate proxies against ENV settings: True
urllib3.connectionpool : Starting new HTTP connection (1): localhost
msrest.exceptions : Error occurred in request., ReadTimeout: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100)
Error occurred in request., ReadTimeout: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100)
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
six.raise_from(e, None)
File "
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 440, in send timeout=timeout File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 639, in urlopen _stacktrace=sys.exc_info()[2]) File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 333, in increment raise six.reraise(type(error), error, _stacktrace) File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 686, in reraise raise value File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 601, in urlopen chunked=chunked) File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 389, in _make_request self._raise_timeout(err=e, url=url, timeout_value=read_timeout) File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 309, in _raise_timeout raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value) urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100)
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/msrest/universal_http/requests.py", line 137, in send **kwargs) File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 508, in request resp = self.send(prep, **send_kwargs) File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 618, in send r = adapter.send(request, **kwargs) File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 521, in send raise ReadTimeout(e, request=request) requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100)
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/knack/cli.py", line 206, in invoke cmd_result = self.invocation.execute(args) File "/usr/local/lib/python3.7/site-packages/knack/invocation.py", line 188, in execute cmd_result = parsed_args.func(params) File "/usr/local/lib/python3.7/site-packages/knack/commands.py", line 105, in call return self.handler(*args, **kwargs) File "/usr/local/lib/python3.7/site-packages/knack/commands.py", line 212, in _command_handler result = op(client, **command_args) if client else op(**command_args) File "/usr/local/lib/python3.7/site-packages/sfctl/custom_app_type.py", line 117, in provision_application_type request, header_parameters, body_content_sorted) File "/usr/local/lib/python3.7/site-packages/msrest/service_client.py", line 336, in send pipeline_response = self.config.pipeline.run(request, **kwargs) File "/usr/local/lib/python3.7/site-packages/msrest/pipeline/init.py", line 197, in run return first_node.send(pipeline_request, **kwargs) # type: ignore File "/usr/local/lib/python3.7/site-packages/msrest/pipeline/init.py", line 150, in send response = self.next.send(request, **kwargs) File "/usr/local/lib/python3.7/site-packages/msrest/pipeline/requests.py", line 72, in send return self.next.send(request, **kwargs) File "/usr/local/lib/python3.7/site-packages/msrest/pipeline/requests.py", line 137, in send return self.next.send(request, **kwargs) File "/usr/local/lib/python3.7/site-packages/msrest/pipeline/init.py", line 150, in send response = self.next.send(request, **kwargs) File "/usr/local/lib/python3.7/site-packages/msrest/pipeline/requests.py", line 193, in send self.driver.send(request.http_request, **kwargs) File "/usr/local/lib/python3.7/site-packages/msrest/universal_http/requests.py", line 328, in send return super(RequestsHTTPSender, self).send(request, **requests_kwargs) File "/usr/local/lib/python3.7/site-packages/msrest/universal_http/requests.py", line 140, in send raise_with_traceback(ClientRequestError, msg, err) File "/usr/local/lib/python3.7/site-packages/msrest/exceptions.py", line 51, in raise_with_traceback raise error.with_traceback(exc_traceback) File "/usr/local/lib/python3.7/site-packages/msrest/universal_http/requests.py", line 137, in send **kwargs) File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 508, in request resp = self.send(prep, **send_kwargs) File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 618, in send r = adapter.send(request, **kwargs) File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 521, in send raise ReadTimeout(e, request=request) msrest.exceptions.ClientRequestError: Error occurred in request., ReadTimeout: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100) msrest.universal_http.requests : Configuring retry: max_retries=3, backoff_factor=0.8, max_backoff=90 Performing cluster version check msrest.universal_http : Configuring redirects: allow=True, max=30 msrest.universal_http : Configuring request: timeout=100, verify=True, cert=None msrest.universal_http : Configuring proxies: '' msrest.universal_http : Evaluate proxies against ENV settings: True urllib3.connectionpool : Starting new HTTP connection (1): localhost urllib3.connectionpool : http://localhost:19080 "GET /$/GetClusterVersion?api-version=6.4&timeout=60 HTTP/1.1" 200 23`
Hi @jfalameda, sorry to hear that you're running into this issue. To help us debug, can you share the output from running this command with the --debug option? Could you also share the cluster traces and the approximate time at which you ran the provision command? The traces would be in /etc/servicefabric/FabricDataRoot. Thanks!
Hi @Christina-Kang,
It seems that the provision is taking few minutes and the timeout for the call to provision is 100.
You can download the tracing from here:
https://drive.google.com/file/d/17cAPVzd6v3v7PULhgxyN2_h4SSGRIMmL/view?usp=sharing
Kind Regards, / José.
Hi @jfalameda,
Thanks for your patience! It looks like the actual timeout duration is not the issue here, but rather, an issue with permissions, which holds up the operation until it eventually times out.
Issue is with certain folders not being ACLed correctly. From your traces, it looks like the source is this:
/home/ClusterDeployer/ClusterData/Data/N0030/Fabric/work/Applications/__FabricSystem_App4294967295/work/Store
and the destination is this: /home/ClusterDeployer/ClusterData/Data/N0030/Fabric/work/ImageBuilderProxy/
There are also paths under different "nodes", represented in this path by "N0010" instead of the "N0030" in the previous path: /home/ClusterDeployer/ClusterData/Data/N0010/Fabric/work/Applications/__FabricSystem_App4294967295/work/Store
. The same is for the destination path.
Can you check all these paths (N0010, N0020, N0030) to make sure that users have the correct permissions, and if not, chmod everything under those paths to give user permissions to rwx?
You should be able to retry upload after this is done. You should not need to, but if the above doesn't work, try restarting fabric and seeing if that plus the permission changes allows provision to continue. Please let me know if this mitigates the issue for you.
This issue happens rarely, so if you were to have a cluster in azure, or on a different dev machine, this issue most likely won't reoccur.
Hope this helps! Christina
Hi @Christina-Kang ,
I can see the application provisioning now. I still get the timeout error as the application takes a lot of time to provision.
It is weird that I need to fix permission, as I am using the docker based service fabric.
Thanks.
Kind Regards, / José.
Hi @jfalameda,
Is it timing out with the same error / stack as posted above still? Where the timeout is 100 for read timeouts, rather than the passed in time out.
Thanks!
Hi @Christina-Kang ,
Yes, the application times out the same way. I can see on SF that the application is still provisioning. The application disappears when the provisioning is done without any error, therefore I cannot publish it.
PS: After checking the checksums it shows "unprovisioning." then it disappears from the applications list.
Kind Regards, / José.
Hi @Christina-Kang
Unfortunately 7.1.0 didn't resolve this issue. The timeout is not about uploading the application, but to provision it. The application takes few minutes to provision, while sfctl timeouts before it has completed. By the time it is 100% provisioned, the application unprovision itself automatically.
This is the error I get on the client, I can see the applicaiton is still provisioning on the service fabric explorer.
Error occurred in request., ReadTimeout: HTTPConnectionPool(host='localhost', port=19080): Read timed out. (read timeout=100)
Traceback (most recent call last):
File "/home/jose/.local/lib/python2.7/site-packages/knack/cli.py", line 206, in invoke
cmd_result = self.invocation.execute(args)
File "/home/jose/.local/lib/python2.7/site-packages/sfctl/entry.py", line 81, in execute
return super(SFInvoker, self).execute(args)
File "/home/jose/.local/lib/python2.7/site-packages/knack/invocation.py", line 188, in execute
cmd_result = parsed_args.func(params)
File "/home/jose/.local/lib/python2.7/site-packages/knack/commands.py", line 105, in __call__
return self.handler(*args, **kwargs)
File "/home/jose/.local/lib/python2.7/site-packages/knack/commands.py", line 212, in _command_handler
result = op(client, **command_args) if client else op(**command_args)
File "/home/jose/.local/lib/python2.7/site-packages/sfctl/custom_app_type.py", line 117, in provision_application_type
request, header_parameters, body_content_sorted)
File "/home/jose/.local/lib/python2.7/site-packages/msrest/service_client.py", line 336, in send
pipeline_response = self.config.pipeline.run(request, **kwargs)
File "/home/jose/.local/lib/python2.7/site-packages/msrest/pipeline/__init__.py", line 197, in run
return first_node.send(pipeline_request, **kwargs) # type: ignore
File "/home/jose/.local/lib/python2.7/site-packages/msrest/pipeline/__init__.py", line 150, in send
response = self.next.send(request, **kwargs)
File "/home/jose/.local/lib/python2.7/site-packages/msrest/pipeline/requests.py", line 72, in send
return self.next.send(request, **kwargs)
Kind Regards, / Jose.
@jfalameda While the timeout may occur on the client side (this is a bug on our part, that the timeout can occur before the cluster times out), the cluster should continue its work with provisioning. There should be no unprovision happening, even on error, from the cluster side. If you are running the upload / provision / create process as part of a script, this may be causing the unprovisioning to happen. If this is not the case, are you available for a call with us so we can debug this faster? You can contact me at [email protected]
Thanks!