aspire icon indicating copy to clipboard operation
aspire copied to clipboard

SQL Timeouts with proxied connections

Open afscrome opened this issue 9 months ago • 4 comments

I've been facing a number of sparodic SQL timeouts - performance generally seems fine, but a number of timeouts show up in the logs. On a hunch, I tried disabling proxying of the sql endpoint and since then I haven't seen a timeout again.

//Try to mitigate timeouts
var endpoint= sql.Resource.Annotations.OfType<EndpointAnnotation>()
    .Single(x => x.Name == sql.Resource.PrimaryEndpoint.EndpointName);
endpoint.IsProxied = false;

Any thoughts on what could be causing the timeouts, or how to trace down the issue.

We are currently using System.SqlClient. We do have a branch in flight that's upgrading to Microsoft.SqlClient, but tha'ts not quite ready to test with yet. We're seeing this in a couple of apps - both of which are windows exes talking to SQL running in a container.

I haven't caught any of these timeouts myself - I find them mainly in the logs from background tasks - I wonder if there is some kind of connection idle / time out that's not being fully handled.

Exceptions are all of the following form:

System.Data.SqlClient.SqlException (0x80131904): Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): The wait operation timed out.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds)
   at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior)
   at System.Data.SqlClient.SqlCommand.ExecuteReader()
   at MYAPP.FOO.BAR ...

afscrome avatar May 15 '24 14:05 afscrome

cc @karolz-ms @danegsta

davidfowl avatar May 16 '24 04:05 davidfowl

Upgraded to Microsoft.Data.SqlClient and the timeouts still persist

afscrome avatar May 20 '24 12:05 afscrome

After adding ;Pooling=false to the connection string, and I haven't seen timeouts for 2 hours now, so it does seem like this is something to do with connection reuse - perhaps connection idle time differing between the app, sql server and the proxy.

afscrome avatar May 20 '24 18:05 afscrome

Got a repro.

  1. Spin up an Aspire app with sql server in aspire
var builder = DistributedApplication.CreateBuilder(args);
var password = builder.AddParameter("SqlPassword", true);
builder.AddSqlServer("SqlServer", password: password, port: 14431);
await builder.Build().RunAsync();
  1. Spin up a console app, replacing the password on line 1 with your password:
var password = "";
var cs = $"Server=127.0.0.1,14431;User ID=sa;Password={password};TrustServerCertificate=true";
var interval = TimeSpan.FromMilliseconds(500);
var concurrency = 10;

var cts = new CancellationTokenSource();
var startTime = DateTime.Now;
try
{
    await Task.WhenAll(Enumerable.Range(0, concurrency).Select(_ => DoWork()));
}
finally
{
    Console.WriteLine($"Start: {startTime}");
    Console.WriteLine($"End: {DateTime.UtcNow}");
}

async Task DoWork()
{
    await Task.Yield();

    while (true)
    {
        using (var connection = new SqlConnection(cs))
        {
            await connection.OpenAsync();
            var cmd = new SqlCommand("select 1", connection);
            try
            {
                await cmd.ExecuteNonQueryAsync();
                Console.WriteLine($"{connection.ClientConnectionId} @ {DateTime.Now}");
            }
            catch (Exception ex)
            {
                cts.Cancel();
                Console.Beep();
                Console.Beep();
                Console.Beep();
                throw new Exception($"CONNECTION FAILED: {connection.ClientConnectionId} @ {DateTime.Now}", ex);
            }
        }
        await Task.Delay(interval + TimeSpan.FromMilliseconds(Random.Shared.Next(-200,200)));
        if (cts.IsCancellationRequested)
        {
            return;
        }
    }
}

I've run this app 5 times now and reproduced 5 times, taking anywhere from 1 minute to 5 minutes before failing. Also reproed on a second non work machine to try and rule out any funny VPN configurations - on my personal machine I reproed in roughly 5 mins.

afscrome avatar May 20 '24 19:05 afscrome

I can reproduce. Investigating...

karolz-ms avatar May 30 '24 21:05 karolz-ms

I think I know what the problem is. The proxy has a bug that may cause it to fail to send a packet when an internal heartbeat occurs between a read from the client and a write to the server. 😬

karolz-ms avatar Jun 04 '24 01:06 karolz-ms

Thanks for digging in @karolz-ms! Thanks for poking at this @afscrome !

davidfowl avatar Jun 04 '24 05:06 davidfowl

The fix has been merged into DCP main branch and it is on its way into first Aspire service release. Thanks again to @afscrome for spotting the issue and a very good repro; it helped a ton.

karolz-ms avatar Jun 05 '24 18:06 karolz-ms