trafficserver
                                
                                 trafficserver copied to clipboard
                                
                                    trafficserver copied to clipboard
                            
                            
                            
                        dns error logging to diags
Logs using SiteThrottled* version is helpful for noisy logs, being used here instead of print statements to help with the possible retries when a dns code occurs
[approve ci]
@fabikd : can you please run make -j8 format and commit and push the changes? That should satisfy the clang-format CI job.
[approve ci autest]
Right, I think it's a great idea to have @SolidWallOfCode look at it. I used the code from the master branch, https://github.com/apache/trafficserver/blob/master/iocore/dns/DNS.cc#L1556
should I have used a different version? I’ve looked at the existing code in the master branch and these 3 labels did show goto Lerror: default, NOTIMP, NOTZONE, and my intent was to leave goto Lerror as is. I do not see goto Lerror on this version of the code https://github.com/apache/trafficserver/blob/master/iocore/dns/DNS.cc#L1556 for FORMERR, NXDOMAIN, YXDOMAIN, YXRRSET, NXRRSET, NOTAUTH and it was not my intention to remove goto Lerror on any label.
As for the breaks, usually a switch statement has breaks after each case to avoid passing through the code under the next label. If I am using the correct version of the code, was the intention of it to not break and pass through all the named labels and use the last goto Lerror as a catch all?
https://github.com/apache/trafficserver/blob/master/iocore/dns/DNS.cc#L1581
If so, I can add goto Lerror to FORMERR, NXDOMAIN, YXDOMAIN, YXRRSET, NXRRSET, NOTAUTH. Just let me know! Thank you!! (:
I used the code from the master branch, https://github.com/apache/trafficserver/blob/master/iocore/dns/DNS.cc#L1556 should I have used a different version?
Your PR is correctly based on the master branch. Nothing to change there.
I’ve looked at the existing code in the master branch and these 3 labels did show
goto Lerror: default, NOTIMP, NOTZONE, and my intent was to leavegoto Lerroras is. I do not seegoto Lerroron this version of the code https://github.com/apache/trafficserver/blob/master/iocore/dns/DNS.cc#L1556 for FORMERR, NXDOMAIN, YXDOMAIN, YXRRSET, NXRRSET, NOTAUTH and it was not my intention to removegoto Lerroron any label.As for the breaks, usually a switch statement has breaks after each case to avoid passing through the code under the next label.
Developers must intentionally break, return, goto, or not based on the flow of code they desire. Without statements such as these, the runtime path will "fall through" to the following case statements.
If I am using the correct version of the code, was the intention of it to not break and pass through all the named labels and use the last
goto Lerroras a catch all? https://github.com/apache/trafficserver/blob/master/iocore/dns/DNS.cc#L1581 If so, I can addgoto Lerrorto FORMERR, NXDOMAIN, YXDOMAIN, YXRRSET, NXRRSET, NOTAUTH. Just let me know! Thank you!! (:
Correct. Unless break statements or the like are used, case statements fall through to next ones. Thus, in this block from master branch (before your change):
    case NOERROR:
    case NXDOMAIN:
    case 6:  // YXDOMAIN
    case 7:  // YXRRSET
    case 8:  // NOTAUTH
    case 9:  // NOTAUTH
    case 10: // NOTZONE
      Debug("dns", "DNS error %d for [%s]", h->rcode, e->qname);
      goto Lerror;
All of NOERROR, NXDOMAIN, etc., will fall through to that same Debug log line and goto Lerror. There's no need to repeat the Debug and goto Lerror statements for each case.
This can be seen with the following short example: https://godbolt.org/z/cqodTTKvq
Thank you again for working on this.
It is also very standard (from the early days of C coding, which I remember personally) to omit break for cases that have completely identical code, as in this situation.
Also note that for C++17 there is now an standard keyword that should (must?) be used for fallthrough in switch statements. E.g.
switch (value) {
  case 0:
    // code ...
    [[fallthrough]];
  case 1:
    // code ...
See here. Note the style for case 1: and case 2: in the example.
That said, the errors are a bit odd, particularly this one
         - 502 https://127.0.0.1:``/ 2 0
         ?                       ^^
         
         + 502 https://127.0.0.1:61712/ 2 0
         ?                       ^^^^^
My understanding was the double back tick indicated "anything goes" for that token. So what is the failure, exactly? The cert checking logic is a bit dependent on DNS due to how TLS vs. TCP connection errors are handled. It might be that the changes in the query reply code handling is causing this problem.
This may be related to the problem:
On master -
    case NXDOMAIN:
    case 6:  // YXDOMAIN
    case 7:  // YXRRSET
    case 8:  // NOTAUTH
    case 9:  // NOTAUTH
    case 10: // NOTZONE
      Debug("dns", "DNS error %d for [%s]", h->rcode, e->qname);
      goto Lerror;
In the PR
    case NXDOMAIN:
      SiteThrottledNote("NXDOMAIN: DNS error %d for [%s]: Domain name does not exist.", h->rcode, e->qname);
      retry     = false;
      tcp_retry = false;
      goto Lerror;
      break;
Note the change in handling of retry and tcp_retry. This shouldn't happen for a PR that is only changing the logging.
I returned the switch to its original order instead of chronological order, removed all the breaks, and retry in order to allow for the original intention of allowing the fallthrough. In theory this will print the statement that it is selected as well as all the subsequent statements, that comes after it. example:
 
switch (1){ case 0: std::cout << 0; case 1: std::cout << 1; case 2: std::cout << 2; case 3: std::cout << 3;}
Output: 123
 
When researching the logs, it would be confusing to see 123 and have to know that the log that applies is the 1st one. Do we still want to print to log in this case?
Also, I believe I can add [[fallthrough]]; after the code in each label and it might build the failed tests on Rocky, Fedora, Debian, and AuTest.  Because the Diags.h file is flagging all the fall through and failing them.
I'm just a passerby, so I'm sorry if I'm missing some background or previous discussion, but it seems like to me that all the log messages have the same format and you could avoid the repeat. Since DNS RCODE is basically an integer from 0 to 23, I'd simply replace the current Debug lines with:
SiteThrottledNote("%s: DNS error %d for [%s]: %s", RCODE_NAME[h->rcode], h->rcode, e->qname, RCODE_DESCRIPTION[h->rcode]);
I guess using some macros, constexpr, or Lexicon from libswoc would make this a bit more efficient.
https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml#dns-parameters-6
I'm just a passerby, so I'm sorry if I'm missing some background or previous discussion, but it seems like to me that all the log messages have the same format and you could avoid the repeat. Since DNS RCODE is basically an integer from 0 to 23, I'd simply replace the current
Debuglines with:SiteThrottledNote("%s: DNS error %d for [%s]: %s", RCODE_NAME[h->rcode], h->rcode, e->qname, RCODE_DESCRIPTION[h->rcode]);I guess using some macros,
constexpr, orLexiconfrom libswoc would make this a bit more efficient.https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml#dns-parameters-6
Thank you for your feedback!! I will give it a try! (:
[approve ci autest]
Seems fine. A couple of notes, although not worth delaying the merge.
For the names, we might consider using Lexicon, although the utility is less because it's a one way conversion. I might be tempted to have an array of string pairs, to group the tag and description, but then the use site is more complex. This is pretty deep in the style weeds so take them as potential ideas for future work.