frontier icon indicating copy to clipboard operation
frontier copied to clipboard

Logging Improvements for Frontier

Open whoAbhishekSah opened this issue 2 months ago • 0 comments

Overview

This document outlines comprehensive logging improvements for the Frontier gRPC Connect-based RPC server to enhance internal error debugging capabilities for developers. The recommendations focus on improving visibility across all layers of the Domain-Driven Design (DDD) architecture.

Current Logging Architecture

Strengths ✅

  • Structured JSON logging with Zap logger
  • Request ID propagation via X-Request-ID header
  • Connect RPC interceptor logging with duration, method, and error codes
  • Clean error mapping between layers (Repository → Service → Handler)
  • Comprehensive audit logging for user actions
  • Proper HTTP status code mapping from domain errors

Current Pain Points ❌

  • Generic "Internal Server Error" responses hide root causes in production
  • Minimal logging in service and repository layers
  • Lost context when errors bubble up through layers
  • No correlation between database query performance and errors
  • Difficult to trace error origins across the DDD layers
  • No automated alerting on error rate spikes

Detailed Logging Improvement Recommendations

1. Enhanced Error Context Propagation

Problem: Errors lose context as they bubble up from repository → service → handler layers.

Files to modify:

  • internal/store/postgres/user_repository.go:67
  • core/user/service.go:71
  • internal/api/v1beta1connect/user.go:42
  • Similar patterns across all repository and service files

Repository Layer Enhancement

// Example for internal/store/postgres/user_repository.go
func (r UserRepository) GetByID(ctx context.Context, id string) (user.User, error) {
    logger := grpczap.Extract(ctx)

    if strings.TrimSpace(id) == "" {
        return user.User{}, user.ErrInvalidID
    }

    var fetchedUser User
    userQuery, params, err := dialect.From(TABLE_USERS).
        Where(goqu.Ex{
            "id": id,
        }).Where(notDisabledUserExp).ToSQL()
    if err != nil {
        return user.User{}, fmt.Errorf("%w: %w", queryErr, err)
    }

    if err = r.dbc.WithTimeout(ctx, TABLE_USERS, "GetByID", func(ctx context.Context) error {
        return r.dbc.GetContext(ctx, &fetchedUser, userQuery, params...)
    }); err != nil {
        // Log the original database error with context
        logger.Error("database query failed",
            zap.String("operation", "user.GetByID"),
            zap.String("table", TABLE_USERS),
            zap.String("user_id", id),
            zap.String("query", userQuery),
            zap.Any("params", params),
            zap.String("error_type", fmt.Sprintf("%T", err)),
            zap.Error(err))

        err = checkPostgresError(err)
        switch {
        case errors.Is(err, ErrDuplicateKey):
            return user.User{}, fmt.Errorf("user lookup failed for id=%s: %w", id, user.ErrConflict)
        case errors.Is(err, sql.ErrNoRows):
            return user.User{}, fmt.Errorf("user not found for id=%s: %w", id, user.ErrNotExist)
        case errors.Is(err, ErrInvalidTextRepresentation):
            return user.User{}, fmt.Errorf("invalid UUID format for id=%s: %w", id, user.ErrInvalidUUID)
        default:
            return user.User{}, fmt.Errorf("database error in user.GetByID id=%s: %w", id, err)
        }
    }

    transformedUser, err := fetchedUser.transformToUser()
    if err != nil {
        logger.Error("user transformation failed",
            zap.String("operation", "user.GetByID"),
            zap.String("user_id", id),
            zap.Error(err))
        return user.User{}, fmt.Errorf("user transform error for id=%s: %w", id, err)
    }

    return transformedUser, nil
}

Service Layer Enhancement

// Example for core/user/service.go
func (s Service) Create(ctx context.Context, user User) (User, error) {
    logger := grpczap.Extract(ctx)
    logger.Info("creating user",
        zap.String("operation", "user.Create"),
        zap.String("email", user.Email),
        zap.String("name", user.Name))

    result, err := s.repository.Create(ctx, User{
        Name:     strings.ToLower(user.Name),
        Email:    strings.ToLower(user.Email),
        State:    Enabled,
        Avatar:   user.Avatar,
        Title:    user.Title,
        Metadata: user.Metadata,
    })

    if err != nil {
        logger.Error("user creation failed",
            zap.String("operation", "user.Create"),
            zap.String("email", user.Email),
            zap.String("name", user.Name),
            zap.String("error_type", fmt.Sprintf("%T", err)),
            zap.Error(err))
        return User{}, fmt.Errorf("service.Create failed for user %s: %w", user.Email, err)
    }

    logger.Info("user created successfully",
        zap.String("operation", "user.Create"),
        zap.String("user_id", result.ID),
        zap.String("email", result.Email))

    return result, nil
}

func (s Service) GetByID(ctx context.Context, id string) (User, error) {
    logger := grpczap.Extract(ctx)

    logger.Debug("fetching user",
        zap.String("operation", "user.GetByID"),
        zap.String("id", id))

    var result User
    var err error

    if isValidEmail(id) {
        result, err = s.GetByEmail(ctx, id)
    } else if utils.IsValidUUID(id) {
        result, err = s.repository.GetByID(ctx, id)
    } else {
        result, err = s.repository.GetByName(ctx, strings.ToLower(id))
    }

    if err != nil {
        logger.Error("user fetch failed",
            zap.String("operation", "user.GetByID"),
            zap.String("id", id),
            zap.String("lookup_type", getLookupType(id)),
            zap.String("error_type", fmt.Sprintf("%T", err)),
            zap.Error(err))
        return User{}, fmt.Errorf("service.GetByID failed for id=%s: %w", id, err)
    }

    return result, nil
}

func getLookupType(id string) string {
    if isValidEmail(id) {
        return "email"
    } else if utils.IsValidUUID(id) {
        return "uuid"
    }
    return "name"
}

2. Enhanced Handler Error Logging

Problem: Handlers map all internal errors to generic "Internal Server Error" without logging specifics.

Files to modify:

  • internal/api/v1beta1connect/user.go
  • All handler files in internal/api/v1beta1connect/*.go
// Example for internal/api/v1beta1connect/user.go
func (h *ConnectHandler) CreateUser(ctx context.Context, request *connect.Request[frontierv1beta1.CreateUserRequest]) (*connect.Response[frontierv1beta1.CreateUserResponse], error) {
    logger := grpczap.Extract(ctx)
    requestID := request.Header().Get(consts.RequestIDHeader)

    if request.Msg.GetBody() == nil {
        logger.Warn("create user request missing body",
            zap.String("operation", "CreateUser"),
            zap.String("request_id", requestID))
        return nil, connect.NewError(connect.CodeInvalidArgument, ErrBadRequest)
    }

    email := strings.TrimSpace(request.Msg.GetBody().GetEmail())
    if email == "" {
        currentUserEmail, ok := authenticate.GetEmailFromContext(ctx)
        if !ok {
            logger.Warn("create user missing email and no authenticated context",
                zap.String("operation", "CreateUser"),
                zap.String("request_id", requestID))
            return nil, connect.NewError(connect.CodeInvalidArgument, ErrBadRequest)
        }
        currentUserEmail = strings.TrimSpace(currentUserEmail)
        if currentUserEmail == "" {
            logger.Error("authenticated user has empty email",
                zap.String("operation", "CreateUser"),
                zap.String("request_id", requestID))
            return nil, connect.NewError(connect.CodeInvalidArgument, ErrBadRequest)
        }
        email = currentUserEmail
    }

    title := request.Msg.GetBody().GetTitle()
    name := strings.TrimSpace(request.Msg.GetBody().GetName())
    if name == "" {
        name = str.GenerateUserSlug(email)
    }

    var metaDataMap metadata.Metadata
    if request.Msg.GetBody().GetMetadata() != nil {
        metaDataMap = metadata.Build(request.Msg.GetBody().GetMetadata().AsMap())
        if err := h.metaSchemaService.Validate(metaDataMap, userMetaSchema); err != nil {
            logger.Warn("metadata validation failed",
                zap.String("operation", "CreateUser"),
                zap.String("request_id", requestID),
                zap.String("email", email),
                zap.Error(err))
            return nil, connect.NewError(connect.CodeInvalidArgument, ErrBadBodyMetaSchemaError)
        }
    }

    logger.Info("creating user",
        zap.String("operation", "CreateUser"),
        zap.String("request_id", requestID),
        zap.String("email", email),
        zap.String("name", name))

    newUser, err := h.userService.Create(ctx, user.User{
        Title:    title,
        Email:    email,
        Name:     name,
        Avatar:   request.Msg.GetBody().GetAvatar(),
        Metadata: metaDataMap,
    })

    if err != nil {
        // Log detailed error info before generic mapping
        logger.Error("user service create failed",
            zap.String("operation", "CreateUser"),
            zap.String("request_id", requestID),
            zap.String("email", email),
            zap.String("name", name),
            zap.String("error_type", fmt.Sprintf("%T", err)),
            zap.Error(err))

        switch {
        case errors.Is(err, user.ErrConflict):
            return nil, connect.NewError(connect.CodeAlreadyExists, ErrConflictRequest)
        case errors.Is(errors.Unwrap(err), user.ErrKeyDoesNotExists):
            return nil, connect.NewError(connect.CodeInvalidArgument, ErrBadRequest)
        default:
            // Log additional context for internal errors
            logger.Error("unexpected error in CreateUser",
                zap.String("operation", "CreateUser"),
                zap.String("request_id", requestID),
                zap.String("user_email", email),
                zap.String("user_name", name),
                zap.String("error_chain", fmt.Sprintf("%+v", err)),
                zap.Error(err))
            return nil, connect.NewError(connect.CodeInternal, ErrInternalServerError)
        }
    }

    transformedUser, err := transformUserToPB(newUser)
    if err != nil {
        logger.Error("user transformation to protobuf failed",
            zap.String("operation", "CreateUser"),
            zap.String("request_id", requestID),
            zap.String("user_id", newUser.ID),
            zap.String("error_type", fmt.Sprintf("%T", err)),
            zap.Error(err))
        return nil, connect.NewError(connect.CodeInternal, ErrInternalServerError)
    }

    // Log successful creation
    logger.Info("user created successfully",
        zap.String("operation", "CreateUser"),
        zap.String("request_id", requestID),
        zap.String("user_id", newUser.ID),
        zap.String("email", newUser.Email))

    audit.GetAuditor(ctx, schema.PlatformOrgID.String()).
        LogWithAttrs(audit.UserCreatedEvent, audit.UserTarget(newUser.ID), map[string]string{
            "email":  newUser.Email,
            "name":   newUser.Name,
            "title":  newUser.Title,
            "avatar": newUser.Avatar,
        })

    return connect.NewResponse(&frontierv1beta1.CreateUserResponse{User: transformedUser}), nil
}

3. Database Query Performance Logging

Problem: No visibility into slow queries or database performance issues.

Files to modify:

  • All repository files in internal/store/postgres/
// Enhanced database operation logging
func (r UserRepository) GetByID(ctx context.Context, id string) (user.User, error) {
    logger := grpczap.Extract(ctx)
    startTime := time.Now()

    if strings.TrimSpace(id) == "" {
        return user.User{}, user.ErrInvalidID
    }

    var fetchedUser User
    userQuery, params, err := dialect.From(TABLE_USERS).
        Where(goqu.Ex{
            "id": id,
        }).Where(notDisabledUserExp).ToSQL()
    if err != nil {
        return user.User{}, fmt.Errorf("%w: %w", queryErr, err)
    }

    if err = r.dbc.WithTimeout(ctx, TABLE_USERS, "GetByID", func(ctx context.Context) error {
        return r.dbc.GetContext(ctx, &fetchedUser, userQuery, params...)
    }); err != nil {
        duration := time.Since(startTime)
        logger.Error("database operation failed",
            zap.String("operation", "user.GetByID"),
            zap.String("table", TABLE_USERS),
            zap.String("user_id", id),
            zap.Duration("query_duration", duration),
            zap.String("query", userQuery),
            zap.Any("query_params", params),
            zap.String("error_type", fmt.Sprintf("%T", err)),
            zap.Error(err))

        // ... existing error handling
    } else {
        duration := time.Since(startTime)

        // Log slow queries for performance monitoring
        if duration > 100*time.Millisecond {
            logger.Warn("slow database query detected",
                zap.String("operation", "user.GetByID"),
                zap.String("table", TABLE_USERS),
                zap.String("user_id", id),
                zap.Duration("query_duration", duration),
                zap.String("query", userQuery),
                zap.Any("query_params", params))
        }

        // Debug logging for all queries in development
        logger.Debug("database query completed",
            zap.String("operation", "user.GetByID"),
            zap.String("table", TABLE_USERS),
            zap.String("user_id", id),
            zap.Duration("query_duration", duration))
    }

    transformedUser, err := fetchedUser.transformToUser()
    if err != nil {
        logger.Error("user transformation failed",
            zap.String("operation", "user.GetByID"),
            zap.String("user_id", id),
            zap.Error(err))
        return user.User{}, fmt.Errorf("%w: %w", parseErr, err)
    }

    return transformedUser, nil
}

// Enhanced Create method with transaction logging
func (r UserRepository) Create(ctx context.Context, usr user.User) (user.User, error) {
    logger := grpczap.Extract(ctx)
    startTime := time.Now()

    if strings.TrimSpace(usr.Email) == "" || strings.TrimSpace(usr.Name) == "" {
        return user.User{}, user.ErrInvalidDetails
    }

    insertRow := goqu.Record{
        "name":       strings.ToLower(usr.Name),
        "email":      strings.ToLower(usr.Email),
        "title":      usr.Title,
        "avatar":     usr.Avatar,
        "created_at": goqu.L("now()"),
        "updated_at": goqu.L("now()"),
    }

    if usr.Metadata != nil {
        marshaledMetadata, err := json.Marshal(usr.Metadata)
        if err != nil {
            logger.Error("metadata marshaling failed",
                zap.String("operation", "user.Create"),
                zap.String("email", usr.Email),
                zap.Error(err))
            return user.User{}, fmt.Errorf("%w: %w", parseErr, err)
        }
        insertRow["metadata"] = marshaledMetadata
    }

    if usr.State != "" {
        insertRow["state"] = usr.State
    }

    createQuery, params, err := dialect.Insert(TABLE_USERS).Rows(insertRow).Returning(&User{}).ToSQL()
    if err != nil {
        return user.User{}, fmt.Errorf("%w: %w", queryErr, err)
    }

    tx, err := r.dbc.BeginTxx(ctx, nil)
    if err != nil {
        logger.Error("transaction begin failed",
            zap.String("operation", "user.Create"),
            zap.String("email", usr.Email),
            zap.Error(err))
        return user.User{}, err
    }

    var userModel User
    if err = r.dbc.WithTimeout(ctx, TABLE_USERS, "Create", func(ctx context.Context) error {
        return tx.QueryRowxContext(ctx, createQuery, params...).StructScan(&userModel)
    }); err != nil {
        duration := time.Since(startTime)

        logger.Error("user creation query failed",
            zap.String("operation", "user.Create"),
            zap.String("table", TABLE_USERS),
            zap.String("email", usr.Email),
            zap.String("name", usr.Name),
            zap.Duration("query_duration", duration),
            zap.String("query", createQuery),
            zap.Any("query_params", params),
            zap.String("error_type", fmt.Sprintf("%T", err)),
            zap.Error(err))

        if rollbackErr := tx.Rollback(); rollbackErr != nil {
            logger.Error("transaction rollback failed",
                zap.String("operation", "user.Create"),
                zap.String("email", usr.Email),
                zap.Error(rollbackErr))
        }

        err = checkPostgresError(err)
        switch {
        case errors.Is(err, ErrDuplicateKey):
            return user.User{}, user.ErrConflict
        default:
            return user.User{}, err
        }
    }

    if err = tx.Commit(); err != nil {
        duration := time.Since(startTime)
        logger.Error("transaction commit failed",
            zap.String("operation", "user.Create"),
            zap.String("email", usr.Email),
            zap.Duration("total_duration", duration),
            zap.Error(err))
        return user.User{}, err
    }

    duration := time.Since(startTime)
    logger.Info("user created successfully",
        zap.String("operation", "user.Create"),
        zap.String("user_id", userModel.ID),
        zap.String("email", usr.Email),
        zap.Duration("total_duration", duration))

    transformedUser, err := userModel.transformToUser()
    if err != nil {
        logger.Error("user transformation failed after creation",
            zap.String("operation", "user.Create"),
            zap.String("user_id", userModel.ID),
            zap.Error(err))
        return user.User{}, fmt.Errorf("%w: %w", parseErr, err)
    }

    return transformedUser, nil
}

4. Enhanced Request Correlation

Problem: Hard to trace a request's journey through all layers.

Files to modify:

  • pkg/server/connect_interceptors/logger.go
// Enhanced Connect interceptor logging
func UnaryConnectLoggerInterceptor(logger *zap.Logger, opts *LoggerOptions) connect.UnaryInterceptorFunc {
    if opts == nil {
        opts = NewLoggerOptions()
    }

    return func(next connect.UnaryFunc) connect.UnaryFunc {
        return func(ctx context.Context, req connect.AnyRequest) (connect.AnyResponse, error) {
            if !opts.decider(req.Spec().Procedure) {
                return next(ctx, req)
            }

            // Embed enhanced logger in context with request metadata
            requestID := req.Header().Get(consts.RequestIDHeader)
            enrichedLogger := logger.With(
                zap.String("request_id", requestID),
                zap.String("method", req.Spec().Procedure))
            ctx = context.WithValue(ctx, loggerContextKey, enrichedLogger)

            startTime := time.Now()

            // Log request start
            enrichedLogger.Debug("request started",
                zap.String("user_agent", req.Header().Get("User-Agent")),
                zap.String("remote_addr", req.Header().Get("X-Forwarded-For")),
                zap.String("content_type", req.Header().Get("Content-Type")))

            resp, err := next(ctx, req)
            duration := time.Since(startTime)

            code := connect.Code(0)
            if connectErr, ok := err.(*connect.Error); ok {
                code = connectErr.Code()
            }

            fields := []zap.Field{
                zap.String("system", "connect_rpc"),
                zap.Time("start_time", startTime),
                zap.String("method", req.Spec().Procedure),
                zap.Int64("time_ms", duration.Milliseconds()),
                zap.String("code", code.String()),
                zap.String("request_id", requestID),
                zap.String("user_agent", req.Header().Get("User-Agent")),
                zap.String("remote_addr", req.Header().Get("X-Forwarded-For")),
            }

            // Add response size if available
            if resp != nil {
                if size := resp.Header().Get("Content-Length"); size != "" {
                    fields = append(fields, zap.String("response_size", size))
                }
            }

            if err != nil {
                fields = append(fields,
                    zap.String("error_type", fmt.Sprintf("%T", err)),
                    zap.Bool("is_user_error", isUserError(code)),
                    zap.Error(err))
            }

            // Log completion with appropriate level
            if err == nil {
                enrichedLogger.Info("request completed", fields...)
                return resp, err
            }

            switch code {
            case connect.CodeCanceled:
                enrichedLogger.Warn("client cancelled request", fields...)
            case connect.CodeDeadlineExceeded:
                enrichedLogger.Warn("request timeout", fields...)
            case connect.CodeInvalidArgument,
                connect.CodeNotFound,
                connect.CodeAlreadyExists,
                connect.CodeUnauthenticated,
                connect.CodePermissionDenied,
                connect.CodeFailedPrecondition,
                connect.CodeOutOfRange:
                enrichedLogger.Warn("client error", fields...)
            default:
                enrichedLogger.Error("server error", fields...)
            }
            return resp, err
        }
    }
}

func isUserError(code connect.Code) bool {
    return code == connect.CodeInvalidArgument ||
           code == connect.CodeNotFound ||
           code == connect.CodeAlreadyExists ||
           code == connect.CodeUnauthenticated ||
           code == connect.CodePermissionDenied ||
           code == connect.CodeFailedPrecondition ||
           code == connect.CodeOutOfRange
}

5. Centralized Error Classification Utility

Problem: Error mapping is scattered across handlers with inconsistent logging.

New file: internal/api/v1beta1connect/error_handler.go

package v1beta1connect

import (
    "context"
    "fmt"

    "connectrpc.com/connect"
    "github.com/grpc-ecosystem/go-grpc-middleware/logging/zap/ctxzap"
    "github.com/pkg/errors"
    "go.uber.org/zap"

    "github.com/raystack/frontier/core/user"
    "github.com/raystack/frontier/core/organization"
    "github.com/raystack/frontier/core/project"
    // ... other domain imports
)

type ErrorHandler struct{}

func NewErrorHandler() *ErrorHandler {
    return &ErrorHandler{}
}

func (e *ErrorHandler) HandleServiceError(ctx context.Context, operation string, err error, contextFields ...zap.Field) error {
    logger := ctxzap.Extract(ctx)

    baseFields := []zap.Field{
        zap.String("operation", operation),
        zap.String("error_type", fmt.Sprintf("%T", err)),
        zap.String("error_chain", fmt.Sprintf("%+v", err)),
        zap.Error(err),
    }
    baseFields = append(baseFields, contextFields...)

    // User domain errors
    switch {
    case errors.Is(err, user.ErrNotExist):
        logger.Warn("user not found", baseFields...)
        return connect.NewError(connect.CodeNotFound, ErrUserNotExist)
    case errors.Is(err, user.ErrConflict):
        logger.Warn("user conflict", baseFields...)
        return connect.NewError(connect.CodeAlreadyExists, ErrConflictRequest)
    case errors.Is(err, user.ErrInvalidDetails):
        logger.Warn("invalid user details", baseFields...)
        return connect.NewError(connect.CodeInvalidArgument, ErrBadRequest)
    case errors.Is(err, user.ErrInvalidID):
        logger.Warn("invalid user ID", baseFields...)
        return connect.NewError(connect.CodeInvalidArgument, ErrBadRequest)
    case errors.Is(err, user.ErrInvalidUUID):
        logger.Warn("invalid user UUID", baseFields...)
        return connect.NewError(connect.CodeInvalidArgument, ErrBadRequest)

    // Organization domain errors
    case errors.Is(err, organization.ErrNotExist):
        logger.Warn("organization not found", baseFields...)
        return connect.NewError(connect.CodeNotFound, ErrNotFound)
    case errors.Is(err, organization.ErrConflict):
        logger.Warn("organization conflict", baseFields...)
        return connect.NewError(connect.CodeAlreadyExists, ErrConflictRequest)
    case errors.Is(err, organization.ErrDisabled):
        logger.Warn("organization disabled", baseFields...)
        return connect.NewError(connect.CodePermissionDenied, ErrOrganizationDisabled)

    // Project domain errors
    case errors.Is(err, project.ErrNotExist):
        logger.Warn("project not found", baseFields...)
        return connect.NewError(connect.CodeNotFound, ErrNotFound)
    case errors.Is(err, project.ErrConflict):
        logger.Warn("project conflict", baseFields...)
        return connect.NewError(connect.CodeAlreadyExists, ErrConflictRequest)

    // Generic validation errors
    case isValidationError(err):
        logger.Warn("validation error", baseFields...)
        return connect.NewError(connect.CodeInvalidArgument, ErrBadRequest)

    // Authentication/Authorization errors
    case isAuthError(err):
        logger.Warn("authentication/authorization error", baseFields...)
        return connect.NewError(connect.CodeUnauthenticated, ErrUnauthenticated)

    default:
        // Log all unhandled errors as server errors
        logger.Error("unhandled service error", baseFields...)
        return connect.NewError(connect.CodeInternal, ErrInternalServerError)
    }
}

func isValidationError(err error) bool {
    // Add checks for validation-related errors
    return false // Implement based on your validation error types
}

func isAuthError(err error) bool {
    // Add checks for authentication-related errors
    return false // Implement based on your auth error types
}

// Usage example in handlers
func (h *ConnectHandler) GetUser(ctx context.Context, request *connect.Request[frontierv1beta1.GetUserRequest]) (*connect.Response[frontierv1beta1.GetUserResponse], error) {
    errorHandler := NewErrorHandler()

    fetchedUser, err := h.userService.GetByID(ctx, request.Msg.GetId())
    if err != nil {
        return nil, errorHandler.HandleServiceError(ctx, "GetUser", err,
            zap.String("user_id", request.Msg.GetId()),
            zap.String("request_id", request.Header().Get(consts.RequestIDHeader)))
    }

    userPB, err := transformUserToPB(fetchedUser)
    if err != nil {
        return nil, errorHandler.HandleServiceError(ctx, "GetUser.Transform", err,
            zap.String("user_id", fetchedUser.ID))
    }

    return connect.NewResponse(&frontierv1beta1.GetUserResponse{
        User: userPB,
    }), nil
}

6. Metric-Based Error Monitoring

Problem: No automated alerting on error rate spikes.

Files to modify:

  • pkg/server/connect_interceptors/logger.go
import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Add metrics to the interceptor
var (
    requestCount = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "frontier_requests_total",
            Help: "Total number of requests processed",
        },
        []string{"method", "code"},
    )

    errorCount = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "frontier_errors_total",
            Help: "Total number of errors by type and layer",
        },
        []string{"method", "error_type", "layer", "code"},
    )

    requestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "frontier_request_duration_seconds",
            Help:    "Request duration in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "code"},
    )

    slowQueryCount = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "frontier_slow_queries_total",
            Help: "Total number of slow database queries",
        },
        []string{"table", "operation"},
    )
)

func UnaryConnectLoggerInterceptor(logger *zap.Logger, opts *LoggerOptions) connect.UnaryInterceptorFunc {
    if opts == nil {
        opts = NewLoggerOptions()
    }

    return func(next connect.UnaryFunc) connect.UnaryFunc {
        return func(ctx context.Context, req connect.AnyRequest) (connect.AnyResponse, error) {
            if !opts.decider(req.Spec().Procedure) {
                return next(ctx, req)
            }

            startTime := time.Now()
            resp, err := next(ctx, req)
            duration := time.Since(startTime)

            code := connect.Code(0)
            if connectErr, ok := err.(*connect.Error); ok {
                code = connectErr.Code()
            }

            // Record metrics
            method := req.Spec().Procedure
            codeStr := code.String()

            requestCount.WithLabelValues(method, codeStr).Inc()
            requestDuration.WithLabelValues(method, codeStr).Observe(duration.Seconds())

            if err != nil {
                errorCount.WithLabelValues(
                    method,
                    fmt.Sprintf("%T", err),
                    "handler",
                    codeStr).Inc()
            }

            // ... existing logging code ...

            return resp, err
        }
    }
}

// Add to repository methods for database metrics
func (r UserRepository) recordSlowQuery(table, operation string, duration time.Duration) {
    if duration > 100*time.Millisecond {
        slowQueryCount.WithLabelValues(table, operation).Inc()
    }
}

Implementation Priority

High Priority (Immediate Impact) 🔴

  1. Enhanced handler error logging - Provides immediate debugging visibility
  2. Database operation logging with performance metrics
  3. Service layer contextual logging - Fills the current logging gap
  4. Request correlation enhancements - Improves traceability

Medium Priority (Next Phase) 🟡

  1. Centralized error classification utility - Reduces code duplication
  2. Error rate metrics - Enables monitoring and alerting
  3. Enhanced error context propagation - Improves error investigation

Low Priority (Future Improvements) 🟢

  1. Advanced query analysis and optimization logging
  2. Automated error alerting integration
  3. Error trend analysis and reporting

Example Debug Scenario

With these improvements implemented, when a user creation fails, developers will see complete tracing:

Before (Current State)

{
  "level": "error",
  "time": "2024-01-15T10:30:45Z",
  "msg": "finished call",
  "method": "/raystack.frontier.v1beta1.FrontierService/CreateUser",
  "code": "internal",
  "error": "Internal Server Error"
}

After (With Improvements)

{
  "level": "error",
  "time": "2024-01-15T10:30:45Z",
  "msg": "database query failed",
  "system": "connect_rpc",
  "request_id": "req_abc123",
  "method": "/raystack.frontier.v1beta1.FrontierService/CreateUser",
  "operation": "user.Create",
  "table": "users",
  "email": "[email protected]",
  "query": "INSERT INTO users (name, email, title, avatar, created_at, updated_at) VALUES ($1, $2, $3, $4, now(), now()) RETURNING *",
  "query_params": ["testuser", "[email protected]", "", "", "now()", "now()"],
  "query_duration": "2ms",
  "error_type": "*pq.Error",
  "error_chain": "service.Create failed for user [email protected]: database error in user.Create: pq: duplicate key value violates unique constraint \"users_email_key\"",
  "error": "duplicate key value violates unique constraint \"users_email_key\""
}

{
  "level": "error",
  "time": "2024-01-15T10:30:45Z",
  "msg": "user service create failed",
  "operation": "CreateUser",
  "request_id": "req_abc123",
  "email": "[email protected]",
  "name": "testuser",
  "error_type": "*fmt.wrapError",
  "error": "service.Create failed for user [email protected]: database error in user.Create: pq: duplicate key value violates unique constraint \"users_email_key\""
}

This provides complete context for debugging:

  • Original request details (request ID, email, name)
  • Failing operation (user.Create)
  • Actual SQL query and parameters
  • Root database error (constraint violation)
  • Error propagation chain through all layers
  • Performance metrics (query duration)

Benefits

  1. Faster Issue Resolution - Developers can immediately identify root causes
  2. Better Production Monitoring - Proactive detection of issues before they escalate
  3. Improved Code Quality - Visibility into performance bottlenecks
  4. Enhanced Debugging - Complete request tracing through all DDD layers
  5. Better Error Classification - Distinguishes between user errors and system errors
  6. Operational Insights - Database performance and system health visibility

Next Steps

  1. Start with High Priority items - Focus on immediate debugging improvements
  2. Implement incrementally - Roll out layer by layer to avoid disruption
  3. Test thoroughly - Ensure logging doesn't impact performance significantly
  4. Set up monitoring - Configure alerts based on new error metrics
  5. Document patterns - Create guidelines for consistent logging across teams
  6. Review and iterate - Gather feedback from developers and refine approach

whoAbhishekSah avatar Nov 12 '25 07:11 whoAbhishekSah