
Implement Comprehensive Monitoring in Python SDK

PulkitMishra opened this issue 4 months ago · 1 comment

Implement Comprehensive Monitoring for Long-running Workflows

Problem Description

Indexify currently lacks a comprehensive built-in solution for monitoring long-running workflows. This makes it difficult for users to track the progress, performance, and resource usage of their pipelines, especially in production environments.

Current Limitations

  1. Limited Progress Tracking: In remote_client.py, the invoke_graph_with_object method provides basic event information:

    print(f"[bold green]{event.event_name}[/bold green]: {event.payload}")
    

    However, this doesn't give a clear picture of overall progress or estimated completion time.

  2. No Performance Metrics: The FunctionWorker class in function_worker.py doesn't collect or report any performance metrics:

    class FunctionWorker:
        def __init__(self, workers: int = 1) -> None:
            self._executor: concurrent.futures.ProcessPoolExecutor = (
                concurrent.futures.ProcessPoolExecutor(max_workers=workers)
            )
    

    There's no tracking of execution time, memory usage, or CPU utilization.

  3. Lack of Centralized Logging: The current logging is scattered and inconsistent. For example, in agent.py:

    console.print(f"[bold]task-reporter[/bold] uploading output of size: {len(completed_task.outputs or [])}")
    

    This approach doesn't provide a centralized, queryable log of system events and errors.

  4. No Real-time Monitoring Interface: There's no built-in way for users to view the current state of their workflows in real-time.

Benefits of Implementing Monitoring

  1. Improved Observability: Users will be able to track the progress of their workflows, identify bottlenecks, and estimate completion times.
  2. Performance Optimization: Collected metrics will help users optimize their workflows and resource allocation.
  3. Easier Debugging: Comprehensive logging and error reporting will make it easier to identify and fix issues in complex workflows.
  4. Resource Management: Monitoring resource usage will help prevent out-of-memory errors and optimize cloud resource allocation.

Proposed Solution

Implement a comprehensive monitoring system with the following components:

  1. Metrics Collection:

    • Add a Metrics class to collect and aggregate performance data (see the first sketch after this list).
    • Instrument key methods in FunctionWorker, Graph, and RemoteClient to collect metrics.
  2. Centralized Logging:

    • Implement a Logger class that provides structured logging with different severity levels.
    • Replace print statements with calls to the logger.
    • Add context information (e.g., graph name, function name) to log messages.
  3. Progress Tracking:

    • Extend the Graph class to include progress information for each node.
    • Implement a progress calculation algorithm that considers the graph structure.
    • Modify RemoteClient to report progress updates.
  4. Real-time Monitoring Interface:

    • Create a Monitor class that aggregates metrics, logs, and progress information.
    • Implement a simple web interface using Flask or FastAPI to display real-time monitoring data (see the second sketch after this list).
    • Create visualizations for metrics and progress (e.g., using Plotly).
  5. Alerting System:

    • Add configurable alerts for specific events or metric thresholds.
    • Implement notification mechanisms (e.g., email, Slack) for alerts.
  6. Testing:

    • Write unit tests for new classes and methods.
    • Update existing tests to work with the new monitoring system.
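
One way the first two components could look in code. This is a minimal sketch, assuming metrics are collected in-process; the names Metrics, timer, and log_event are illustrative and not existing SDK APIs.

    import logging
    import time
    from collections import defaultdict
    from contextlib import contextmanager
    from typing import Dict, List


    class Metrics:
        """Collects and aggregates simple performance measurements in-process."""

        def __init__(self) -> None:
            self._timings: Dict[str, List[float]] = defaultdict(list)
            self._counters: Dict[str, int] = defaultdict(int)

        @contextmanager
        def timer(self, name: str):
            # Record the wall-clock time of the wrapped block under `name`.
            start = time.monotonic()
            try:
                yield
            finally:
                self._timings[name].append(time.monotonic() - start)

        def increment(self, name: str, value: int = 1) -> None:
            self._counters[name] += value

        def summary(self) -> Dict[str, float]:
            # Average duration per instrumented operation, in seconds.
            return {
                name: sum(values) / len(values)
                for name, values in self._timings.items()
                if values
            }


    # Structured logging with workflow context attached to every record,
    # replacing the scattered console.print() / print() calls.
    logger = logging.getLogger("indexify.monitoring")


    def log_event(event: str, **context) -> None:
        logger.info(event, extra={"context": context})


    if __name__ == "__main__":
        metrics = Metrics()
        with metrics.timer("example_function"):
            time.sleep(0.01)  # stands in for executing a graph node
        metrics.increment("tasks_completed")
        log_event("task_completed", graph="example_graph", function="example_function")
        print(metrics.summary())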

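A rough sketch of how per-node progress tracking and a real-time interface might fit together, using FastAPI as one of the options mentioned above. The Monitor class, its node states, and the /progress route are assumptions for illustration, not part of the current SDK.

    from typing import Dict, List

    from fastapi import FastAPI


    class Monitor:
        """Tracks per-node completion state for a single graph invocation."""

        def __init__(self, node_names: List[str]) -> None:
            # Every node starts as pending; keys are the graph's function names.
            self._states: Dict[str, str] = {name: "pending" for name in node_names}

        def mark(self, node: str, state: str) -> None:
            # `state` is one of "pending", "running", "completed", "failed".
            self._states[node] = state

        def states(self) -> Dict[str, str]:
            return dict(self._states)

        def progress(self) -> float:
            # Fraction of nodes completed; a fuller version could weight nodes
            # by expected cost or account for the graph's branching structure.
            done = sum(1 for s in self._states.values() if s == "completed")
            return done / len(self._states) if self._states else 0.0


    app = FastAPI()
    monitor = Monitor(["extract_text", "chunk", "embed"])  # example node names


    @app.get("/progress")
    def get_progress() -> dict:
        # A JSON endpoint a dashboard (or CLI) could poll for live updates.
        return {"progress": monitor.progress(), "nodes": monitor.states()}

Polling a plain HTTP endpoint would keep the monitoring interface decoupled from the executor process; a later iteration could push updates over websockets instead.
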
Related Issues

  • #891: Improve error handling in the Python SDK

PulkitMishra · Oct 01 '24 06:10