opentelemetry-operations-python

Running on multiple AppEngine instances

Open · mjvankampen opened this issue 2 years ago · 7 comments

I've been setting up OTel metrics with Google Cloud Monitoring for our Django app running on AppEngine.

I used code that looked like this:

# Imports shown for completeness; GOOGLE_CLOUD_PROJECT, env() and trace_exporter
# are defined elsewhere in our Django settings.
from opentelemetry import metrics, trace
from opentelemetry.exporter.cloud_monitoring import CloudMonitoringMetricsExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

metrics_exporter = CloudMonitoringMetricsExporter(
    project_id=GOOGLE_CLOUD_PROJECT, add_unique_identifier=True
)
metric_reader = PeriodicExportingMetricReader(
    exporter=metrics_exporter, export_interval_millis=5000
)

resource = Resource.create(
    {
        "service.name": env("GAE_SERVICE", default="cx-api"),
        "service.namespace": "Our Platform",
        "service.instance.id": env("GAE_INSTANCE", default="local"),
        "service.version": env("GAE_VERSION", default="local"),
    }
)
tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
trace.set_tracer_provider(tracer_provider)

metrics.set_meter_provider(
    MeterProvider(
        metric_readers=[metric_reader],
        # As GCP only allows writing to a time series once per sampling period, we need to
        # make sure every instance has a different series by setting a unique instance id
        resource=resource,
    )
)

I thought that by setting a unique instance ID I would get around the "One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric" error. But it seems I really do need add_unique_identifier=True as well. While I understand this for multithreaded applications or for multiple exporters sharing the same resource, I don't understand it when the resource is already unique.

mjvankampen · Dec 01 '22 07:12

Even with add_unique_identifier set to True, the error still pops up sometimes.

mjvankampen · Dec 01 '22 12:12

I think I found out why it still happens sometimes: it coincides with an AppEngine instance being scaled down. That makes sense if metrics are flushed on shutdown.

mjvankampen · Dec 02 '22 19:12

Is your app being run in multiple processes, e.g. the gunicorn pre-fork model?

If not, then the final flush at shutdown is the likely culprit. The shortest allowed interval for publishing metrics to Cloud Monitoring is 5s, so if your app flushes metrics on shutdown and the previous export ran within the last 5s, you can see this error. Do you see any issues in your dashboards/metrics?

aabmass · Jan 19 '23 20:01

Hey, I hope it's okay to latch onto this question. What would the answer be if the app were run under gunicorn? I'm currently testing exporting metrics to Cloud Monitoring, and during local development (with a single-process Flask development server) everything works fine. But when I deploy to Cloud Run (gunicorn with multiple workers), I frequently get "One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric".

I'm setting up the reader/exporter with add_unique_identifier=True, like this:

PeriodicExportingMetricReader(
    CloudMonitoringMetricsExporter(add_unique_identifier=True),
    export_interval_millis=5000,
)

Any tips on how to avoid this? Thanks a lot!

nsaef · Apr 13 '23 11:04

Hi! Thanks for the lib. I see a similar issue:

One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older start time than the most recent point

This is a uvicorn app on Cloud Run with 2 containers, running 2 workers each.

A minimal repro of the code:

# gcp_project_id is defined elsewhere; imports shown for completeness.
from opentelemetry.exporter.cloud_monitoring import CloudMonitoringMetricsExporter
from opentelemetry.resourcedetector.gcp_resource_detector import GoogleCloudResourceDetector
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

exporter = CloudMonitoringMetricsExporter(project_id=gcp_project_id)
reader = PeriodicExportingMetricReader(exporter)
detector = GoogleCloudResourceDetector(raise_on_error=True)
provider = MeterProvider(metric_readers=[reader], resource=detector.detect())
meter = provider.get_meter(name="api")
# The meter object is injected into multiple classes, each of which creates its own instruments, e.g.:
latency = meter.create_histogram(name="api.latency", unit="ms")

kendra-human · May 11 '23 03:05

I'd really like to fix this in an automatic way, but I'm not sure of the best way to go about it. Are you able to defer initialization of the OpenTelemetry MeterProvider until after the workers start (post-fork)?
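
For gunicorn, a minimal sketch of that post-fork approach could look like the following. This assumes gunicorn's post_fork server hook in a gunicorn.conf.py; the configuration values are illustrative, not taken from this thread.

# gunicorn.conf.py -- sketch: initialize OpenTelemetry metrics per worker, after the
# fork, so each worker exports its own time series.
from opentelemetry import metrics
from opentelemetry.exporter.cloud_monitoring import CloudMonitoringMetricsExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

def post_fork(server, worker):
    # add_unique_identifier labels each worker's time series uniquely so
    # concurrent workers don't write to the same series.
    exporter = CloudMonitoringMetricsExporter(add_unique_identifier=True)
    reader = PeriodicExportingMetricReader(exporter, export_interval_millis=5000)
    metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

Depending on how the app creates its instruments, those may also need to be created (or looked up via metrics.get_meter(...)) after the provider is set in each worker.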

aabmass · Jul 17 '23 19:07