physionet-build

S3: usage monitoring

Open bemoody opened this issue 2 years ago • 3 comments

(Splitting this off from #2093 as this is kind of a separate issue, and affects both public and restricted data.)

As we post data on Amazon S3, we want to be able to gather some metrics of usage. "Metrics" might include things like:

  • number of requests
  • number of bytes of data retrieved
  • number of bytes egressed from the Amazon network
  • number of distinct clients per day

Gathering such metrics isn't essential, nor do I particularly care which metrics we capture, but it would be highly desirable to have some way of measuring a project's usage. We should try to understand the monitoring services that Amazon provides, insofar as their pricing and technical limitations may have a major impact on how we want to structure the data buckets.

As usual, I have zero actual experience or inside knowledge and am trying to guess, based on the buzzword-infested public documentation, how the AWS system actually works and what services might provide what we're looking for.

bemoody, Sep 29 '23 15:09

CloudWatch (https://docs.aws.amazon.com/AmazonS3/latest/userguide/cloudwatch-monitoring.html)

This provides (https://aws.amazon.com/cloudwatch/pricing/):

  • "Basic Monitoring" plus ten "Metrics" for free.
  • Additional "Metrics" costing $0.30/month each.

These are supposedly provided with one-minute granularity.

I don't know what a "Metric" is - a single number? If we wanted to collect four Metrics per month for each published project, that's already a non-trivial expense.
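To put a rough number on that, here is a back-of-the-envelope sketch. It assumes the figures quoted above (ten free Metrics, then a flat $0.30/month each) and that the free allowance applies once across everything; the project count of 100 is purely hypothetical, and the real pricing may well be tiered or counted differently.

```python
def cloudwatch_monthly_cost(n_projects, metrics_per_project=4,
                            price_per_metric=0.30, free_metrics=10):
    """Estimate the monthly CloudWatch bill in USD.

    Assumptions (not confirmed by the AWS docs quoted above):
    - every tracked number counts as exactly one billed "Metric"
    - the free allowance is applied once, across all projects
    - the price is flat, with no volume tiers
    """
    total_metrics = n_projects * metrics_per_project
    billable = max(0, total_metrics - free_metrics)
    return billable * price_per_metric


# Hypothetical example: 100 projects, four Metrics each.
estimate = cloudwatch_monthly_cost(100)
print(f"${estimate:.2f} per month")
```

Under those assumptions the cost grows linearly with the number of projects, so the answer to "what is a Metric?" directly determines whether this is affordable.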

Also note this (https://docs.aws.amazon.com/AmazonS3/latest/userguide/metrics-configurations.html):

"You can have a maximum of 1,000 metrics configurations per bucket."

Is a "Metrics Configuration" the same thing as a "Metric"? I'm guessing not.

If we had 100 buckets, and we wanted to know the number of requests for each bucket, how many Metrics Configurations would be required? How many Metrics would we be charged for?

If we had 100 prefixes within a single bucket, and we wanted to know the number of requests for each prefix, how many Metrics Configurations would be required? How many Metrics would we be charged for?
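For the single-bucket case, my reading is that "request counts per prefix" would mean one Metrics Configuration per prefix, each with a prefix filter. Below is a minimal sketch of that, assuming boto3; the bucket name, prefix names and the `prefix-NNNN` ID scheme are made up for illustration. It only shows how the configurations would be created and does not answer how many billed Metrics each one produces.

```python
def prefix_metrics_configs(prefixes):
    """Build one S3 metrics configuration (as a dict) per prefix."""
    configs = []
    for i, prefix in enumerate(prefixes):
        configs.append({
            "Id": f"prefix-{i:04d}",        # hypothetical naming scheme
            "Filter": {"Prefix": prefix},   # only count requests under this prefix
        })
    return configs


def apply_configs(bucket, configs):
    """Send the configurations to S3. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    s3 = boto3.client("s3")
    for cfg in configs:
        s3.put_bucket_metrics_configuration(
            Bucket=bucket, Id=cfg["Id"], MetricsConfiguration=cfg
        )


# Hypothetical prefixes, one per published project.
configs = prefix_metrics_configs(["project-a/1.0/", "project-b/2.1/"])
# apply_configs("example-bucket", configs)  # not run here: needs a real bucket
```

With 100 prefixes this creates 100 configurations, which stays under the 1,000-per-bucket limit quoted above but would not scale to thousands of projects in one bucket.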

bemoody, Sep 29 '23 15:09

S3 Storage Lens (https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage_lens_basics_metrics_recommendations.html)

This provides (https://aws.amazon.com/s3/pricing/):

  • "Free metrics" which don't include any request or transfer metrics.
  • "Advanced metrics" which cost $0.20/month per million objects.

(FWIW, we currently have about 31 million files on PhysioNet.)
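As a rough estimate, taking the quoted $0.20/month per million objects at face value and assuming one file becomes one S3 object (both assumptions; the real pricing may be tiered):

```python
def storage_lens_advanced_cost(n_objects, price_per_million=0.20):
    """Estimated monthly cost in USD of Storage Lens advanced metrics.

    Assumes a flat per-million-objects price with no tiers or minimums.
    """
    return n_objects / 1_000_000 * price_per_million


# About 31 million files, assumed to map one-to-one onto S3 objects.
print(f"${storage_lens_advanced_cost(31_000_000):.2f} per month")
```

That comes to roughly $6 per month at the current size, independent of how many projects or prefixes the objects are split across.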

Metrics are supposedly provided with one-day granularity (or at least, the data is exported once per day).

Here they document what things can be measured: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage_lens_metrics_glossary.html

"Advanced metrics" include "Prefix aggregation" which sounds like what we'd want. It's hard to find documentation, but the example JSON file shown here is suggestive: https://docs.aws.amazon.com/AmazonS3/latest/userguide/S3LensCLIExamples.html

bemoody, Sep 29 '23 15:09

Finally, another possibility would be to store complete request logs (which can be dumped into another S3 bucket) and analyze them ourselves. The storage and transfer costs would likely be considerable.
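To give an idea of what "analyze them ourselves" could look like, here is a minimal sketch that computes requests, bytes sent and distinct client IPs per day. It assumes the space-separated S3 server access log layout (owner, bucket, bracketed timestamp, client IP, requester, request ID, operation, key, quoted request line, status, error code, bytes sent, ...); the sample lines are invented, and a real parser would need to handle more edge cases.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches only the leading fields we need; trailing fields are ignored.
LOG_RE = re.compile(
    r'^(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'(?P<request_uri>"[^"]*"|-) (?P<status>\S+) (?P<error>\S+) '
    r'(?P<bytes_sent>\S+)'
)


def summarize(lines):
    """Return {day: {"requests", "bytes_sent", "distinct_clients"}}."""
    requests = defaultdict(int)
    bytes_sent = defaultdict(int)
    clients = defaultdict(set)
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue  # skip lines that don't look like access log records
        when = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
        day = when.date().isoformat()
        requests[day] += 1
        if m["bytes_sent"] != "-":  # "-" means no body was sent
            bytes_sent[day] += int(m["bytes_sent"])
        clients[day].add(m["ip"])
    return {
        day: {
            "requests": requests[day],
            "bytes_sent": bytes_sent[day],
            "distinct_clients": len(clients[day]),
        }
        for day in requests
    }


# Invented sample records, not real log data.
SAMPLE = [
    'OWNERID example-bucket [29/Sep/2023:15:09:00 +0000] 192.0.2.1 - REQ0001 '
    'REST.GET.OBJECT files/a.dat "GET /files/a.dat HTTP/1.1" 200 - 1000 1000 12 10 "-" "curl/8.0" -',
    'OWNERID example-bucket [29/Sep/2023:16:00:00 +0000] 192.0.2.2 - REQ0002 '
    'REST.GET.OBJECT files/b.dat "GET /files/b.dat HTTP/1.1" 200 - 2500 2500 20 18 "-" "curl/8.0" -',
    'OWNERID example-bucket [30/Sep/2023:01:00:00 +0000] 192.0.2.1 - REQ0003 '
    'REST.HEAD.OBJECT files/a.dat "HEAD /files/a.dat HTTP/1.1" 200 - - 1000 5 - "-" "curl/8.0" -',
]

for day, stats in sorted(summarize(SAMPLE).items()):
    print(day, stats)
```

This counts bytes sent to clients in general; it does not by itself distinguish traffic that stays inside the Amazon network from traffic that leaves it.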

bemoody, Sep 29 '23 15:09