
Use a separate service to generate audio waveforms

Open sarayourfriend opened this issue 3 years ago • 8 comments

Problem

Currently the audio waveforms are created upon request in the API. This has two effects:

  1. The API must either cache the waveforms in its own table or else recreate them upon every request
  2. The API (which is open to the world) has an extra binary installed, audiowaveform, which could present an unnecessary vulnerability

Description

From @AetherUnbound in a private chat:

I would like to see if it would be possible to generate them during ingestion. If that proves feasible, it has the added benefit of removing the audiowaveform dependency from the API. We can start populating the waveforms now in the catalog, and then swap over to the column in the API once we’ve backfilled everything.

Alternatives

Continue just creating the waveforms in the API and accept the two issues in the description.

Implementation

  • [ ] 🙋 I would be interested in implementing this feature.

sarayourfriend avatar Feb 14 '22 22:02 sarayourfriend

I imagine it'll be a very interesting exercise to write a thin API wrapper (using something fast and close to the metal like Go) over the BBC audiowaveform library. Given a URL it can return (and possibly also cache) the waveform. This allows us to make it work similar to the imageproxy thumbnail service in the API without tightly coupling it to the API or the ingestion server.
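As a rough illustration of the "thin wrapper" idea, here is a minimal sketch in Python (rather than Go, only to match the existing API code). It assumes the audio has already been downloaded to a local file, and that the audiowaveform CLI accepts the `-i`, `-o`, `--pixels-per-second` and `-b` flags and writes JSON with a `data` field; the function names `build_command` and `generate_waveform` are hypothetical.

```python
import json
import subprocess
import tempfile
from pathlib import Path


def build_command(audio_path: str, json_path: str, pixels_per_second: int = 20) -> list[str]:
    """Assemble the audiowaveform invocation (flags assumed from its CLI docs)."""
    return [
        "audiowaveform",
        "-i", audio_path,
        "-o", json_path,
        "--pixels-per-second", str(pixels_per_second),
        "-b", "8",
    ]


def generate_waveform(audio_path: str) -> list[int]:
    """Run audiowaveform on a local file and return its JSON "data" values."""
    with tempfile.TemporaryDirectory() as tmp:
        json_path = str(Path(tmp) / "waveform.json")
        # check=True raises CalledProcessError if the binary exits non-zero.
        subprocess.run(
            build_command(audio_path, json_path),
            check=True,
            capture_output=True,
        )
        return json.loads(Path(json_path).read_text())["data"]
```

A real service would add the download step, an HTTP layer and caching around `generate_waveform`; those are left out here to keep the sketch small.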

dhruvkb avatar Feb 15 '22 11:02 dhruvkb

Oh, interesting idea, Dhruv. Do you mean basically writing a wrapper around audiowaveform that memoizes the calls to it? I could even see this being some kind of general-purpose CLI utility that creates a Unix socket to make the requests against or something. Though if you used something with a fast boot time (I don't know if Go qualifies for that; I can't find any research online on how fast Go binaries boot vs Rust binaries, but based on what I'm reading Go is fast mostly at compilation, and Rust will have faster execution speeds, at the cost of compilation of course) then you could just call the binary directly each time and have it establish a configured connection with Redis or the like. A long-lived daemon might shave some time there though.

However, I have to say that might be over-complicating it when you could just memoize the calls to the audiowaveform binary in Python against a long-lived Redis cache? py-memoize for example can accommodate a Redis or memcached backend.
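A minimal sketch of that memoization, under some assumptions: it does not use py-memoize's API, the cache backend is anything with `get`/`set` methods (a redis-py client fits that shape), and `generate` is whatever function actually shells out to audiowaveform. `InMemoryBackend` and `memoized_waveform` are hypothetical names, and the dict-backed backend is only there so the example runs without Redis.

```python
import hashlib
import json
from typing import Callable, Optional, Protocol


class CacheBackend(Protocol):
    """Minimal get/set interface; a redis-py client satisfies it."""

    def get(self, name: str) -> Optional[bytes]: ...

    def set(self, name: str, value: bytes) -> object: ...


class InMemoryBackend:
    """Dict-backed stand-in for Redis so the sketch runs anywhere."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def get(self, name: str) -> Optional[bytes]:
        return self._data.get(name)

    def set(self, name: str, value: bytes) -> object:
        self._data[name] = value
        return True


def memoized_waveform(
    audio_url: str,
    generate: Callable[[str], list[int]],
    backend: CacheBackend,
) -> list[int]:
    """Return cached peaks for audio_url, calling generate only on a miss."""
    key = "waveform:" + hashlib.sha256(audio_url.encode("utf-8")).hexdigest()
    cached = backend.get(key)
    if cached is not None:
        return json.loads(cached)
    peaks = generate(audio_url)
    backend.set(key, json.dumps(peaks).encode("utf-8"))
    return peaks
```

Hashing the URL keeps the cache key a fixed length regardless of how long the source URL is. Expiry is deliberately left out; with Redis it could be handled by passing a TTL when setting the key.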

Then again, all of that might be over-complicating it when storing it in the database could be the simplest solution and provide sufficient performance anyway.

My vote would probably be to go the simple database route, measure the performance, and if we see some noticeable peaks in the 95P+ range then look into improving it.

That being said, there are probably other parts of our stack, especially in the API, that could use that same kind of analysis and I'm eager for us to get some monitoring in place that will allow us to do that.

sarayourfriend avatar Feb 15 '22 11:02 sarayourfriend

Can we close this issue now that WordPress/openverse-catalog#529 has been merged (and deployed 🚀), or is this a more broad issue, @sarayourfriend ?

obulat avatar Mar 18 '22 05:03 obulat

I think this is still an issue that needs to be directly addressed. Relying on manually running a Django command to "warm the cache" of waveforms, as it were, is not a sustainable (or desirable) solution in the long term.

sarayourfriend avatar Mar 18 '22 13:03 sarayourfriend

As discussed recently, I'm leaning towards the following:

I think we will want to codify a pattern for things like waveforms, thumbnails, etc. Anything that isn’t a necessary piece of data to provide in the API, but that we’d like to display in the front-end, should be generated dynamically on read and cached aggressively. If it’s not ‘data’ it shouldn’t be in the catalog, but a reference to it could be served by the API (for example thumbnail_url, waveform_url, etc.).

Our dataset is so large that I don’t think running computations against media during ingestion is going to work.

So basically, I don't personally think we should warm the cache at all. To revisit the original problems:

  1. The API must either cache the waveforms in its own table or else recreate them upon every request

With my discussed approach, we would remove any waveform data from the DB and instead treat them like we do image thumbnails, where the API response includes a reference to the waveform data
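To make the "reference instead of data" idea concrete, here is a small sketch of how a response could be shaped. Everything specific in it is an assumption for illustration: the `waveform_peaks` field, the `audio/<identifier>/waveform/` path and the `serialize_audio` helper are hypothetical, not the API's actual schema.

```python
from urllib.parse import urljoin


def serialize_audio(record: dict, api_base: str) -> dict:
    """Return the API body with a waveform reference instead of peak data."""
    # Drop any stored peaks (hypothetical field) so they never reach the response.
    body = {key: value for key, value in record.items() if key != "waveform_peaks"}
    # api_base is expected to end with "/" so urljoin appends rather than replaces.
    body["waveform_url"] = urljoin(api_base, f"audio/{record['identifier']}/waveform/")
    return body
```

The client then fetches `waveform_url` only when it needs to draw the waveform, the same way it already does for thumbnails.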

  2. The API (which is open to the world) has an extra binary installed, audiowaveform, which could present an unnecessary vulnerability

This waveform generator would become a standalone microservice.

I'm open to closing this issue, I don't think it explicitly relates to the catalog anymore.

zackkrida avatar May 16 '22 20:05 zackkrida

Let's move the issue to either openverse or openverse-api, wherever we think it'd make the most sense to record the need for an entirely new service.

sarayourfriend avatar May 19 '22 06:05 sarayourfriend

Moved the issue to the API, since that's where the thumbnail service currently resides.

AetherUnbound avatar May 24 '22 00:05 AetherUnbound

@krysal I'm going to move this out of the todo column. I don't think it's a realistic goal for the next two weeks, and there might be some infra considerations to deal with first.

zackkrida avatar Aug 09 '22 21:08 zackkrida