openverse-api
# Use a separate service to generate audio waveforms

## Problem
Currently the audio waveforms are created upon request in the API. This has two effects:
- The API must either cache the waveforms in its own table or else recreate them upon every request
- The API (which is open to the world) has an extra binary installed, `audiowaveform`, which could present an unnecessary vulnerability
## Description
From @AetherUnbound in a private chat:
I would like to see if it would be possible to generate them during ingestion. If that proves feasible, it has the added benefit of removing the audiowaveform dependency from the API. We can start populating the waveforms now in the catalog, and then swap over to the column in the API once we’ve backfilled everything.
## Alternatives

Continue creating the waveforms on request in the API and accept the two issues described in the problem above.
## Implementation
- [ ] 🙋 I would be interested in implementing this feature.
I imagine it'll be a very interesting exercise to write a thin API wrapper (using something fast and close to the metal like Go) over the BBC audiowaveform library. Given a URL, it can return (and possibly also cache) the waveform. This allows us to make it work similarly to the imageproxy thumbnail service in the API without tightly coupling it to the API or the ingestion server.
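To make the wrapper idea concrete, here is a minimal sketch of its core in Python rather than Go (the thread leaves the language open). The `cache_key` and `generate_waveform` names are hypothetical, and the `audiowaveform` flags shown should be checked against the installed binary's `--help` before relying on them.

```python
import hashlib
import subprocess
import tempfile


def cache_key(url: str) -> str:
    """Stable cache key derived from the source audio URL."""
    return "waveform:" + hashlib.sha256(url.encode()).hexdigest()


def generate_waveform(audio_path: str) -> bytes:
    """Shell out to the BBC audiowaveform binary and return JSON peak data.

    Assumes the audio has already been downloaded to `audio_path`.
    The -i/-o flags are from the audiowaveform CLI; verify them against
    the version actually installed.
    """
    with tempfile.NamedTemporaryFile(suffix=".json") as out:
        subprocess.run(
            ["audiowaveform", "-i", audio_path, "-o", out.name],
            check=True,
        )
        return out.read()
```

A service built on this would look up `cache_key(url)` first and only fall through to `generate_waveform` on a miss, mirroring how the thumbnail proxy avoids regenerating images.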
Oh, interesting idea, Dhruv. Do you mean basically writing a wrapper around audiowaveform that memoizes the calls to it? I could even see this being some kind of general-purpose CLI utility that creates a Unix socket to make requests against, or something similar. Though if you used something with a fast boot time (I don't know whether Go qualifies; I can't find research comparing how fast Go binaries boot vs Rust binaries, but from what I'm reading, Go is mostly fast at compilation while Rust has faster execution speed, at the cost of compile time), then you could just call the binary directly each time and have it establish a configured connection with Redis or the like. A long-lived daemon might shave some time there, though.
However, I have to say that might be over-complicating it when you could just memoize the calls to the audiowaveform binary in Python against a long-lived Redis cache? py-memoize for example can accommodate a Redis or memcached backend.
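The memoize-in-Python suggestion above can be sketched without committing to a particular library. Here a plain in-memory class stands in for a Redis client (it implements only the `get`/`set` subset used), and the fake peak data marks where the real `audiowaveform` call would go; all names are illustrative.

```python
import functools
import json


class InMemoryCache:
    """Stand-in for a Redis client; implements only get/set as used below."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def memoize_waveform(cache):
    """Memoize a waveform generator against any cache exposing get/set.

    In production the cache could be a long-lived Redis connection; the
    generator would shell out to the audiowaveform binary.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(audio_id):
            cached = cache.get(audio_id)
            if cached is not None:
                return json.loads(cached)
            result = fn(audio_id)
            cache.set(audio_id, json.dumps(result))
            return result
        return wrapper
    return decorator


calls = []


@memoize_waveform(InMemoryCache())
def waveform(audio_id):
    calls.append(audio_id)   # pretend this runs the audiowaveform binary
    return [0, 5, 9, 5, 0]   # fake peak data for the sketch


waveform("abc")
waveform("abc")  # second call is served from the cache
```

After the two calls, the underlying generator has run only once, which is the whole point of fronting the binary with a cache.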
Then again, all of that might be over-complicating it when storing it in the database could be the simplest solution and provide sufficient performance anyway.
My vote would probably be to go the simple database route, measure the performance, and if we see some noticeable peaks in the 95P+ range then look into improving it.
That being said, there are probably other parts of our stack, especially in the API, that could use that same kind of analysis and I'm eager for us to get some monitoring in place that will allow us to do that.
Can we close this issue now that WordPress/openverse-catalog#529 has been merged (and deployed 🚀), or is this a broader issue, @sarayourfriend ?
I think this is still an issue that needs to be directly addressed. Relying on manually running a Django command to "warm the cache" of waveforms, as it were, is not a sustainable (or desirable) solution in the long term.
As discussed recently, I'm leaning towards the following:
I think we will want to codify a pattern for things like waveforms, thumbnails, etc.—anything that isn’t a necessary piece of data to provide in the API, but that we’d like to display in the front-end, should be generated dynamically on read, and cached aggressively. If it’s not ‘data’ it shouldn’t be in the catalog, but a reference to it could be served by the API (for example `thumbnail_url`, `waveform_url`, etc.).
Our dataset is so large that I don’t think running computations against media during ingestion is going to work.
So basically, I don't personally think we should warm the cache at all. To revisit the original problems:
- The API must either cache the waveforms in its own table or else recreate them upon every request
With the approach I described, we would remove any waveform data from the DB and instead treat waveforms like we do image thumbnails, where the API response includes a reference to the waveform data.
- The API (which is open to the world) has an extra binary installed, audiowaveform which could present an unnecessary vulnerability
This waveform generator would become a standalone microservice.
I'm open to closing this issue, I don't think it explicitly relates to the catalog anymore.
Let's move the issue to either openverse or openverse-api, wherever we think it'd make the most sense to record the need for an entirely new service.
Moved the issue to the API, since that's where the thumbnail service currently resides.
@krysal I'm going to move this out of the todo column; I don't think it's a realistic goal for the next two weeks, and there might be some infra considerations to deal with first.