hackage-server Reverse dependencies: Exploring a better data model and storage design

Context: https://github.com/haskell/hackage-server/issues/40

@gbaz and I talked about exploring the data model and storage design space to store reverse-dependencies.

From what I can read in packdeps' source code (which does an admirable job at this task), a binary file produced from processing the Hackage index is used to store the necessary information.

@phadej @snoyberg Your experience is more than welcome on this thorny issue. :)

Aug 02 '21 20:08 Kleidukos

The unmerged PR for revdeps lives here: https://github.com/haskell/hackage-server/pull/723

There may be useful UI code, etc. in there. The obstacle to merging was the memory footprint.

My overall feeling is the problem is we construct a complex structure that tries to answer every possible question. Instead we should enumerate the few key questions we want to answer. In my mind, that's as follows:

Immediate rev-deps
Total count of transitive rev-deps
Packages pinned to no more than a certain version.

My suggestion was that a simpler architecture could be to use a standard relational database to encode the dependency constraints -- for each package/version we simply record each other package name it depends on, as well as the min and max version bounds. (approximating when the bounds look weirder). This lets us query directly for immediate rev-deps and cache/store them in memory, and also lets us do the more restrictive query to answer #3.

Periodically, we could run "rollup" queries to calculate the total set of transitive rev-deps, and store just the count in memory to cut down on footprint.

I suspect this would give a much reduced memory footprint compared to the existing code -- although I will point out that the existing code is pretty neat in that it has like navigable javascript graphs and soforth. So maybe we could just move to constructing the existing datastructure for it out of band as well and storing the data for those graphs in blob storage or something, if we think they're sufficiently cool. That said, I haven't played with the existing implementation and explored how neat/usable/useful the graphs really are in quite some time, so I'd encourage someone interested to explore a bit...

Aug 02 '21 20:08 gbaz

Periodically, we could run "rollup" queries to calculate the total set of transitive rev-deps, and store just the count in memory to cut down on footprint.

Yep, this is definitely something I would go with.

Aug 02 '21 20:08 Kleidukos

I agree with @gbaz for the most part. However, I am a little confused about the 'memory footprint' being referred to. If we're using a standard relational schema to encode dependency information (which is a graph), the memory cost would be on-disk (since I don't think there's a good in-memory database solution that would do what we want and is open-source). There is a quadratic cost in size involved (since you're basically storing something like an adjacency matrix), but this is disk, not RAM, so I don't think it's that big a deal.

I may be misunderstanding the scope of the issue though - clarification is welcome. My current 10 cents more or less amounts to:

Encode the dependency graph of everything properly, using standard relational methods;
Write a (parameterized) query to do this kind of lookup.

This should be both reasonably fast, and not that difficult to do (or at least, I know how I would do it). Am I missing something?

Aug 04 '21 21:08 kozross

The memory footprint as discussed referred solely to resident memory. The prior design was entirely in memory, and that was (partly) the issue.

Sep 04 '21 04:09 gbaz

hackage-server hackage-server copied to clipboard

Reverse dependencies: Exploring a better data model and storage design

hackage-server
hackage-server copied to clipboard