metacat icon indicating copy to clipboard operation
metacat copied to clipboard

use closure table for version chain

Open mbjones opened this issue 5 years ago • 16 comments

Our current approach to linking versions of objects in DataONE and Metacat is to provide a pointer in the SystemMetadata for each object that points at the objects that it obsoletes and those that are obsoletedBy it. This represents a doubly-linked list that can be traversed to reconstruct version information. The obsoletes and obsoletedBy fields are in the Metacat postgres database as columns in the system_metadata table, with the abbreviated structure:

guid, obsoletes, obsoleted_by

From a hierarchical data modeling perspective, this structure is called an adjacency list, and allows one to trace the version chain through a series of queries, each of which walks the version chain either forwards or backwards. Without support for hierarchical queries (which are proprietary and perform poorly), this generally means issuing n-1 queries, where n is the length of the version chain.

Much has been written about querying hierarchical data. Some good background reading is:

  • https://www.slideshare.net/billkarwin/models-for-hierarchical-data
  • https://dirtsimple.org/2010/11/simplest-way-to-do-tree-based-queries.html

From the first set of slides, and other reading, I have concluded that closure tables are a much more efficient and fast way to store and query this hierarchical data. It involves creating a new table for just the hierarchical information, and creating one row for each link in the version chain, including links at each level of the hierarchy. This table structure allows a single query to retrieve all parent and child information in the tree, and so is extremely fast, and only marginally more expensive in terms of storage. The articles linked above explain much more thoroughly than I will here, but the essences is that the closure table contains the following columns:

parent, child, depth

A simple query to get all of the versions for a given object would be:

select * from closure_table where parent = 'P1';

We'll need to develop out a more mature design, and an API to go along with this, but I think closure tables will allow us to provide efficient version information in our services with a new API. Below I attach my notes on the design and use of closure tables for reference.

closure-tables-notes.pdf

Comments and expansion of ideas appreciated. @csjx, @amoeba, @gothub, @taojing2002 @laurenwalker

mbjones avatar Feb 21 '20 03:02 mbjones

Interesting! Thank you for introducing this concept. I can see this getting utilized for the D1 Metrics Service, where we traverse back in the version chain to calculate the metrics.

rushirajnenuji avatar Feb 21 '20 05:02 rushirajnenuji

Really cool, @mbjones, I hadn't heard of a closure table before. I think having one or more O(1) APIs around version chains would be very helpful across current and future projects.

Another approach I implemented recently and mostly like are recursive common table expressions (CTE). My understanding is that they're O(n) but the O(n) happens inside the database engine so queries for even large hierarchies return in <1ms, as opposed to O(n) via HTTPS which is very expensive by comparison.

The possible benefit I see to CTEs over closure tables is that they don't require a second table and work on existing implicit hierarchies in your tables. As for downsides (from a quick web search) their performance seems to be on the same magnitude as closure tables but a bit slower.

Here's an example:

Given a table like our systemmetadata table in Metacat's database representing a chain from PIDs a->b->c:

guid obsoleted_by obsoletes
a b NULL
b c a
c NULL b

We can query for 'a's descendants (note that the string 'a' is embedded in the query below because we're querying for 'a's descendants):

WITH RECURSIVE
  children(x) AS (
    VALUES ('a')
  UNION
    SELECT guid FROM sysmeta, children
    WHERE sysmeta.obsoletes = children.x
  )
  SELECT guid FROM sysmeta
  WHERE sysmeta.guid IN (select * from children);
guid
a
b
c

A recursive CTE is really effectively asking the query engine to walk from row to row, forming unions with matched rows as it goes. Or at least that's my basic understanding. Food for thought.

amoeba avatar Feb 21 '20 21:02 amoeba

Thanks @amoeba for the thoughts on CTEs. In my post when I referred to "support for hierarchical queries", I was referring to CTEs, which were introduced in SQL:99 and are described on slide 17 of Karwin's slideset that I linked. Historically, each RDBMS has implemented the syntax and algorithm for CTEs in a proprietary way, even though they are all very similar. Postgres 9.5 first provided the SQL:99 version of WITH, whereas Oracle used to use CONNECT BY, and now implements both. Mysql also recently implemented them in their 2018 8.0.11 release. So it seems CTEs are widespread enough now to be used across platforms.

The tradeoff is this: CTE-based queries are much more complex to write; while closure tables are more complicated to maintain (probably involving triggers). But using closure tables, we can use really fast and simple queries to get complex results. Here's a quick example of some queries we would probably want to support via API.

create table version_closure (
    ancestor text NOT NULL, 
    descendant text NOT NULL, 
    depth integer, 
    PRIMARY KEY (ancestor, descendant)
);
INSERT INTO version_closure VALUES
    ('P1', 'P1', 0),
    ('P2', 'P2', 0),
    ('P3', 'P3', 0),
    ('P1', 'P2', 1),
    ('P2', 'P3', 1),
    ('P1', 'P3', 2);
  • All descendants (analogous to your CTE query above)
SELECT descendant FROM version_closure WHERE ancestor = 'P1' ORDER BY DEPTH;
  • Immediate descendants (but not grandchildren) (i.e., obsoletedBy)
SELECT descendant FROM version_closure WHERE ancestor = 'P1' AND depth = 1;
  • Latest descendant (i.e., latest version HEAD)
SELECT descendant FROM version_closure WHERE ancestor = 'P1' ORDER BY depth DESC LIMIT 1;
  • All ancestors (i.e., full version chain ordered newest to oldest)
SELECT ancestor FROM version_closure WHERE descendant = 'P3' ORDER BY depth;

So it's that simplicity that I like. But it comes at the expense of having to maintain the closure table. Even a year or two ago I would have said that CTEs were not widely implemented enough to rely on them, but as of 2018 I no longer think that is the case, and so we could choose to go that route as well. It certainly would be nice to not have a separate table. And it might be that, despite their complexity, we simply need to implement the CTE queries once as part of a new API that might include methods like getDescendants(pid), getFirstDescendant(pid), and getLastDescendant(pid), for example. So seems to me either approach could work.

Other pros and cons? Do we need new API methods like I described, or do we just need this version chain accessible from SOLR somehow and facetable? Let's discuss.

mbjones avatar Feb 22 '20 01:02 mbjones

To move this discussion along a bit since this functionality is becoming even more important, I'll suggest a modified version Matt's proposal for a new API. I definitely think we could use an API independent of the Solr index, but we could also populate the index as well. Given the usability issues with names like ascendant, descendant, antecedent, ancestor, predecessor, successor, etc., I'm wondering about the following API (hopefully fairly intuitive):

Given an obsolescence chain with an ordered pid list of:

A, B, C, D, E, F, G

Listing versions

(for these examples, id = "D")

  • listVersions(id) : VersionList - List all versions (antecedents and descendants) given the id (pid or sid), where VersionList looks something like this in XML:
<versions referenceId="D">
    <identifier>A</identifier>
    <identifier>B</identifier>
    <identifier>C</identifier>
    <identifier>D</identifier>
    <identifier>E</identifier>
    <identifier>F</identifier>
    <identifier>G</identifier>
</versions>

which could also be represented in JSON as something like:

{
    "referenceId": "D"
    "versions": ["A", "B", "C", "D", "E", "F", "G"]
}
  • listPriorVersions(id) : VersionList - List all prior versions (antecedents) of the given id
<versions referenceId="D">
    <identifier>A</identifier>
    <identifier>B</identifier>
    <identifier>C</identifier>
</versions>
  • listSubsequentVersions(id) : VersionList - List all subsequent versions (antecedents) of the given id
<versions referenceId="D">
    <identifier>E</identifier>
    <identifier>F</identifier>
    <identifier>G</identifier>
</versions>

These methods could potentially be reduced to a single call like:

  • listVersions(id, range="prior|subsequent") : VersionList or
  • listVersions(id, range="prior|subsequent", count=5) : VersionList (count could limit how many are returned prior to, subsequent to, or on both sides of id)

Getting a specific version

(for these examples, id = "D")

  • getFirstVersion(id) : Identifier - Get the first version (antecedent) given the id (pid or sid)
<identifier>A</identifier> # this would be a DataONE Types.Identifier
  • getPriorVersion(id) : Identifier - Get the prior version (antecedent) given the id (pid or sid)
<identifier>C</identifier>
  • getSubsequentVersion(id) : Identifier - Get the subsequent version (descendant) given the id (pid or sid)
<identifier>E</identifier>
  • getLastVersion(id) : Identifier - Get the last version(descendant) given the id (pid or sid)
<identifier>G</identifier>

These methods could potentially be reduced to:

  • getVersion(id, position="first|prior|subsequent|last") : Identifier

So, this warrants more thought. I'd love to hear what resonates or doesn't. I'm already wondering if subsequent should be next in these calls. 🤔

csjx avatar Sep 08 '20 21:09 csjx

@chris the reductionist version of these look good to me:

  • listVersions(id, range="prior|subsequent", count=5) : VersionList
  • getVersion(id, position="first|prior|subsequent|last") : Identifier

It's reasonable to follow the method name convention of getXXX returning a single value and listXXX returning multiple values, as that is mostly the convention followed by the DataONE API, MNCore.getLogRecords() being an exception.

If this naming convention isn't followed, then these could be reduced to a single method, with the default being to return all values, if range and count not being specified.

  • listVersions(id): returns entire chain
  • `listVersions(id, range="prior", count=1): returns 1 previous id
  • `listVersions(id, range="first"): returns first
  • ...

gothub avatar Sep 09 '20 16:09 gothub

Thanks @csjx, this looks like a great start. I think what would be useful to evaluate it is to consider what use cases we want the API to solve with these APIs. Can we develop those out? Here are a few to get started:

Version info on landing pages

  • Display a dropdown on landing pages with all versions listed to navigate to older or newer versions
  • One request to listVersions(id) on each landing page load
  • Primary client: MetacatUI

Version info for metadata-check analysis

  • Need full version chain information for all dataset versions in DataONE, allowing us to analyze changes in checks across versions and time for specific collections (either all of DataONE, or for sub-collections associated with repositories or portals or users. This generally gets used 1) when a new collection is created, the collection-wide check graphs would be created, or 2) the graphs are updated once every 24 hours
  • currently, the analyses provide two types of version info: 1) an indicator of the version chain that a dataset is in; 2) and an ordered serial version number within that group (e.g., version=[1,2,3,...])
  • Ideally we want the version information to be in a collection-specific data source that is linked to each PID and is correlated to the check-assessment data; in our analytical scripts, we access this data in tabular form for all PIDs in the collection
PID versionGroup version formatId dateUpdated ...
P1 group1 1 text/csv 2020-01-01
P2 group1 2 text/csv 2020-02-01
P3 group1 3 text/csv 2020-03-01
P4 group2 1 text/csv 2019-06-01
P5 group2 2 text/csv 2019-07-01
  • I'm not sure we would use the API described above at all for this use case, but rather want the data accessible in SOLR or another integrated data source along with other classifiers for a PID that are used to facet sub-collections
  • Primary client: MetaDIG graph service

Version info for metric analysis

  • Same as previous, but for the metrics service; need to determine if and how it differs
  • Primary client: Metrics service

mbjones avatar Sep 09 '20 18:09 mbjones

Another use case:

Version info for cloning content programatically

  • Using clients in Java, R, and Python, we have needed to clone one or more objects from one member node to another, along with all prior versions. In some cases, we need to migrate all content from one member node to another.

  • This usually involves repeated calls to MN.listObjects(). For each object, we need to get the first version, and have traditionally traversed down the version chain with repeated calls to MN.getSystemMetadata(pid) to read the obsoletes property.

  • In this use case, we would make one call to MN.getVersion(pid, "first") (or MN.listVersions(pid, "first") if we decide on a single API call). We then are able to start migrating the objects with calls to MN.get(pid) and MN.getSystemMetadata(pid).

@rushirajnenuji - Will you add in the Metrics use cases that @mbjones mentioned above?

csjx avatar Sep 10 '20 14:09 csjx

Since a version chain can be infinitely long, it seems to me like we'd want full pagination support. This'd mean having a start or offset argument to pair with count on listVersions. Essentially the same thing listObjects already has.

And here's a use case we currently have that could be improved by this API change:

Use case: Forwarding client requests to the latest version

Clients include MetacatUI, R, etc . Current implementations walk the version chain forward with repeated getSytsemMetadata calls and this API would reduce the number of requests the client needs to send from n to one.

Edit: I see this is basically https://github.com/NCEAS/metacatui/issues/1400.

amoeba avatar Sep 10 '20 19:09 amoeba

Version info for generating metrics.

  • Use case 1: For the Metrics Service, we'll want a list of identifiers for the current version as well as all the identifiers from the previous versions of that dataset. We then aggregate the total counts of read events across this list to generate metrics.

  • Use case 2: DataONE aggregated metrics: portal metrics and user profile metrics. For this, we follow the same procedure as Usecase 1, but for the entire collection of datasets.

rushirajnenuji avatar Sep 10 '20 19:09 rushirajnenuji

We discussed this on our Sep 10, 2020 dev call and continued in a break out afterwards for a bit. I'm commenting here to bring some of that discussion back here for visibility.

From looking at our use cases, we found we really have three categories:

  1. Clients that need information about one or more versions in an object series. e.g., given an object's PID, tell me newest version
  2. Clients that need a virtual identifier to identify, resolve, and detect membership within the entire series of versions for an object. This is kinda like a series ID, but would be 1:1 with the entire series, whereas an object version series can have more than one series ID. e.g., Metadig calculates scores for an entire object version series and needs an identifier to group by
  3. Clients that need a virtual identifier to identify and resolve all objects and versions within a collection of objects. e.g., Metadig & metrics for portals

(1) is different from (2) and (3) and (2) and (3) are alike to one another in that they're about the idea of virtual identifiers. (3) is complicated enough we might stick to trying to tackle just (1) and possibly (2) for now.

Where is this information available to clients?: None of the use cases are real-time enough that we strictly to make the information available directly from the database which means Solr could be the place where this information lives and we don't necessarily need to expose any new APIs here.

Is version information always public or restricted to those who can read? If we exposed an API like listVersions above, would it include in its response versions the client can't read? If not, handling the return value gets tricky because of the gaps. If we store a virtual identifier as another Solr field, object visibility results would apply by default.

What can we solve?

Another way to describe (1) is that clients are currently walking sysmeta via the obsoletes and obsoletedBy properties. This is slow, foremost, but is also limiting (as seen in https://github.com/NCEAS/metacatui/issues/1400) because that walking process stops if the client doesn't have access to a version in the middle of the chain.

  • An API like listVersions solves this because clients can make the additional request to listVersions to get the info it needs.
  • A virtual identifier also solves this because the client can query the Solr index for other objects with the same virtual identifier

Another way to describe (2) and (3) is that external services such as Metadig & Metrics are doing their own version chain crawl (use case group 1) and creating virtual identifiers for object series (use case group 2).

  • An API like listVersions helps them with crawling version chains but they still need to generate virtual identifiers
  • Having a virtual identifier as a first-class identifier in DataONE means all clients can use this instead of re-creating their own. This might lead to better integration across services.

Steps forward

Ignoring how for now, we have two things we can do:

A. Implement a versions API B. Implement a virtual (or even a real) identifier for every object version series

(A) solves some of our problems and, IMO, is valuable in its own right whether it solves our problems or not. And it's the simpler/easier thing to do. (B) solves more/most of our problems but is a bit more involved though still tractable.

My vote at this point is for starting with a versions API (A) implemented with recursive CTEs. This avoids major changes to the database (which increases our time to implement and impacts all users when we/they upgrade), appears to be plenty performant, and can have its implementation swapped out at a later date if we implement some form of virtual identifier.

Please have a read and let me know if I've represented your case accurately, missed any glaring upsides/downsides, etc.

amoeba avatar Sep 12 '20 01:09 amoeba

@taojing2002 @artntek Let's please discuss this version chain API request wrt upcoming Metacat releases. If we take the CTE approach as @amoeba concludes, then this could be done with the existing data tables and may not take too much time. I'd like to use it for both the MetaDIG and Metrics services.

mbjones avatar May 27 '24 16:05 mbjones

query to retrieve a table of a bunch of identifiers with a version series identifier (which is the identifier of the first version in the chain), the identifier, obsoletes, obsoleted by, and version number:

WITH RECURSIVE children(versionGroup, guid, obsoletes, obsoleted_by, version) AS (
  SELECT sm.guid AS versionGroup, sm.guid, sm.obsoletes, sm.obsoleted_by, 0 AS version
  FROM systemmetadata sm
  -- depending on your use case you may want to modify this section to select your initial set of pids
  WHERE sm.object_format LIKE '%eml%'
  OR sm.object_format LIKE '%iso%'
  -- end special query section
  AND sm.obsoletes IS NULL

  UNION ALL

  SELECT c.versionGroup, sm.guid, sm.obsoletes, sm.obsoleted_by, c.version + 1
  FROM systemmetadata sm
  INNER JOIN children c ON sm.obsoletes = c.guid
)
SELECT * FROM children ORDER by versionGroup, version;

as a bonus, you can run this from outside a pod and surround it in a copy to get it into a file locally. example:

kubectl exec -it metacatarctic-postgresql-0 -- psql metacat -U metacat -c "\copy (
WITH RECURSIVE children(versionGroup, guid, obsoletes, obsoleted_by, version) AS (
  SELECT sm.guid AS versionGroup, sm.guid, sm.obsoletes, sm.obsoleted_by, 0 AS version
  FROM systemmetadata sm
  WHERE sm.object_format LIKE '%eml%'
  OR sm.object_format LIKE '%iso%'
  AND sm.obsoletes IS NULL

  UNION ALL

  SELECT c.versionGroup, sm.guid, sm.obsoletes, sm.obsoleted_by, c.version + 1
  FROM systemmetadata sm
  INNER JOIN children c ON sm.obsoletes = c.guid
)
SELECT * FROM children ORDER by versionGroup, version
) TO STDOUT WITH CSV HEADER" > ./chains.csv

jeanetteclark avatar Aug 21 '25 23:08 jeanetteclark

Thanks @jeanetteclark After our discussion on this, I gave it one more run and found a small bug in the logic, where missing parens from the anchor query were causing duplicated results. Here's a modified version that makes the SQL logic a bit clearer, and makes it easier to modify the conditions in the anchor query:

WITH RECURSIVE children(versionGroup, guid, obsoletes, obsoleted_by, version) AS (
  SELECT sm.guid AS versionGroup, sm.guid, sm.obsoletes, sm.obsoleted_by, 0 AS version
    FROM systemmetadata sm
    WHERE sm.obsoletes IS NULL
    -- depending on your use case you may want to modify this section to select your initial set of pids
    AND (sm.object_format LIKE '%eml%' OR sm.object_format LIKE '%iso%')
    AND sm.guid LIKE 'urn:uuid:fbb%'
    -- end special query section

  UNION ALL

  SELECT c.versionGroup, sm.guid, sm.obsoletes, sm.obsoleted_by, c.version + 1
    FROM systemmetadata sm
    INNER JOIN children c ON sm.obsoletes = c.guid
)
SELECT * FROM children ORDER by versionGroup, version;

This is an exciting query, because it now sets us up to build the API we described above. I think its trivially modified to hit all of those use cases, and could implement the API functions in issue comment https://github.com/NCEAS/metacat/issues/1429#issuecomment-689690984 above, namely these two new API methods on Metacat:

  • listVersions(id, range="prior|next", count=5) : VersionList
  • getVersion(id, position="first|prior|next|last") : Identifier

For reference, the above query gives the following results on test:

                 versiongroup                  |                     guid                      |                   obsoletes                   |                 obsoleted_by                  | version
-----------------------------------------------+-----------------------------------------------+-----------------------------------------------+-----------------------------------------------+---------
 urn:uuid:fbb16c32-baaf-4c3a-ad9d-44f53cd3f04e | urn:uuid:fbb16c32-baaf-4c3a-ad9d-44f53cd3f04e |                                               | urn:uuid:662a972b-19a6-44c4-bc1e-828c9d051ede |       0
 urn:uuid:fbb16c32-baaf-4c3a-ad9d-44f53cd3f04e | urn:uuid:662a972b-19a6-44c4-bc1e-828c9d051ede | urn:uuid:fbb16c32-baaf-4c3a-ad9d-44f53cd3f04e |                                               |       1
 urn:uuid:fbb1da0e-ddcc-440c-8e66-6a228553e310 | urn:uuid:fbb1da0e-ddcc-440c-8e66-6a228553e310 |                                               |                                               |       0
 urn:uuid:fbbb046c-8bcb-431b-a7d4-1841d24b0bbd | urn:uuid:fbbb046c-8bcb-431b-a7d4-1841d24b0bbd |                                               |                                               |       0
 urn:uuid:fbbb6921-4a80-4ba8-8e4e-86f7824881a8 | urn:uuid:fbbb6921-4a80-4ba8-8e4e-86f7824881a8 |                                               |                                               |       0
 urn:uuid:fbbc4074-afb7-40a0-a0ae-47a5a7a6db7f | urn:uuid:fbbc4074-afb7-40a0-a0ae-47a5a7a6db7f |                                               |                                               |       0
 urn:uuid:fbbe9e57-622d-4804-8c98-5673614660dc | urn:uuid:fbbe9e57-622d-4804-8c98-5673614660dc |                                               | doi:10.18739/A2K929                           |       0
 urn:uuid:fbbe9e57-622d-4804-8c98-5673614660dc | doi:10.18739/A2K929                           | urn:uuid:fbbe9e57-622d-4804-8c98-5673614660dc |                                               |       1
 urn:uuid:fbbf0327-f914-4b80-b7f0-dc08acd94f09 | urn:uuid:fbbf0327-f914-4b80-b7f0-dc08acd94f09 |                                               | urn:uuid:4e4eb469-50aa-4d5c-a0ec-46921bfd634b |       0
 urn:uuid:fbbf0327-f914-4b80-b7f0-dc08acd94f09 | urn:uuid:4e4eb469-50aa-4d5c-a0ec-46921bfd634b | urn:uuid:fbbf0327-f914-4b80-b7f0-dc08acd94f09 |                                               |       1
 urn:uuid:fbbf5ce9-2f52-40cd-805a-fde599e31796 | urn:uuid:fbbf5ce9-2f52-40cd-805a-fde599e31796 |                                               |                                               |       0

mbjones avatar Aug 22 '25 00:08 mbjones

And one last edit to improve naming (versions rather than children) and improve column order to show that the PK for the table is guid and that versionGroup and version are the key properties. Now we have:

WITH RECURSIVE versions(guid, versionGroup, version, obsoletes, obsoleted_by) AS (
  SELECT sm.guid, sm.guid AS versionGroup, 0, sm.obsoletes, sm.obsoleted_by AS version
    FROM systemmetadata sm
    WHERE sm.obsoletes IS NULL
    -- depending on your use case you may want to modify this section to select your initial set of pids
    AND (sm.object_format LIKE '%eml%' OR sm.object_format LIKE '%iso%')
    AND sm.guid LIKE 'urn:uuid:fbb%'
    -- end special query section

  UNION ALL

  SELECT sm.guid, v.versionGroup, v.version + 1, sm.obsoletes, sm.obsoleted_by
    FROM systemmetadata sm
    INNER JOIN versions v ON sm.obsoletes = v.guid
)
SELECT * FROM versions ORDER by versionGroup, version;
                     guid                      |                 versiongroup                  | version |                   obsoletes                   |                 obsoleted_by
-----------------------------------------------+-----------------------------------------------+---------+-----------------------------------------------+-----------------------------------------------
 urn:uuid:fbb16c32-baaf-4c3a-ad9d-44f53cd3f04e | urn:uuid:fbb16c32-baaf-4c3a-ad9d-44f53cd3f04e |       0 |                                               | urn:uuid:662a972b-19a6-44c4-bc1e-828c9d051ede
 urn:uuid:662a972b-19a6-44c4-bc1e-828c9d051ede | urn:uuid:fbb16c32-baaf-4c3a-ad9d-44f53cd3f04e |       1 | urn:uuid:fbb16c32-baaf-4c3a-ad9d-44f53cd3f04e |
 urn:uuid:fbb1da0e-ddcc-440c-8e66-6a228553e310 | urn:uuid:fbb1da0e-ddcc-440c-8e66-6a228553e310 |       0 |                                               |
 urn:uuid:fbbb046c-8bcb-431b-a7d4-1841d24b0bbd | urn:uuid:fbbb046c-8bcb-431b-a7d4-1841d24b0bbd |       0 |                                               |
 urn:uuid:fbbb6921-4a80-4ba8-8e4e-86f7824881a8 | urn:uuid:fbbb6921-4a80-4ba8-8e4e-86f7824881a8 |       0 |                                               |
 urn:uuid:fbbc4074-afb7-40a0-a0ae-47a5a7a6db7f | urn:uuid:fbbc4074-afb7-40a0-a0ae-47a5a7a6db7f |       0 |                                               |
 urn:uuid:fbbe9e57-622d-4804-8c98-5673614660dc | urn:uuid:fbbe9e57-622d-4804-8c98-5673614660dc |       0 |                                               | doi:10.18739/A2K929
 doi:10.18739/A2K929                           | urn:uuid:fbbe9e57-622d-4804-8c98-5673614660dc |       1 | urn:uuid:fbbe9e57-622d-4804-8c98-5673614660dc |
 urn:uuid:fbbf0327-f914-4b80-b7f0-dc08acd94f09 | urn:uuid:fbbf0327-f914-4b80-b7f0-dc08acd94f09 |       0 |                                               | urn:uuid:4e4eb469-50aa-4d5c-a0ec-46921bfd634b
 urn:uuid:4e4eb469-50aa-4d5c-a0ec-46921bfd634b | urn:uuid:fbbf0327-f914-4b80-b7f0-dc08acd94f09 |       1 | urn:uuid:fbbf0327-f914-4b80-b7f0-dc08acd94f09 |
 urn:uuid:fbbf5ce9-2f52-40cd-805a-fde599e31796 | urn:uuid:fbbf5ce9-2f52-40cd-805a-fde599e31796 |       0 |                                               |

mbjones avatar Aug 22 '25 00:08 mbjones

I'm not totally sure it satisfies the prior in:

    listVersions(id, range="prior|next", count=5) : VersionList
    getVersion(id, position="first|prior|next|last") : Identifier

since it only looks at obsoletes and not obsoletedBy. We might need to do the recursion twice? Once forwards, once backwards, if we don't know that we are starting at the beginning of a version chain. In this query we do know we are since the initial query specifies WHERE sm.obsoletes IS NULL. Maybe I'm wrong about this though.

jeanetteclark avatar Aug 22 '25 02:08 jeanetteclark

yes, good point. but I think a simple variant on the query can pull the full version chain for any given PID. It may require two queries that get UNIONED if the requested PID is in the middle of the chain, and maybe we can just drop the NULL check when we're processing a specific PID.

mbjones avatar Aug 22 '25 16:08 mbjones