
Import legacy git proposals into tstore.

Open lukebp opened this issue 3 years ago • 4 comments

The proposals that were submitted to the git backend are currently hosted on proposals-archive.decred.org, while all new proposals are hosted on proposals.decred.org. This fragmentation makes older proposals harder to find.

The legacy proposals are currently hard coded into the gui and displayed in the list views, but this approach is suboptimal since the proposals are not included in politeia functionality such as searching for proposals by user ID. A better approach would be to import the git backend proposals directly into the tstore backend.

There are two main issues with importing the git backend proposals into the tstore backend.

  • It breaks the git backend timestamps. The git backend timestamps the git commit hash. You can obtain the timestamp data and the hash of a proposal file fairly easily, but there is no easy way of proving the file hash is included in the timestamp. In order to keep the timestamps coherent, you must take the git repo as a single entity that can't be pulled apart.
  • A git backend proposal is very different than a tstore backend proposal. The proposal markdown file is the same, but all of the metadata that accompanies the proposal is different due to the changes in the plugin architecture. You can't import the legacy metadata files into the new backend, which means providing the original hash and timestamp will be pointless since the original hash won't match what is imported into the backend.

Solution

Import the legacy git backend proposals using the format required by the tstore backend while also keeping the proposals-archive site up. There would be an additional LegacyToken field in the proposal metadata. When set, the gui will indicate that the proposal is a legacy proposal and you must go to the [proposals-archive link] if you want to see the proposal in its original form with valid timestamps.

This would solve the UX issues of legacy proposals not showing up on the proposals.decred.org site while also sidestepping the timestamp and incompatible format issues.

Implementation

  • Add a LegacyToken field to the proposal metadata.
type ProposalMetadata struct {
	Name string `json:"name"` // Proposal name

	// LegacyToken will only be populated if the proposal is a legacy
	// proposal that was submitted to the git backend.
	LegacyToken string `json:"legacytoken,omitempty"`
}
  • Write a tool that formats the git backend proposals into the tstore format and submits them to politeiad. Have all of the legacy proposals hardcoded into the tool.
  • Use the tool to import the legacy git backend proposals into the tstore backend.

lukebp, Jun 05 '21 19:06

politeia changes

  • [ ] Add LegacyToken to ProposalMetadata.
  • [ ] Prevent LegacyToken from being filled in on normal proposal submissions.
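
A minimal sketch of the second checkbox, assuming the check lives in whatever code validates a normal proposal submission. The `validateNewSubmission` function and the `errLegacyTokenNotAllowed` error are hypothetical names used only for illustration; only the `ProposalMetadata` struct comes from this issue.

```go
package main

import (
	"errors"
	"fmt"
)

// ProposalMetadata mirrors the struct proposed in this issue.
type ProposalMetadata struct {
	Name string `json:"name"` // Proposal name

	// LegacyToken will only be populated if the proposal is a legacy
	// proposal that was submitted to the git backend.
	LegacyToken string `json:"legacytoken,omitempty"`
}

// errLegacyTokenNotAllowed is a hypothetical error value, not an existing
// politeia error code.
var errLegacyTokenNotAllowed = errors.New("legacytoken must not be set on a new proposal submission")

// validateNewSubmission rejects metadata from a normal submission that
// tries to set LegacyToken. The import tool would bypass this check.
func validateNewSubmission(pm ProposalMetadata) error {
	if pm.LegacyToken != "" {
		return errLegacyTokenNotAllowed
	}
	return nil
}

func main() {
	fmt.Println(validateNewSubmission(ProposalMetadata{Name: "New proposal"}))
	fmt.Println(validateNewSubmission(ProposalMetadata{Name: "Spoofed", LegacyToken: "abc123"}))
}
```

Since the import tool writes to tstore directly, a check like this would only ever run on the normal submission path.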

politeia legacy import tool

Here are some additional details on what is likely the easiest way to accomplish this task.

  • Rather than hardcoding all of the proposal data, I think it would be easier to walk the mainnet git repo, then parse and convert the data. The politeiad/cmd/politeiaimport tool already has some of the code required to do this.
  • The tool should not use the politeiad or backend API. It should initialize a tstore instance directly and use the tstore API.
  • The legacy metadata streams will need to be converted to the data structures used by the current plugins (usermd, comments, ticketvote). This may not be a simple 1-to-1 conversion since the metadata structures might have changed in the tstore upgrade. We can deal with these issues as they arise.
  • The legacy comment and vote journals will need to be walked and parsed. The code to replay the journals can be found in the gitbe implementations that existed in the politeia repo prior to the v1.0.0 release.
  • The recordmetadata.json and the proposal files should be a simple 1-to-1 conversion. I don't think the record metadata structure changed in the tstore update.
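
To give a rough idea of what replaying a journal involves, here is a sketch under heavy assumptions. The `journalEntry` layout (one JSON object per line with `action`, `commentid`, and `comment` fields) and the `"add"`/`"del"` action names are hypothetical; the real format has to be taken from the pre-v1.0.0 gitbe code.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// journalEntry is a hypothetical, simplified journal line. It is NOT the
// real gitbe journal format.
type journalEntry struct {
	Action    string `json:"action"` // "add" or "del" (assumed)
	CommentID string `json:"commentid"`
	Comment   string `json:"comment,omitempty"`
}

// replayJournal applies every entry in order and returns the final state
// keyed by comment ID. A deleted comment is kept with an empty body so
// that replies can still reference it.
func replayJournal(journal string) (map[string]string, error) {
	state := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(journal))
	line := 0
	for sc.Scan() {
		line++
		text := strings.TrimSpace(sc.Text())
		if text == "" {
			continue
		}
		var e journalEntry
		if err := json.Unmarshal([]byte(text), &e); err != nil {
			return nil, fmt.Errorf("line %d: %w", line, err)
		}
		switch e.Action {
		case "add":
			state[e.CommentID] = e.Comment
		case "del":
			if _, ok := state[e.CommentID]; !ok {
				return nil, fmt.Errorf("line %d: delete of unknown comment %q", line, e.CommentID)
			}
			state[e.CommentID] = ""
		default:
			return nil, fmt.Errorf("line %d: unknown action %q", line, e.Action)
		}
	}
	return state, sc.Err()
}

func main() {
	journal := `{"action":"add","commentid":"1","comment":"first"}
{"action":"add","commentid":"2","comment":"reply"}
{"action":"del","commentid":"1"}`
	state, err := replayJournal(journal)
	if err != nil {
		fmt.Println("replay failed:", err)
		return
	}
	fmt.Println(len(state), "comments after replay")
}
```

The replayed state is what would then be converted into the current plugin structures and written through the tstore API.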

This tool will only need to be used once. Once the legacy data has been migrated, the tool can be deleted from the politeia repo. Since this is a one-off tool, you can do things the quick and dirty way. The code should still be clean and readable, but things like hardcoding certain values or writing local structs to decode data into are fine.

lukebp, Jul 21 '21 16:07

As expected, this has turned out to be quite a difficult and complex task.

One of the main issues that we're encountering is the fact that the record token will not be the same. The tlog backend derives the record token from the tlog tree ID. This tree ID is a random int64 that is set by tlog on tree creation. We do not have the ability to set custom tree IDs, which means that legacy proposals will be assigned new tokens when they're imported into the tlog backend.
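
For readers unfamiliar with the token derivation, the relationship can be sketched roughly as follows. The 8-byte length and the little-endian byte order are assumptions made for illustration and should be checked against the tstore code; the function names are likewise illustrative.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// tokenFromTreeID encodes a tree ID as a hex record token.
// Assumption: 8 bytes, little-endian.
func tokenFromTreeID(treeID int64) string {
	b := make([]byte, 8)
	binary.LittleEndian.PutUint64(b, uint64(treeID))
	return hex.EncodeToString(b)
}

// treeIDFromToken reverses tokenFromTreeID. A token that does not decode
// to exactly 8 bytes cannot correspond to a tree ID under this scheme.
func treeIDFromToken(token string) (int64, error) {
	b, err := hex.DecodeString(token)
	if err != nil {
		return 0, err
	}
	if len(b) != 8 {
		return 0, fmt.Errorf("token must be 8 bytes, got %d", len(b))
	}
	return int64(binary.LittleEndian.Uint64(b)), nil
}

func main() {
	token := tokenFromTreeID(1234567890123)
	id, err := treeIDFromToken(token)
	fmt.Println(token, id, err)
}
```

Because the token is just another encoding of the tree ID, there is no room to carry a legacy token through unchanged.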

This is problematic because the token is part of the message that clients sign when submitting data like comments and votes. This leaves us with a decision to make. We can either:

  1. Keep the token fields in the data set to the legacy tokens so that the signatures remain coherent, but at the cost of breaking various parts of the backend. The backend assumes that the record token and the tree ID reference the same underlying bits, just encoded differently (int64 for tree IDs, hex for record tokens). Using the legacy token in the token field of the data breaks this assumption, which will cause various parts of the politeia and politeiagui code to break. The scope of this problem is somewhat limited though since we only need to worry about code that retrieves data, not code that writes.

  2. Update the token fields of legacy data to match the token derived from the tlog tree ID in order to not break the backend code, but at the cost of breaking all of the client signatures. This is problematic since signature validation is a standard part of both the backend and client side code when retrieving data. If we went this route we would need to insert the legacy proposals into tlog, compile a list of the tlog tokens that correspond to legacy proposals, hardcode the list into both politeia and politeiagui, then update the code to skip signature validation for any tokens in this list.
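
The trade-off between the two options comes down to the token being part of the signed message. A small sketch of why option 2 breaks signatures, assuming ed25519 signatures and a comment message built as token + parent ID + comment (the exact message layout here is an assumption, and both function names are illustrative):

```go
package main

import (
	"crypto/ed25519"
	"fmt"
)

// commentMessage builds the message a client signs. The field order is an
// assumption made for illustration.
func commentMessage(token, parentID, comment string) []byte {
	return []byte(token + parentID + comment)
}

// signatureSurvivesTokenChange signs a comment under signedToken with a
// throwaway key, then reports whether the signature still verifies when
// the stored data carries storedToken instead.
func signatureSurvivesTokenChange(signedToken, storedToken string) (bool, error) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil uses crypto/rand
	if err != nil {
		return false, err
	}
	sig := ed25519.Sign(priv, commentMessage(signedToken, "0", "I support this"))
	return ed25519.Verify(pub, commentMessage(storedToken, "0", "I support this"), sig), nil
}

func main() {
	same, _ := signatureSurvivesTokenChange("legacytoken", "legacytoken")
	swapped, _ := signatureSurvivesTokenChange("legacytoken", "newtlogtoken")
	fmt.Println("option 1 (token kept):", same)
	fmt.Println("option 2 (token rewritten):", swapped)
}
```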

There are also instances where the data format, and thus the message being signed, changed between the git backend and tlog backend. In these cases, even if you use the legacy token the client signature will still be invalid because of the data format changes. The StartVote structure is one such instance.

We decided to go with option 1. There will be various bugs and edge cases that will need to be fixed, but since this is only for reads and not writes, the impact of such bugs will be minimal and they can be fixed as they are found. If we went with option 2, hardcoding in everything required to skip the signature validation checks for these legacy proposals would be just as much of a headache, if not more. Unfortunately, we will still need to hardcode in the signature validation skips for the small number of invalid signatures that will still be present due to data format changes, like with the StartVote. There's not really much we can do to get around that for now.
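
A hardcoded skip could look something like the sketch below. The `skipStartVoteSig` map, its placeholder entry, and the `verifyStartVote` function are all hypothetical; the real list would hold the tokens of the affected legacy proposals.

```go
package main

import (
	"crypto/ed25519"
	"fmt"
)

// skipStartVoteSig lists record tokens whose StartVote signature is known
// to be invalid because of the data format change. The entry below is a
// placeholder, not a real token.
var skipStartVoteSig = map[string]struct{}{
	"placeholder-legacy-token": {},
}

// verifyStartVote reports whether the signature should be treated as
// valid. Tokens in the skip list bypass verification entirely.
func verifyStartVote(token string, pub ed25519.PublicKey, msg, sig []byte) bool {
	if _, ok := skipStartVoteSig[token]; ok {
		return true
	}
	// ed25519.Verify panics on a wrong-sized key, so guard against it.
	if len(pub) != ed25519.PublicKeySize {
		return false
	}
	return ed25519.Verify(pub, msg, sig)
}

func main() {
	pub, priv, err := ed25519.GenerateKey(nil) // nil uses crypto/rand
	if err != nil {
		fmt.Println("keygen failed:", err)
		return
	}
	msg := []byte("start vote payload")
	sig := ed25519.Sign(priv, msg)

	fmt.Println(verifyStartVote("some-token", pub, msg, sig))                  // normal path
	fmt.Println(verifyStartVote("placeholder-legacy-token", nil, nil, nil))    // skipped
	fmt.Println(verifyStartVote("some-token", pub, []byte("tampered"), sig))   // rejected
}
```

The same list would need to be mirrored in politeiagui wherever the client verifies these signatures.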

lukebp, Oct 18 '21 18:10

Another large challenge with this is the cast vote timestamps.

The git backend did not include the timestamp of when a vote was cast due to privacy concerns. The tlog backend does, since the timestamps are included in the tlog tree anyway and adding the timestamp to the cast vote struct makes it much easier for dcrdata to build their vote graphs.

In order to get the cast vote timestamp for the legacy votes, we'll need to pull the timestamp of the git commit from when the vote was added. These commits occurred every hour. This is how dcrdata built their vote graphs for the legacy proposal votes, so the code already exists to do this, but porting it over to this import tool and making sure it still works is another big pain point.
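
One piece of this can be sketched without touching dcrdata's code: turning `git log` output into commit timestamps. The sketch assumes the log was produced with `git log --format="%H %ct"` (commit hash and committer Unix time); `parseCommitTimes` is an illustrative name. Working out which commit first contained a given vote is the harder part and is not covered here.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseCommitTimes parses the output of `git log --format="%H %ct"` into
// a map of commit hash to committer time (UTC).
func parseCommitTimes(gitLog string) (map[string]time.Time, error) {
	times := make(map[string]time.Time)
	sc := bufio.NewScanner(strings.NewReader(gitLog))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) != 2 {
			return nil, fmt.Errorf("malformed line %q", sc.Text())
		}
		sec, err := strconv.ParseInt(fields[1], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("bad timestamp in %q: %w", sc.Text(), err)
		}
		times[fields[0]] = time.Unix(sec, 0).UTC()
	}
	return times, sc.Err()
}

func main() {
	out := "abc123 1600000000\ndef456 1600003600\n"
	times, err := parseCommitTimes(out)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for hash, ts := range times {
		fmt.Println(hash, ts.Unix())
	}
}
```

Since the commits occurred hourly, a timestamp recovered this way is only accurate to within about an hour of when the vote was actually cast.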

lukebp, Oct 19 '21 13:10

For documentation's sake: we found a bug in the votes cache of the legacy www api that makes the vote count returned from the api differ from the count obtained directly from the ballot journal. This will no longer be a concern once we complete the legacy import to tstore and then deprecate the legacy www api.

thi4go, Nov 01 '21 13:11