Versioning of FOCUS normalized data over long time periods?
Description
Hi folks, I can't remember whether we discussed this already, but how are we handling changes in the specification for data sets that may have multiple years of history?
For example:
If I start out using FOCUS today, I would normalize all my data to the V1 spec (so far so good). However, 9 months down the line FOCUS releases the V2 spec, and 2 years down the line we are using the V3 spec. Bearing in mind that I have been normalizing my cloud spend into a BI cube using the converters during this time, what is the expected behavior whenever we release a new version of the spec? (I can see a case for storing up to 7 years of data.)
Are we expecting folks to keep the full history of all the raw / pre-normalized source files and reprocess the whole data history to the latest spec? Or are we planning to introduce version identification on each row, indicating which version of the spec it complies with, so that the schema can change over time while older rows remain valid?
Proposed approach
This suggestion is to include a new REQUIRED / NOT NULL column in the spec that holds a version number / identifier indicating which version of the spec this billing line was generated against.
This would allow folks to adopt newer versions of the spec without needing to retain all the raw / source data files in order to reprocess their whole history into the latest version of the spec.
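A minimal sketch of the per-row idea, assuming a hypothetical column name `SpecVersion` (this issue does not define the actual column) and plain dictionaries standing in for billing rows. It stamps each row with the spec version at normalization time, then groups stored history by version so that each group can be interpreted, or converted, under the right schema.

```python
from collections import defaultdict

# Hypothetical column name; the real name is not defined in this issue.
VERSION_COLUMN = "SpecVersion"


def tag_rows(rows, spec_version):
    """Return copies of the rows with the spec version stamped on each one."""
    if not spec_version:
        # Mirrors the REQUIRED / NOT NULL intent of the proposal.
        raise ValueError("spec_version is required")
    return [{**row, VERSION_COLUMN: spec_version} for row in rows]


def group_by_version(rows):
    """Bucket stored rows by the spec version they were normalized against."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[VERSION_COLUMN]].append(row)
    return dict(buckets)


# Illustrative history: one row normalized under 1.0, one under 2.0.
history = tag_rows([{"BilledCost": 10.0}], "1.0") + tag_rows([{"BilledCost": 12.5}], "2.0")
by_version = group_by_version(history)
```

With the version carried on each row, mixed-version history can be queried without going back to the raw source files.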
GitHub issue or Reference
Spec-wide issue
Context
Issue Resources
- Discussion Documents: https://drive.google.com/drive/folders/1NF0FYaysl-Gigl5xzQsBsXx4d8SecxRM?usp=drive_link
I would not specify the version in the actual file. Rather, I would place the version in a manifest or schema document that would accompany a data drop from a provider.
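As a rough illustration of the manifest alternative, the sketch below builds a small JSON manifest for one data drop and looks up the spec version for a delivered file. The field names (`spec_version`, `files`) and the file name are assumptions made for this example only, not a proposed schema.

```python
import json


def build_manifest(spec_version, data_files):
    """Describe one data drop; the field names are illustrative only."""
    return {"spec_version": spec_version, "files": list(data_files)}


def version_for_file(manifest, filename):
    """Look up which spec version a delivered file was generated against."""
    if filename not in manifest["files"]:
        raise KeyError(f"{filename} is not part of this delivery")
    return manifest["spec_version"]


# Hypothetical delivery containing a single billing file.
manifest = build_manifest("1.0", ["billing-part-0001.csv"])
manifest_json = json.dumps(manifest, indent=2)
```

Here the version travels alongside the delivery rather than inside each row, so a consumer has to keep the manifest with the data for the version to stay recoverable later.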
There's value in adding a version column, given you'll have historical data that you need to know the version for when it's queried possibly years later. While we can easily add this at any point, I would personally like to see this added for 1.0 to make it easier to handle the differences in 1.0-preview and 1.0. The longer we wait, the harder it makes dealing with mixed versions.
Would it make sense to align with a versioning standard such as SemVer, and to distinguish minor, non-breaking changes from major, breaking changes accompanied by mapping rules?
I agree with @macko76. I suggest following the guidelines for Semantic Versioning. The official release of version 1.0 would only occur after final approval by the working group and ratification by the Steering Committee. Subsequently, the team would proceed with developing the next release, typically denoted v1.1. This implies that the modifications in v1.1 would maintain backward compatibility with v1.0. However, if the group decides on a change that lacks backward compatibility, the release should be labeled v2.0.
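The rule described above can be sketched as a small compatibility check. This is an assumption-laden illustration: it handles only plain `MAJOR.MINOR` strings (optionally prefixed with `v`), the function names are hypothetical, and pre-release labels such as `1.0-preview` are deliberately not supported.

```python
def parse_version(text):
    """Split 'MAJOR.MINOR' (optionally prefixed with 'v') into two integers."""
    major, minor = text.lstrip("vV").split(".")[:2]
    return int(major), int(minor)


def is_backward_compatible(old, new):
    """True when data written against `old` should remain valid under `new`.

    Follows the Semantic Versioning convention: same major version,
    and a minor version that is not lower.
    """
    old_major, old_minor = parse_version(old)
    new_major, new_minor = parse_version(new)
    return new_major == old_major and new_minor >= old_minor
```

For example, `is_backward_compatible("v1.0", "v1.1")` is true, while `is_backward_compatible("v1.1", "v2.0")` is false, which is the case where mapping rules or reprocessing would be needed.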
This issue was marked a P1 by TF-1 on May 21.
I support putting the version number in the manifest for each delivery.
If appending different versions of data, a practitioner should be able to add the column to show this, but generally we wouldn't recommend appending different versions before conversion to the same spec version.
Classified as Version Lifecycle by the Maintainers on the May 24 call.
Document moved to #397
Action Items:
- [ ] TF3-#332 Riley to draft a summary of the proposed guidelines and scenarios for schema drift management.
- [ ] Irina to prepare sample data with different versions based on the following spreadsheet: Usage vs Pricing quantity and unit examples
- [ ] All members must review and provide feedback on the sample dataset and proposed metadata versioning approach.
Action Items from the Members' call on Aug 15:
[Members-#332] Riley to finalize the PR and incorporate feedback from the team.
Spreadsheet: FOCUS v1.1 0 Dataset and Metadata. Irena provided a detailed explanation of the spreadsheet content.
Action Items from TF-3 Meeting on July 5th.
- [ ] TF-#332 Riley: Draft a summary of proposed guidelines and scenarios for schema drift management. Prepare a proposal for metadata versioning and include historical metadata support considerations.
- [ ] TF-#332 Joaquin: Communicate to the members the new Google folder to upload documents related to this issue: Google Folder: #332
Action Items from Maintainers call on Jul 22:
- [ ] Maintainers-#332: Riley to coordinate with maintainers and develop a detailed plan. Present the plan and gather feedback in the next TF3 meeting.
Action Items from the TF-3 call on Jul 26:
- [ ] TF3-#332 Riley to update the proposal with more detailed examples and clarify the requirements.
- [ ] TF3-#332 Udom and Riley to work together on structuring normative text in the metadata section. Team members will review the updated proposal and provide feedback, aiming for a finalized draft in three weeks.
Action items from the Maintainers' call on Aug 5th:
- [ ] [TF3-#332] Riley will draft JSON representation examples for metadata.
Action Items, TF-3 meeting on Aug 16:
- [ ] [TF3-#332-#514] Riley will merge the provider version PR into #514 and add examples for provider version changes.
- [ ] [TF3-#332-#514] A new schema object will be created for any structural changes to the data, ensuring accurate versioning.
- [ ] The team will continue refining the versioning strategy, ensuring clear communication of changes to end-users.