Give croissant blobs a version?
Datasets have versions, but the croissants that describe them do not.
This makes it a little dangerous to improve a croissant over time -- perhaps adding more descriptive fields -- for a dataset which has not changed.
Other times the croissant might change because of the addition of new files to the dataset, specifically an additional index. We've done this several times in the past at Common Crawl, and plan to do it again.
What's the best way to proceed?
cc @handecelikkanat
I think you are describing three types of changes:
1. Change to the metadata of a dataset (e.g., new descriptive field)
2. Change to the data representation of a dataset (e.g., new index file)
3. (Implicitly) Change to the actual data contained in the dataset (e.g., new crawl)
It sounds like you are considering mostly 3) to be an actual version change of the dataset, and are looking for a mechanism to describe 1) and 2).
I am actually thinking of all three as changes to the dataset, which warrant a version change. Perhaps SemVer can be used to introduce some hierarchy on changes. Does that make sense to you?
As a side note, we have the "isLiveDataset" mechanism to describe datasets where the data changes rapidly but we don't want to be constantly creating new versions.
> It sounds like you are considering mostly 3) to be an actual version change of the dataset, and are looking for a mechanism to describe 1) and 2).
Correct.
> I am actually thinking of all three as changes to the dataset, which warrant a version change.
@wumpus can comment on this better, but I think we want to be clear that our data is not changing. Crawls are guaranteed to be fixed once they are published, so people can rely on them always staying the same once they appear.
So we want to mark the change explicitly, either in the croissant (e.g., because a new croissant version is available, (1) in your case) or in the descriptive data (if we publish new indices for an existing crawl, (2) in your case).
I am concerned that SemVer might confuse people into thinking our crawls (e.g., the June crawl) might have more than one version around. Wouldn't it be a bit counter-intuitive for people to work out that June Crawl 1.0.2 means only a new croissant version, and not updated crawl data?
Personally I'd prefer to mark croissant/metadata/index etc. changes separately. But Greg would have a better opinion about it.
SemVer has a "build metadata" suffix that we can use -- it looks like 1.0.0+1.0.2, which means the version is 1.0.0 and the build is 1.0.2. We would keep 1.0.0 constant to represent the data not changing and increment the build 1.0.2 as we improve the croissant.
We are going to use this semver feature in our Croissants. @benjelloun I think it would be useful to mention this semver feature in the 1.1 🥐 spec
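For illustration, here is a minimal Python sketch of how a consumer might read such a version string. The helper names `split_version` and `same_data` are hypothetical, and the pattern is deliberately simplified (no pre-release tags). One thing to keep in mind: SemVer 2.0.0 says build metadata is ignored when determining precedence, so tooling that compares versions strictly by the spec would treat 1.0.0+1.0.2 and 1.0.0+1.0.3 as equal.

```python
import re
from typing import Optional, Tuple

# Simplified pattern: MAJOR.MINOR.PATCH with an optional "+build" suffix.
# Pre-release tags such as "-alpha.1" are left out of this sketch on purpose.
_VERSION = re.compile(r"(\d+)\.(\d+)\.(\d+)(?:\+([0-9A-Za-z.-]+))?")


def split_version(version: str) -> Tuple[Tuple[int, int, int], Optional[str]]:
    """Return ((major, minor, patch), build) where build may be None."""
    match = _VERSION.fullmatch(version)
    if match is None:
        raise ValueError(f"not a supported version string: {version!r}")
    major, minor, patch, build = match.groups()
    return (int(major), int(minor), int(patch)), build


def same_data(a: str, b: str) -> bool:
    """True when two version strings share the same core (data) version."""
    return split_version(a)[0] == split_version(b)[0]
```

Under this reading, `same_data("1.0.0+1.0.2", "1.0.0+1.0.3")` is true: only the croissant changed, not the crawl.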
I like distinguishing between the dataset version and the metadata version.
There is something similar happening in schema.org around licenses, and a few other fields. The license property is the license of the dataset, which generally means the data. There is also an sdLicense property, which is about the license for the metadata. The same mechanism is used to define sdDatePublished and sdPublisher.
Following this approach, we could define an sdVersion property (initially in the Croissant vocabulary, but that could be adopted by schema.org down the line), and use that to hold the version of the metadata.
So in your example, instead of having "version"="1.0.0+1.0.2" we would have "version"="1.0.0", "sdVersion"="1.0.2". This also gives you freedom to use a different versioning approach for sdVersion if that makes sense.
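As a rough sketch of what that migration could look like, the Python snippet below moves the build part of a combined version string into a separate `sdVersion` key. The `sdVersion` property is only the proposal from this thread, not an existing Croissant or schema.org field, and the function name and the `example-crawl` record are made up for illustration.

```python
def split_metadata_version(record: dict) -> dict:
    """Copy a metadata record, moving any "+build" suffix into "sdVersion"."""
    updated = dict(record)
    # partition() returns (before, separator, after); separator is "" if absent.
    data_version, plus, metadata_version = record["version"].partition("+")
    updated["version"] = data_version
    if plus:
        # "sdVersion" is the proposed (not yet standardized) property name.
        updated["sdVersion"] = metadata_version
    return updated


# Hypothetical record using the combined form from the example above.
record = {"name": "example-crawl", "version": "1.0.0+1.0.2"}
converted = split_metadata_version(record)
```

Here `converted` carries `"version": "1.0.0"` and `"sdVersion": "1.0.2"`, while a record without a `+` suffix is returned unchanged.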
How does this sound? If this approach makes sense to you, I can introduce it in the 1.1 Croissant spec, since it's a pretty limited change.
Sounds great! Thank you.