icu4x
icu4x copied to clipboard
preserve stucture of timezone designation list
The intention of this PR is to maintain the structure of the binary data in the DataBlock struct. The current code produces a list of timezone designations that has the same length as the list of local time records with the i-th timezone designation corresponding to the i-th local time record. In this context the idx field of the local time record does not correspond to a position in the timezone designation list.
In this alternative implementation the timezone designation list directly corresponds to the null separated list of designations in the binary data. Since the idx field of each local time record can refer anywhere in that string, even in the middle of a designation, a helper method is added so that the correct substring is returned.
This allows the DataBlock to capture the data exactly and also potentially re-encode them without loss.
r? @sffc @nekevss
@nordzilla Can you take a look? It is in the tzif crate.
Perhaps @petrosagg could share more about their use case, or the reason why they want this.
Yeah, definitely! I'm porting porting a medium size C datetime parsing library in Rust piece by piece. I'm currently attempting to replace the code that loads a tzif file into memory which is what this library is aiming to do as well. In order to ensure that I'm not accidentally altering the behavior of the library I'm writing fuzzing tests that compare the two implementations. However, for that to work correctly I want to be able to load the tzif file in an identical way.
The unrolling of timezone designations that happens in this library prevents me from doing so since some information is lost during parsing about how the timezone list was laid out. So this PR is maintaining the ability of determining the correct timezone designation in all the edge cases that we discussed in the other comment while also preserving more of the original structure of the file.
An additional benefit, which I don't need right now but I find beneficial, is that when a tzif file is parsed with the current code it is no longer in a form that can be serialized back into disk in a correct way. In order to do so all the index fields of each timezone designation would have to be re-written and the list of designations would also have to be deduplicated. This complication can be avoided if we keep the original structure as done in this PR, which turns serialization back into a tzif file into a simple process of writing out the fields one by one.
Hey @petrosagg
Thanks for all the further explanation and discussion!
I had misinterpreted your code in an attempt to finish this review quickly before the end of my day.
I apologize for that!
I do believe that the code/implementation is correct.
After thinking about it more last night, I was going to suggest that we could just remove the idx property and have the local time type records be parsed to just directly contain their string.
But now it's more clear that there is value for you in being able to round-trip the parsed data back into the original format (which is not a requirement for ICU4X's use case).
I have no strong desire to keep the old code if it is more useful to the Rust community to have it structured like this. Using it for ICU4X data generation would only be slightly different.
I'm happy to accept this code if it's fine with @sffc. I haven't been an active maintainer/contributor to ICU4X for a while, so I feel better delegating that final decision.
This would be a breaking change to the tzif crate and would require a version bump.
In terms of further review, I would like to see a slightly more detailed explanation of the helper function, more than just the one-liner: /// Retrieves the timezone designation at index idx.
It would be helpful to mention what the original format looks like, that the vector of strings are preserved in the original order, and that is why subtracting from the idx value makes sense given the original format that was delimited with null terminators.
That will help prevent people like me from misinterpreting the function in the future :sweat_smile: