iefieldkit
iefieldkit copied to clipboard
iecodebook: data signature `.txt` file is a binary
Maybe this is intentional, but in that case, it should not have a .txt
file format.
Thanks for flagging, but I don't think this is a bug -- this is the intended functionality and I am aware of this behavior. The datasignature
file is created with a .txt
file extension so it syncs to git under our standard .gitignore
settings, and can then be used on another clone to verify data contents as intended. The contents of the file appear as below -- I do not know why this is read as a binary file (other than that it is declared as one in the header; it opens just fine in plaintext editor as shown). Tracking file level changes using git is not necessary here either so I did not pursue a workaround to make it function with git compare features.
If you agree, close issue; otherwise, my suggestion would be to produce it with the .dtasig
extension (edit: or .sig.txt
, although I don't know how "double" extensions work on other OS). However, users would then need to update all their .gitignore
files to include that extension and achieve the intended command functionality. I thought that was an unnecessary extra step when deciding on this approach to the feature, since it has effectively no technical cost other than the ugly aesthetics of this.

Interesting. I see this on Notepad
And this on VSC, where I can't open it
But it pointing out there there are changes in a binary file on Stata is enough for the intended workflow, then we can close this. But perhaps it's useful to explain in the help file what is the purpose of the -sig.txt
file?
What would you want to add beyond the helpfile description here and the write-up in the SJ paper that were just accepted? We have just finished a very long review process for this feature set, so I am a bit confused why this is coming up now 😅
signature This option requests that a datasignature be verified in the same destination folder as the
codebook and/or data are saved, and will return an error if a datasignature file is not found or
is different, guaranteeing data has not changed since the last reset of the signature.
One never knows when a bug appears. I do not think this is major bug (as in requiring a new release), but it should be addressed.
My 2 cent is this multi-staged rocket.
-
Do not allow any file extension other than the
.dtasig
intended. So if someone try to use the nameauto-sig.txt
make the nameauto-sig.txt.dtasig
. File extensions does not change what a file is, only what it tells the computer it should be. And to many software.txt
is raw text, but when the file is not raw text it does not want to interact with it as it could be a corrupted file that potentially could harm the computer. This way users can distinguish this binary file from raw text .txt files in their .gitignore in case they do not want any binary files in their repo. -
This is much more work, but would be the best thing. Use the
r()
values fromdatasignature
to write a real raw text file with the same information. That is whatiesave
does.
But since iesave
does that maybe it is ok to not do that here. And then users who wants to track this type of info use iesave
for this feature and the people who is ok that this is not tracked in git can use iecodebook
for this feature. But then we should not allow the file from iecodebook
to seem like a raw text file.
(edit: or .sig.txt, although I don't know how "double" extensions work on other OS)
The file extension is the text after the last .
. So file kristoffer.luiza.txt.ben
has the file extension .ben
. This does not change if the file is binary or raw text or anything. It just tells the computer it should be the file type .ben
but it says nothing about what .ben
is or should be.
But since this is just text that is a part of a file name then a language can make use of "double extensions". If you end your file names in Spark with .csv.gz
then Spark expect it to be a .gz
compressed file that includes a single .csv
file. Such a file can be read on the fly in Spark without decompress it first. But this is just a naming convention so Spark knows what to expect, which then again is what all file extensions to any computer are.
How does git treat files without an extension? I would rather do that if it works -- Stata does not seem to care about the extension here, but I don't think the datasignature confirm
command will work if we don't use its exact output.
Best,
Benjamin Daniels (he/him) On Feb 7, 2023 at 03:56 -0500, Kristoffer Bjärkefur @.***>, wrote:
(edit: or .sig.txt, although I don't know how "double" extensions work on other OS) The file extension is the text after the last .. So file kristoffer.luiza.txt.ben has the file extension .ben. This does not change if the file is binary or raw text or anything. It just tells the computer it should be the file type .ben but it says nothing about what .ben is or should be. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were assigned.Message ID: @.***>
Git can handle all files. Git ignore is just a string match (a simple version of regex or something) on the file name. So to ignore any file with extension .txt
you do *.txt
, and to ignore a file without extension you do *
but that is not great as it ignore any files with a name, i.e. all files.
No extension does not tell the computer anything about what the file is or should be. So that is almost as bad as the wrong extension in my mind.
What is your reason to not use .dtasig
?
My concern is that people will not have gitignore correctly configured if they use an allowlist approach like we recommend, since dtasig would need a new file extension. The purpose of this functionality is to sync this description of the data over git instead of the data itself. If that requires editing gitignore to work, then I worry the functionality will be lost for many users. That's all! I am not very concerned about what the computer thinks in this case, because the OS and user should never interact with this file -- it is only meant to be touched by Stata once synced.
Best,
Benjamin Daniels (he/him) On Feb 7, 2023 at 10:22 -0500, Kristoffer Bjärkefur @.***>, wrote:
Git can handle all files. Git ignore is just a string match (a simple version of regex or something) on the file name. So to ignore any file with extension .txt you do .txt, and to ignore a file without extension you do * but that is not great as it ignore any files with a name. No extension does not tell the computer anything about what the file is or should be. So that is almost as bad as the wrong extension in my mind. What is your opposition to using .dtasig? — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were assigned.Message ID: @.**>
Anyone not using a gitignore file will not ignore it. Most templates do not use allowlists so it would not ignore it. We can update our template that use allowlists to allow it (will only apply to new repos tho).
I think that is the proper long term solution. And whatever we save from not doing the proper long term solution I do not justify relying on a file without file extension. That is bad for the computer, but a file without extension also looks like a corrupted file to a human.
You can deduct my preference, but you are lead author so it is up to you.
I am all for updating the gitignore. We will always have to add new things to it anyway, it's not a static file.
If I see a .txt
file, I expect git to track changes to it well and to be able to see its content. maybe we already had this discussion and I forgot about it and got confused again. but if the intended format for this file is .dtasig
, I think we can stick to it.
Agreed that this does not warrant a new release right now, we can add it to the next one, whenever it comes. The help file is consistent with using .dtasig
too, so it seems like a very small change. But if you feel strongly that .txt
works best with the intended workflow for the command, then I we can leave it as is and just add a line on the helpfile saying that the data signature will be saved using a .txt
format by defaults instead of the standard .dtasig
to facilitate tracking using git.
add a line on the helpfile saying that the data signature will be saved using a .txt format by defaults instead of the standard .dtasig to facilitate tracking using git.
I prefer this most, but as I mentioned I am also happy to, for example, use .sig.txt
instead of -sig.txt
as it is now to indicate that it is not a standard text file -- that that's just for the string matching in tracking. I have seen this style in other applications although it's rather uncommon...