Stockfish
Stockfish copied to clipboard
Add some meta-data in the generated nn files
Based on a discussion started https://github.com/glinscott/fishtest/issues/732#issuecomment-668523399
It might be an idea to have some meta-data in the data files (nn.bin, or nn-sha.nnue) that give information on the net, and could be parsed by the chess engine / database etc.
Fields could be author: date: description:
This might need some careful thinking, and maybe we can learn from other data formats
Thank you for creating the issue. I'm thinking if it is good to add those meta-data at the end of net files. My concerns are if this breaks the loading on old binaries, and if this format is extensible. I will study other data formats.
concerning loading older nets.. at least based on just the stockfish perspective, I think this is not so much of a problem. This is still 'eary days', so this kind of changes should still be OK (and is one of the reasons why I want to link SF binary, and recommended net so tightly, i.e. allow for this kind of changes). Obviously, if things can be backwards compatible without much overhead, that's worth considering.
ok. I will think how we can realize the file format which is backwards compatible.
I investigated the source code. The current net file reader checks if a stream reached the end of the stream after the network parameters are read. If we add additional data at the end of the file, reading a net file always fails.
A net file contains a version field in the header. We can add meta-data at the end of the file if we introduce a new version number. We will change the reader to switch if we read the meta-data or not according to the version number.
The current file format is:
| field | type | num bytes |
|---|---|---|
| version | uint32_t | 4 |
| hash_value | uint32_t | 4 |
| size | uint32_t | 4 |
| architecture | char | size |
| feature_transformer | - | - |
| network | - | - |
The type and size of the feature_transformer and network are hard coded in the network implementation.
If we add meta-data, the format will be like.
| field | type | num bytes |
|---|---|---|
| version | uint32_t | 4 |
| hash_value | uint32_t | 4 |
| size | uint32_t | 4 |
| architecture | char | size |
| feature_transformer | - | - |
| network | - | - |
| meta-data size | uint32_t | 4 |
| meta-data | char | meta-data size |
And the meta-data will be a text like:
author foo bar
date 2004-04-01T12:00+09:00
description something something.
The implementation of meta-data reader will be like:
std::istringstream iss(metadata_string);
std::map<string, string> metadata;
std::string key;
while (iss >> key) {
iss.ignore();
std::string value;
std::getline(iss, value);
metadata[key] = value;
}
The current format will be backward compatible. But the suggested format is not forward compatible.
@vondele Could you please review the format, and let me know if there are concerns?
looks good, but I'm wondering if we should go for a 'free format / free size', or if it is better to introduce some restrictions.
E.g. should we use/require ascii or something else (even with a simple date, we once had problems in fishtest: https://groups.google.com/g/fishcooking/c/SOpgUxPYZa0 )
Should we restrict the value to the first 128/256 characters of the metadata (metadata[key] = value.substr(0,128)) to avoid long/malicious entries?
Probably we could also list the default keys (author, date, description sounds like a good start).
E.g. should we use/require ascii or something else (even with a simple date, we once had problems in fishtest: https://groups.google.com/g/fishcooking/c/SOpgUxPYZa0 )
It sounds reasonable. Let's requisre ascii. It could be better to restrict the date and time format to ISO 8601.
Should we restrict the value to the first 128/256 characters of the metadata (metadata[key] = value.substr(0,128)) to avoid long/malicious entries?
It also sounds reasonable. Let's restrict the value to the first 128 characters of the metadata.
Probably we could also list the default keys (author, date, description sounds like a good start).
I agree. "author", "date" and "description" should be the default keys.
I will send a pull request to official Stockfish before I push them to the master branch in my repository to avoid issues that the official Stockfish can not read the new net format.
The description field is quite flexible and supports longer strings, so I think we can repurpose it for this with a loose format.