pkgi
Using YAML as data format
If the pkgi.txt file is ever meant to be edited by hand (which I assume it will be), then YAML seems like a good choice. Its main advantages are that it is much more readable, allows empty entries to be omitted, and doesn't add much file-size overhead. Here is a simple comparison:
CSV format (original):
UP2089-PCSE00582_00-ADVENTURETIMEPAK,0,Adventure Time,,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,http://zeus.dl.playstation.net/cdn/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/dummy.pkg,371557824,a5d40400375659b619391128745d0aa419dea15149b276cc696577dc76b329ac
EP0082-PCSB00975_00-ADVENTURESOFMANA,0,Adventures of Mana,,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,http://zeus.dl.playstation.net/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/dummy.pkg,532326688,387845c6100dcf12be914220e246cbf5c227c12c79d686f8231fc3d166c85f0f
JP0082-PCSG00759_00-SEIKENFFGAIDENRM,0,Adventures of Mana,聖剣伝説FF外伝,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,http://zeus.dl.playstation.net/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/dummy.pkg,375289280,566642fbfe8b4c9ac2f5690b001e61b7bca609a3a5ef94b22b04e8d19c30e0c4
YAML:
- contentid: UP2089-PCSE00582_00-ADVENTURETIMEPAK
  flags: 0
  name: Adventure Time
  zrif: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
  url: http://zeus.dl.playstation.net/cdn/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/dummy.pkg
  size: 371557824
  checksum: a5d40400375659b619391128745d0aa419dea15149b276cc696577dc76b329ac
- contentid: EP0082-PCSB00975_00-ADVENTURESOFMANA
  flags: 0
  name: Adventures of Mana
  zrif: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
  url: http://zeus.dl.playstation.net/cdn/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/dummy.pkg
  size: 532326688
  checksum: 387845c6100dcf12be914220e246cbf5c227c12c79d686f8231fc3d166c85f0f
- contentid: JP0082-PCSG00759_00-SEIKENFFGAIDENRM
  flags: 0
  name: Adventures of Mana
  name2: 聖剣伝説FF外伝
  zrif: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
  url: http://zeus.dl.playstation.net/cdn/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/dummy.pkg
  size: 375289280
  checksum: 566642fbfe8b4c9ac2f5690b001e61b7bca609a3a5ef94b22b04e8d19c30e0c4
Note how, with YAML, an item doesn't need to be included if it's empty (here, name2). You could also drop the flags entry when it is 0 (blank). Taking advantage of these tricks, we end up with a file whose size is very close to that of the original CSV (using input with about 900 entries):
$ du -b pkgi.yaml pkgi.csvoriginal
387951 pkgi.yaml
340393 pkgi.csvoriginal
which amounts to a ~14% file size increase from the transformation to YAML (put the other way, the CSV is ~12% smaller than the YAML).
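As a quick check of that figure, here is a small sketch that recomputes the relative change from the two byte counts in the `du` output. The percentage depends on which file is taken as the baseline, so both directions are shown.

```python
# Byte counts taken from the du output above.
yaml_size = 387951
csv_size = 340393

diff = yaml_size - csv_size

# Growth measured against the original CSV.
increase_vs_csv = diff / csv_size * 100
# Savings measured against the YAML file.
reduction_vs_yaml = diff / yaml_size * 100

print(f"difference: {diff} bytes")
print(f"YAML is {increase_vs_csv:.1f}% larger than the CSV")
print(f"CSV is {reduction_vs_yaml:.1f}% smaller than the YAML")
```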
What's wrong with editing CSV files? 1) load them into Excel/LibreOffice/Google Sheets, 2) edit, 3) export back to csv.
What's wrong with editing CSV files?
In European regions, Excel exports CSV files with semicolons instead of commas by default, because the comma is commonly used as the decimal separator there (e.g. 12,34 instead of 12.34). In general, I think editing a file directly in whatever text editor you like is much more palatable than having to open a huge office suite or a slow web page every time you want to make a small edit to your database. The workflow is the same with an editor, except it's much easier! You could argue that you can also edit the CSV with a simple text editor, but even right now the db is barely editable compared to the YAML version, which is much easier to read (and more flexible; I'll expand on this later), so I expect it to become completely unreadable the moment more entries are added. And what if you only have the Vita (no computer) and need to edit the db to fix a link?
To further my point, let's compare CSV with YAML in more detail:
Parsability
The only advantage I can attribute to CSV here is that it is smaller than YAML, by nature of its simplistic format. But that simplicity also brings some flaws, in particular with the kinds of strings you have to deal with on the Vita (game names). Also, even the current parsing has bugs: #19
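To illustrate the kind of flaw meant here, the sketch below shows delimiter collision: a title containing a comma shifts every following column when a line is split naively. The row is entirely hypothetical (the IDs, title, and URL are made up), and nothing here is a claim about how pkgi's own parser behaves; it only shows why a comma-separated format needs quoting rules.

```python
import csv
import io

# Hypothetical row; field order follows the CSV sample earlier in the thread:
# contentid, flags, name, name2, zrif, url, size, checksum
fields = [
    "XX0000-XXXX00000_00-0000000000000000",
    "0",
    "Hello, World",  # a title that contains the delimiter
    "",
    "zrif-placeholder",
    "http://example.invalid/dummy.pkg",
    "1",
    "checksum-placeholder",
]

# Writing the row without quoting and splitting on commas yields one field too many.
unquoted_line = ",".join(fields)
naive = unquoted_line.split(",")
print(len(fields), len(naive))

# A CSV writer/reader pair that applies quoting round-trips the row intact.
buf = io.StringIO()
csv.writer(buf).writerow(fields)
parsed = next(csv.reader(io.StringIO(buf.getvalue())))
print(parsed == fields)
```

The trade-off is that once quoting is needed, the file is no longer a plain "split on commas" format, and hand-editing it becomes more error-prone.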
YAML is a standard, so parsers exist for quite a lot of languages. It also makes parsing easier should the database ever "evolve" beyond pkgi itself; for example, a web site that automatically stores and updates the database.
Design
The way CSV is structured lends itself to several design flaws. I can point out a simple one here: the "name2" item (name_org, aka. original name in the code) is flawed. The "name" entry holds the original name of the game when it stands alone; but if the game name is not alphabetic (A-Z, a-z), "name" holds a "translated" alphabetic name and "name2" holds the original name instead. What if, instead of this dual function, "name" simply always contained the original name, and "name2" only contained the alternative name? Then "name" would always be a consistent entry, and "name2" could be used when it exists. Except this cannot be solved reliably with the current CSV implementation, because it would break the parsing.
Another example: how do you handle games that are the same (same name, etc.) but have different regions? How do you even find the region of a game? CSV has no answer for this, so the solution in the code is to parse the titleid to figure out the region.
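As a rough sketch of the kind of lookup described above, here is one way a region could be derived from an ID. The mapping is an assumption inferred only from the three sample rows in this thread (the UP, EP, and JP content-ID prefixes); pkgi's actual code parses the title ID and may well use different rules. `REGION_BY_PREFIX` and `guess_region` are hypothetical names.

```python
# Assumed mapping, inferred from the sample rows in this thread only.
REGION_BY_PREFIX = {
    "UP": "USA",
    "EP": "EUR",
    "JP": "JPN",
}


def guess_region(contentid: str) -> str:
    """Return a region label for a content ID, or "unknown" if the prefix is not mapped."""
    return REGION_BY_PREFIX.get(contentid[:2], "unknown")


for cid in (
    "UP2089-PCSE00582_00-ADVENTURETIMEPAK",
    "EP0082-PCSB00975_00-ADVENTURESOFMANA",
    "JP0082-PCSG00759_00-SEIKENFFGAIDENRM",
):
    print(cid, "->", guess_region(cid))
```

The point is that the region is implicit in the data and has to be reverse-engineered by the reader, rather than being stated in the database itself.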
What if I add some new item, but decide to remove it later because the functionality became obsolete? Good luck with CSV: now I have one extra comma to add to every line, and every entry takes more mental gymnastics as manual editing becomes even harder and the database gets bigger.
What about YAML?
So, why did I suggest using YAML instead of CSV? Simply, because it solves all these problems, while managing to keep a good size (I think with some actual effort it can be made even smaller than my lazy conversion).
Let's tackle the problem I mentioned earlier about regions. As mentioned in this discussion, most of the work lies on the database side. Take my example database here; let's make something that looks much better and is easier to parse as well! (Note that even if you keep the same layout as the CSV, i.e. one entry per item, you can still use node anchors and references to omit the repeated data, which leads to a smaller size overall.)
- contentid:
    JPN: JP0082-PCSG00759_00-SEIKENFFGAIDENRM
    EUR: EP0082-PCSB00975_00-ADVENTURESOFMANA
  name: Adventures of Mana
  name2: 聖剣伝説FF外伝
  zrif:
    JPN: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    EUR: bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
  url:
    JPN: http://zeus.dl.playstation.net/cdn/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/japan.pkg
    EUR: http://zeus.dl.playstation.net/cdn/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/europe.pkg
  size:
    JPN: 375289280
    EUR: 532326688
  checksum:
    JPN: 566642fbfe8b4c9ac2f5690b001e61b7bca609a3a5ef94b22b04e8d19c30e0c4
    EUR: 387845c6100dcf12be914220e246cbf5c227c12c79d686f8231fc3d166c85f0f
  author: Square Enix
In fact, here I can add new items easily, and I can omit some optional items if I want (such as, say, a game description, or the author, or all kinds of lesser data). With this design, data can be differentiated for multiple regions, or just given as a single entry if needed.
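A minimal sketch of how a reader could consume such an entry, assuming the file has already been parsed into plain dictionaries (which is what YAML and JSON parsers typically produce). The `resolve` helper is hypothetical: it returns a field as-is when it is a single value, and picks the per-region value when the field is a mapping. Only a few fields of the entry above are reproduced.

```python
from typing import Any

# Part of the entry above, as the structure a parser would hand back.
entry = {
    "contentid": {
        "JPN": "JP0082-PCSG00759_00-SEIKENFFGAIDENRM",
        "EUR": "EP0082-PCSB00975_00-ADVENTURESOFMANA",
    },
    "name": "Adventures of Mana",
    "name2": "聖剣伝説FF外伝",
    "size": {"JPN": 375289280, "EUR": 532326688},
    "author": "Square Enix",
}


def resolve(entry: dict, key: str, region: str, default: Any = None) -> Any:
    """Look up a field that is either shared by all regions or split per region."""
    value = entry.get(key, default)
    if isinstance(value, dict):
        # Per-region field: pick the requested region, if present.
        return value.get(region, default)
    # Shared field (or missing optional field).
    return value


print(resolve(entry, "name", "EUR"))         # shared across regions
print(resolve(entry, "size", "EUR"))         # region-specific
print(resolve(entry, "description", "EUR"))  # optional item that was omitted
```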
Why not use a tool like csved? Though I must say YAML makes more sense for DLC.
http://csved.sjfrancke.nl
As for your worries about Linux users: it runs great in Wine! It's a snappy tool that is freeware (not FOSS, though, if that is a concern). On something like a Vita, this 12 percent reduction in size is significant.
We could implement off-the-shelf gzip compression of the YAML file to get our 12 percent back, though!
Simply put the .yml file into a gzip container and decompress it on demand. We could also easily use TheFlow's rar implementation (or zip, but it's less efficient).
https://github.com/TheOfficialFloW/VitaShell/tree/master/unrarlib
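The compress-then-decompress-on-demand idea can be sketched with Python's standard `gzip` module. This only illustrates the round trip; on the Vita itself it would presumably be done in C with an existing library, which is not shown here. The database text below is a synthetic placeholder, and because it is highly repetitive, the size ratio it produces says nothing about a real pkgi database.

```python
import gzip

# Synthetic placeholder database: one entry repeated many times.
text = (
    "- contentid: UP2089-PCSE00582_00-ADVENTURETIMEPAK\n"
    "  name: Adventure Time\n"
) * 900

raw = text.encode("utf-8")

# Pack once when publishing the database...
packed = gzip.compress(raw)

# ...and unpack on demand when the database is loaded.
restored = gzip.decompress(packed)

print(f"raw: {len(raw)} bytes, gzip: {len(packed)} bytes")
print("round trip ok:", restored == raw)
```

The cost is that the compressed file is no longer directly editable or shareable as plain text, so it would have to be unpacked before editing.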
Ok, and why yaml, and not toml, json, xml, or just go full sqlite3?
Joining games under same name probably won't happen though.
YAML is braindead simple and has a sane layout that an actual human can easily edit. Basically, it looks pretty. That's one reason to use YAML. And sqlite3 is a huge pain: it requires quite a bit more work to implement unless you use an existing library.
Why not...
SQLite
As stated above, sqlite is a pain. Also, it misses the point of 'plaintext format' and 'easy to edit' - currently, the database has the useful property of being very easily shareable over a plaintext medium.
XML
XML has quite a lot of overhead. In general, XML is more of a markup language than a data format, which doesn't fit the use case here, whereas YAML is a data format.
TOML
From the page itself:
Be warned, this spec is still changing a lot. Until it's marked as 1.0, you should assume that it is unstable and act accordingly.
JSON
This is the other choice in contrast. JSON is very easy to parse, fits the needs (readability, plain-text, resistant to delimiter collision), has been around for a long time, and takes about the same size as YAML. I chose YAML here because of its easier readability, but it comes down to:
- which one is easier to parse? JSON gets the mark here because it is much easier to parse than YAML.
- what features are needed? Interestingly, JSON is essentially a subset of YAML. So, are all the extra features offered by YAML needed for this database? Personally, I don't think so; right now the db is a simple list of objects, and I don't expect it to go much further than that.
So, after looking into it more, it does seem that JSON would be better suited; if my conclusions are wrong, please rebut them.
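For reference, here is a small sketch of what the "simple list of objects" could look like when produced from the existing CSV layout, using only Python's standard library. The column names follow the YAML example earlier in the thread; the zrif, URL, and checksum values are shortened placeholders, and `row_to_entry` is a hypothetical helper. As suggested earlier, empty fields are dropped and a `flags` value of 0 is treated as the default and omitted.

```python
import csv
import io
import json

# Column names follow the YAML example earlier in the thread.
COLUMNS = ["contentid", "flags", "name", "name2", "zrif", "url", "size", "checksum"]

# Two shortened rows; zrif, url and checksum are placeholders.
csv_text = (
    "EP0082-PCSB00975_00-ADVENTURESOFMANA,0,Adventures of Mana,,"
    "zrif-placeholder,http://example.invalid/eu.pkg,532326688,checksum-placeholder\n"
    "JP0082-PCSG00759_00-SEIKENFFGAIDENRM,0,Adventures of Mana,聖剣伝説FF外伝,"
    "zrif-placeholder,http://example.invalid/jp.pkg,375289280,checksum-placeholder\n"
)


def row_to_entry(row):
    """Turn one CSV row into a dict, leaving out empty and default-valued fields."""
    entry = {key: value for key, value in zip(COLUMNS, row) if value != ""}
    if entry.get("flags") == "0":
        del entry["flags"]  # 0 is treated as the default here
    if "size" in entry:
        entry["size"] = int(entry["size"])
    return entry


entries = [row_to_entry(row) for row in csv.reader(io.StringIO(csv_text))]

# ensure_ascii=False keeps non-ASCII titles readable in the output file.
print(json.dumps(entries, ensure_ascii=False, indent=2))
```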