Check if input is USFM itself before attempting to parse
How would usfm_grammar 3.x behave if we gave it a random text file? Could we add a check such as: if no \id is found in the first 3 content lines of the file, then bail?
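A pre-check along these lines could be sketched as follows. This is only an illustration that runs before any parser is invoked; it does not call usfm_grammar at all, and the function name `looks_like_usfm`, the `max_content_lines` parameter, and the choice to skip blank lines when counting "content lines" are all assumptions made for the example.

```python
def looks_like_usfm(text: str, max_content_lines: int = 3) -> bool:
    """Cheap heuristic: is there an id marker in the first few content lines?"""
    seen = 0
    for line in text.splitlines():
        # Drop a possible UTF-8 byte order mark and surrounding whitespace.
        stripped = line.lstrip("\ufeff").strip()
        if not stripped:
            continue  # assumption: blank lines are not "content lines"
        # Match "\id" as a whole marker, so "\ide" does not count.
        if stripped == "\\id" or stripped.startswith("\\id "):
            return True
        seen += 1
        if seen >= max_content_lines:
            break  # bail: no \id marker near the top of the file
    return False
```

A caller could then skip the parse step (or report a friendlier error) whenever `looks_like_usfm(text)` is false.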
Here's what I use:
\id GEN ENG-US (p.sfm) - [GTP] Galilee Translation Project 2021[CC0] Hackett [7]
\id AAA BBB-CC (DDDD) - [EEE] Fffffff Fffffffffff Fffffff 2021[CC0] Kkkkkkk [L]
Here the ID line is (theoretically) parsed into the following variables:
| Var | Example | Definition [spec] (data form) |
|---|---|---|
| id (&) | (all) | Project ID-complete field |
| id0 (&) | AAA | Project ID-Book ID [USFM] |
| id1 (&) | BBB | Project ID-ISO639 Language [p.sfm] |
| id2 (&) | CC | Project ID-ISO3166 Country [p.sfm] |
| id3 (&) | DDD | Project ID-Tagging Language [p.sfm] |
| id4 (&) | EEE | Project ID-Acronym (3Letter) [p.sfm] |
| id5 (&) | FFF | Project ID Title [p.sfm] |
| id6 (&) | GGG | Project ID Text Freeze Date [p.sfm] |
| id7 (&) | HHH | Project ID Rights Code [p.sfm] (creative commons) |
| id8 (&) | KKK | Project ID Rights Owner [p.sfm] (of final work) |
| id9 (&) | LLL | Project ID Status Level [p.sfm] (1-7 p.sfm publishing status, not 1-3 USFM community acceptance.) |
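For illustration, the field layout in the table could be pulled apart with a regular expression. This is a rough sketch based only on the two example ID lines above, not an official grammar: the pattern `ID_LINE`, the function `parse_id_line`, and the assumption that the text freeze date is a four-digit year are all hypothetical.

```python
import re
from typing import Dict, Optional

# Assumed layout, inferred from the example lines:
# \id AAA BBB-CC (DDDD) - [EEE] Title 2021[CC0] Owner [L]
ID_LINE = re.compile(
    r"^\\id\s+(?P<id0>\S+)"                  # book ID
    r"\s+(?P<id1>[^\s-]+)-(?P<id2>\S+)"      # language-country
    r"\s+\((?P<id3>[^)]+)\)\s+-\s+"          # tagging language, then " - "
    r"\[(?P<id4>[^\]]+)\]\s+(?P<id5>.+?)"    # acronym, title
    r"\s+(?P<id6>\d{4})\[(?P<id7>[^\]]+)\]"  # freeze year (assumed), rights code
    r"\s+(?P<id8>.+?)\s+\[(?P<id9>[^\]]+)\]\s*$"  # rights owner, status level
)

def parse_id_line(line: str) -> Optional[Dict[str, str]]:
    """Return the id0..id9 fields plus the complete field, or None."""
    stripped = line.strip()
    match = ID_LINE.match(stripped)
    if match is None:
        return None  # the line does not follow the assumed layout
    fields = match.groupdict()
    # "id" holds the complete field: everything after the \id marker.
    fields["id"] = stripped[len("\\id"):].strip()
    return fields
```

Applied to the first example line, this yields `id0 = "GEN"`, `id1 = "ENG"`, `id2 = "US"`, `id3 = "p.sfm"`, and so on; a plain USFM line such as `\id GEN` returns `None` because it lacks the extra fields.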
So, specifically regarding USFM conformance:
If exactly "(USFM)" is found before the first dash on the ID line, then the work should conform to the USFM standard listed with the \usfm tag, or USFM 2.5 if no \usfm tag is found.
This affects linking, tables and images.
- Links \jmp are only intra-document.
- Tables (\tr#) will have no preceding definition \rem or closing \b tags, and will fail somewhat gracefully as blue text paragraphs instead.
- Images will have no print/display size information embedded into them.
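The "(USFM)" rule above could be checked with something like the following sketch. It assumes that "the first dash" means the space-delimited ` - ` separator (the language-country code itself contains a hyphen), that the match is case-sensitive, and that the version is the first token after the `\usfm` marker. The names `USFM_TAG` and `declared_usfm_version` are made up for the example.

```python
import re
from typing import Optional

# First token after a \usfm marker at the start of a line.
USFM_TAG = re.compile(r"^[ \t]*\\usfm[ \t]+(\S+)", re.MULTILINE)

def declared_usfm_version(text: str) -> Optional[str]:
    """Return the USFM version the file claims to follow, or None."""
    id_line = next(
        (ln.strip() for ln in text.splitlines()
         if ln.strip().startswith("\\id ")),
        None,
    )
    if id_line is None:
        return None  # no ID line at all
    # Assumption: the "first dash" is the " - " field separator.
    before_dash = id_line.split(" - ", 1)[0]
    if "(USFM)" not in before_dash:
        return None  # the file makes no USFM conformance claim
    match = USFM_TAG.search(text)
    # Fall back to USFM 2.5 when no \usfm tag is present.
    return match.group(1) if match else "2.5"
```

A file whose ID line carries `(p.sfm)` instead returns `None`, which a caller could treat as "do not expect strict USFM conformance".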
Thank you @cmahte for sharing this! I could not, however, find any official documentation to support this syntax specification. Is this something Paratext (or some such software) does? If so, could you point to the documentation?