Add more "basic" test samples to cover supported content types
The new models ("standard_v2_x" and "standard_v3_0") support 200+ content types: https://github.com/google/magika/tree/main/assets/models/standard_v3_0/README.md
Ideally, we would have at least one "basic sample" for each of the supported content types (see /tests_data/basic/*).
This issue acts as a call to action -- external help is very welcome!
Important aspects to keep in mind:
- Content types for which we have no samples yet should be prioritized. Among these, prioritize more common content types rather than niche ones.
- The "basic" test samples (in tests_data/basic/<content_type>/*) are supposed to be "easy to recognize". In other words, the goal for these samples is to check that the model does a reasonable job with clear-cut samples, rather than corner-cases.
- It's OK to group a bunch of test cases in a single PR.
- The PR should state the origin of each sample.
- The samples should NOT be taken from existing projects / online resources (in these settings, it would be very challenging to properly document the origin of these files); they should be manually written/created by the PR author.
I'd like to add a handful of basic tests for:
- pickle
- powershell
- ttf
- gif
These would be very welcome! As indicated in the issue, please include a description of how these files were created (especially for the binary ones, such as pickle). An example of how we created some of the test cases: create a new Google Doc, then "export as" various formats. Thanks!
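For a binary type such as pickle, one way to make the origin easy to document is to generate the sample from a short script and paste that script into the PR description. The following is only a sketch under assumptions: the function names, the output file name `sample.pickle`, and the object being pickled are all hypothetical and not a convention of the repository.

```python
import pickle
import tempfile
from pathlib import Path


def build_sample_object():
    """Return a small, hand-written object to serialize (hypothetical content)."""
    return {
        "name": "basic pickle sample",
        "values": [1, 2, 3],
        "nested": {"flag": True, "ratio": 0.5},
    }


def write_pickle_sample(out_dir, name="sample.pickle", protocol=4):
    """Serialize the sample object into out_dir and return the file path.

    The protocol is pinned so that re-running the script produces the same
    kind of file regardless of the interpreter's default protocol.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / name
    path.write_bytes(pickle.dumps(build_sample_object(), protocol=protocol))
    return path


if __name__ == "__main__":
    # Demo in a temporary directory; a real sample would instead be written
    # under tests_data/basic/pickle/ in a checkout of the repository.
    with tempfile.TemporaryDirectory() as tmp:
        sample = write_pickle_sample(Path(tmp) / "pickle")
        print(sample.name, sample.stat().st_size, "bytes")
```

Because the object is written by hand and the script is short, the PR description can state the origin as "generated by the script above" without pointing at any external source.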
Where should I include my description of how I created the files?
Sorry, I reread the issue and now see that it should be included in the PR.
I made a list of the samples added so far, for future reference:
| # | Content Type Label | Added |
|---|---|---|
| 1 | 3gp | |
| 2 | ace | |
| 3 | ai | |
| 4 | aidl | |
| 5 | apk | |
| 6 | applebplist | |
| 7 | appleplist | |
| 8 | asm | ✓ |
| 9 | asp | |
| 10 | autohotkey | |
| 11 | autoit | |
| 12 | awk | |
| 13 | batch | ✓ |
| 14 | bazel | |
| 15 | bib | |
| 16 | bmp | |
| 17 | bzip | |
| 18 | c | ✓ |
| 19 | cab | |
| 20 | cat | |
| 21 | chm | |
| 22 | clojure | |
| 23 | cmake | |
| 24 | cobol | |
| 25 | coff | |
| 26 | coffeescript | |
| 27 | cpp | |
| 28 | crt | |
| 29 | crx | |
| 30 | cs | |
| 31 | csproj | |
| 32 | css | ✓ |
| 33 | csv | ✓ |
| 34 | dart | |
| 35 | deb | |
| 36 | dex | |
| 37 | dicom | |
| 38 | diff | |
| 39 | directory | |
| 40 | dm | |
| 41 | dmg | |
| 42 | doc | |
| 43 | dockerfile | ✓ |
| 44 | docx | ✓ |
| 45 | dsstore | |
| 46 | dwg | |
| 47 | dxf | |
| 48 | elf | |
| 49 | elixir | |
| 50 | emf | |
| 51 | eml | ✓ |
| 52 | empty | ✓ |
| 53 | epub | ✓ |
| 54 | erb | |
| 55 | erlang | |
| 56 | flac | ✓ |
| 57 | flv | |
| 58 | fortran | |
| 59 | gemfile | |
| 60 | gemspec | |
| 61 | gif | |
| 62 | gitattributes | |
| 63 | gitmodules | |
| 64 | go | |
| 65 | gradle | |
| 66 | groovy | |
| 67 | gzip | |
| 68 | h5 | |
| 69 | handlebars | ✓ |
| 70 | haskell | |
| 71 | hcl | |
| 72 | hlp | |
| 73 | htaccess | |
| 74 | html | ✓ |
| 75 | icns | |
| 76 | ico | |
| 77 | ics | |
| 78 | ignorefile | ✓ |
| 79 | ini | ✓ |
| 80 | internetshortcut | |
| 81 | ipynb | |
| 82 | iso | |
| 83 | jar | |
| 84 | java | |
| 85 | javabytecode | |
| 86 | javascript | ✓ |
| 87 | jinja | ✓ |
| 88 | jp2 | |
| 89 | jpeg | ✓ |
| 90 | json | ✓ |
| 91 | jsonl | |
| 92 | julia | |
| 93 | kotlin | |
| 94 | latex | ✓ |
| 95 | lha | |
| 96 | lisp | |
| 97 | lnk | |
| 98 | lua | |
| 99 | m3u | |
| 100 | m4 | |
| 101 | macho | |
| 102 | makefile | ✓ |
| 103 | markdown | ✓ |
| 104 | matlab | |
| 105 | mht | ✓ |
| 106 | midi | |
| 107 | mkv | |
| 108 | mp3 | ✓ |
| 109 | mp4 | |
| 110 | mscompress | |
| 111 | msi | |
| 112 | mum | |
| 113 | npy | |
| 114 | npz | |
| 115 | nupkg | |
| 116 | objectivec | |
| 117 | ocaml | |
| 118 | odp | ✓ |
| 119 | ods | ✓ |
| 120 | odt | ✓ |
| 121 | ogg | ✓ |
| 122 | one | |
| 123 | onnx | |
| 124 | otf | |
| 125 | outlook | ✓ |
| 126 | parquet | |
| 127 | pascal | |
| 128 | pcap | |
| 129 | pdb | |
| 130 | pdf | ✓ |
| 131 | pebin | |
| 132 | pem | ✓ |
| 133 | perl | |
| 134 | php | |
| 135 | pickle | |
| 136 | png | ✓ |
| 137 | po | |
| 138 | postscript | |
| 139 | powershell | |
| 140 | ppt | |
| 141 | pptx | ✓ |
| 142 | prolog | |
| 143 | proteindb | |
| 144 | proto | |
| 145 | psd | ✓ |
| 146 | python | ✓ |
| 147 | pythonbytecode | ✓ |
| 148 | pytorch | ✓ |
| 149 | qt | |
| 150 | r | |
| 151 | rar | |
| 152 | rdf | |
| 153 | rpm | |
| 154 | rst | |
| 155 | rtf | ✓ |
| 156 | ruby | |
| 157 | rust | ✓ |
| 158 | scala | |
| 159 | scss | |
| 160 | sevenzip | |
| 161 | sgml | |
| 162 | shell | |
| 163 | smali | ✓ |
| 164 | snap | |
| 165 | solidity | |
| 166 | sql | |
| 167 | sqlite | |
| 168 | squashfs | |
| 169 | srt | ✓ |
| 170 | stlbinary | |
| 171 | stltext | |
| 172 | sum | |
| 173 | svg | ✓ |
| 174 | swf | |
| 175 | swift | ✓ |
| 176 | symlink | |
| 177 | tar | |
| 178 | tcl | |
| 179 | textproto | |
| 180 | tga | |
| 181 | thumbsdb | |
| 182 | tiff | |
| 183 | toml | ✓ |
| 184 | torrent | |
| 185 | tsv | ✓ |
| 186 | ttf | |
| 187 | twig | ✓ |
| 188 | txt | ✓ |
| 189 | typescript | ✓ |
| 190 | unknown | |
| 191 | vba | |
| 192 | vcxproj | |
| 193 | verilog | |
| 194 | vhdl | |
| 195 | vtt | |
| 196 | vue | |
| 197 | wasm | |
| 198 | wav | ✓ |
| 199 | webm | |
| 200 | webp | |
| 201 | winregistry | |
| 202 | wmf | |
| 203 | woff | |
| 204 | woff2 | |
| 205 | xar | |
| 206 | xls | |
| 207 | xlsb | |
| 208 | xlsx | ✓ |
| 209 | xml | |
| 210 | xpi | |
| 211 | xz | |
| 212 | yaml | ✓ |
| 213 | yara | ✓ |
| 214 | zig | ✓ |
| 215 | zip | ✓ |
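A table like the one above could be regenerated from a checkout rather than maintained by hand. The sketch below assumes that a content type counts as "added" when tests_data/basic/<content_type>/ contains at least one file. The function names are hypothetical, and the list of labels must be supplied by the caller (for example, copied from the model README).

```python
import tempfile
from pathlib import Path


def has_basic_sample(basic_dir, label):
    """True if basic_dir/<label>/ exists and contains at least one file."""
    label_dir = Path(basic_dir) / label
    return label_dir.is_dir() and any(p.is_file() for p in label_dir.iterdir())


def render_coverage_table(basic_dir, labels):
    """Render a Markdown table marking which labels already have a sample."""
    lines = ["| # | Content Type Label | Added |", "|---|---|---|"]
    for index, label in enumerate(sorted(labels), start=1):
        suffix = " ✓ |" if has_basic_sample(basic_dir, label) else " |"
        lines.append(f"| {index} | {label} |{suffix}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Demo on a throwaway directory tree; in practice basic_dir would be
    # tests_data/basic inside a checkout of the repository.
    with tempfile.TemporaryDirectory() as tmp:
        basic = Path(tmp) / "basic"
        (basic / "asm").mkdir(parents=True)
        (basic / "asm" / "sample.asm").write_text("nop\n")
        print(render_coverage_table(basic, ["asm", "gif", "pickle"]))
```

Sorting the labels keeps the numbering stable between runs, so successive versions of the table can be compared directly.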