Add more "basic" test samples to cover supported content types

Open reyammer opened this issue 1 year ago • 5 comments

The new models ("standard_v2_x" and "standard_v3_0") support 200+ content types: https://github.com/google/magika/tree/main/assets/models/standard_v3_0/README.md

Ideally, we would have at least one "basic sample" for each supported content type (see /tests_data/basic/*).

This issue acts as a call to action -- external help is very welcome!

Important aspects to keep in mind:

  • Content types for which we have no samples yet should be prioritized. Among these, prioritize more common content types rather than niche ones.
  • The "basic" test samples (in tests_data/basic/<content_type>/*) are supposed to be "easy to recognize". In other words, the goal of these samples is to check that the model does a reasonable job on clear-cut samples rather than corner cases.
  • It's OK to group a bunch of test cases in a single PR.
  • The PR should state the origin of each sample.
  • The samples should NOT be taken from existing projects or online resources (in those cases, it would be very challenging to properly document the origin of the files); they should be manually written or created by the PR author.
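
To help with the first point (prioritizing content types that have no samples yet), here is a minimal sketch of how one might list the labels still missing a sample. It assumes the tests_data/basic/<content_type>/ layout described above; the function name and the `supported_labels` argument are hypothetical and not part of magika, and the caller is expected to supply the list of supported labels (for example, copied from the model README).

```python
from pathlib import Path


def content_types_without_samples(basic_dir, supported_labels):
    """Return the supported labels that have no sample file under basic_dir/<label>/.

    Hypothetical helper: `basic_dir` is assumed to follow the
    tests_data/basic/<content_type>/ layout, and `supported_labels`
    is supplied by the caller.
    """
    basic_dir = Path(basic_dir)
    missing = []
    for label in sorted(set(supported_labels)):
        label_dir = basic_dir / label
        # A label counts as covered only if its directory holds at least one file.
        has_sample = label_dir.is_dir() and any(
            p.is_file() for p in label_dir.iterdir()
        )
        if not has_sample:
            missing.append(label)
    return missing
```

Calling `content_types_without_samples("tests_data/basic", labels)` from the repository root would then return the labels to look at first; an empty directory is treated the same as a missing one.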

reyammer avatar Aug 30 '24 12:08 reyammer

I'd like to add a handful of basic tests for:

  • pickle
  • powershell
  • ttf
  • gif

miabobia avatar Aug 30 '24 14:08 miabobia

These would be very welcome! As indicated in the issue, please include a description of how these files were created (especially for the binary ones, such as pickle). An example of how we created some of the test cases: create a new Google Doc, then "export as" various formats. Thanks!
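
For a binary format such as pickle, one way to make the origin easy to document is to generate the sample with a short script and paste that script into the PR description. The following is only a sketch under that assumption: the function name, the payload, and the output path are hypothetical, and it uses nothing beyond Python's standard `pickle` module.

```python
import pickle
from pathlib import Path


def write_pickle_sample(path, protocol=4):
    """Write a small, hand-written payload as a pickle file and return the payload.

    Hypothetical helper: the payload is invented for illustration, and the
    protocol is pinned so that regenerating the file gives the same format.
    """
    payload = {"name": "basic sample", "values": [1, 2, 3], "nested": {"ok": True}}
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(payload, f, protocol=protocol)
    return payload
```

For instance, `write_pickle_sample("tests_data/basic/pickle/sample.pickle")` would create the file; the file name here is an assumption, not a naming convention taken from the repository.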

reyammer avatar Aug 30 '24 14:08 reyammer

Where should I include my description of how I created the files?

miabobia avatar Aug 30 '24 14:08 miabobia

> Where should I include my description of how I created the files?

Sorry, I reread the issue and now see that it should be included in the PR.

miabobia avatar Aug 30 '24 14:08 miabobia

I made a list of samples added for further reference:
# Content Type Label Added
1 3gp
2 ace
3 ai
4 aidl
5 apk
6 applebplist
7 appleplist
8 asm
9 asp
10 autohotkey
11 autoit
12 awk
13 batch
14 bazel
15 bib
16 bmp
17 bzip
18 c
19 cab
20 cat
21 chm
22 clojure
23 cmake
24 cobol
25 coff
26 coffeescript
27 cpp
28 crt
29 crx
30 cs
31 csproj
32 css
33 csv
34 dart
35 deb
36 dex
37 dicom
38 diff
39 directory
40 dm
41 dmg
42 doc
43 dockerfile
44 docx
45 dsstore
46 dwg
47 dxf
48 elf
49 elixir
50 emf
51 eml
52 empty
53 epub
54 erb
55 erlang
56 flac
57 flv
58 fortran
59 gemfile
60 gemspec
61 gif
62 gitattributes
63 gitmodules
64 go
65 gradle
66 groovy
67 gzip
68 h5
69 handlebars
70 haskell
71 hcl
72 hlp
73 htaccess
74 html
75 icns
76 ico
77 ics
78 ignorefile
79 ini
80 internetshortcut
81 ipynb
82 iso
83 jar
84 java
85 javabytecode
86 javascript
87 jinja
88 jp2
89 jpeg
90 json
91 jsonl
92 julia
93 kotlin
94 latex
95 lha
96 lisp
97 lnk
98 lua
99 m3u
100 m4
101 macho
102 makefile
103 markdown
104 matlab
105 mht
106 midi
107 mkv
108 mp3
109 mp4
110 mscompress
111 msi
112 mum
113 npy
114 npz
115 nupkg
116 objectivec
117 ocaml
118 odp
119 ods
120 odt
121 ogg
122 one
123 onnx
124 otf
125 outlook
126 parquet
127 pascal
128 pcap
129 pdb
130 pdf
131 pebin
132 pem
133 perl
134 php
135 pickle
136 png
137 po
138 postscript
139 powershell
140 ppt
141 pptx
142 prolog
143 proteindb
144 proto
145 psd
146 python
147 pythonbytecode
148 pytorch
149 qt
150 r
151 rar
152 rdf
153 rpm
154 rst
155 rtf
156 ruby
157 rust
158 scala
159 scss
160 sevenzip
161 sgml
162 shell
163 smali
164 snap
165 solidity
166 sql
167 sqlite
168 squashfs
169 srt
170 stlbinary
171 stltext
172 sum
173 svg
174 swf
175 swift
176 symlink
177 tar
178 tcl
179 textproto
180 tga
181 thumbsdb
182 tiff
183 toml
184 torrent
185 tsv
186 ttf
187 twig
188 txt
189 typescript
190 unknown
191 vba
192 vcxproj
193 verilog
194 vhdl
195 vtt
196 vue
197 wasm
198 wav
199 webm
200 webp
201 winregistry
202 wmf
203 woff
204 woff2
205 xar
206 xls
207 xlsb
208 xlsx
209 xml
210 xpi
211 xz
212 yaml
213 yara
214 zig
215 zip

dukecat0 avatar Nov 02 '25 06:11 dukecat0