libsai Implement initial Sai2 support

Implements initial support for .sai2 file format.

Addresses https://github.com/Wunkolo/libsai/issues/14

Mar 01 '25 19:03 Wunkolo

Scanlines of a dpcm tile seem to implement something similar to PNG's Up filter seen here: http://www.libpng.org/pub/png/spec/1.2/PNG-Filters.html

The first scanline of a tile is always stored literally, and each following row stores the deltas between its pixel values and those of the previous row. This lets tiles compress better when their contents are flat regions of the same color (a benefit similar to the one PNG gets from its filters).
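
As a rough illustration of the idea above (not the libsai implementation), undoing an Up-style filter could look like the sketch below. The function name `UndoUpFilter`, the byte-wise layout, and the 8-bit wrap-around addition are assumptions borrowed from PNG; the real sample width in dpcm tiles may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reconstructs a tile in place. Row 0 is taken as literal data; every later
// row is assumed to hold per-byte deltas against the reconstructed row above.
// Assumption: deltas wrap around modulo 256, as in PNG's Up filter.
void UndoUpFilter(std::vector<std::uint8_t>& tile, std::size_t rowBytes)
{
    if (rowBytes == 0)
        return;
    const std::size_t rows = tile.size() / rowBytes;
    for (std::size_t y = 1; y < rows; ++y)
    {
        for (std::size_t x = 0; x < rowBytes; ++x)
        {
            const std::uint8_t above = tile[(y - 1) * rowBytes + x];
            const std::uint8_t delta = tile[y * rowBytes + x];
            tile[y * rowBytes + x] = static_cast<std::uint8_t>(above + delta);
        }
    }
}
```

A flat tile then stores one literal row followed by rows of zeros, which is what makes the later run-length stage effective.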

Mar 16 '25 05:03 Wunkolo

Hey, I have written this "partial" SAI2 format specification, i.e. "what I know so far" - https://github.com/photopea/SAI2-specification/ , but maybe you know all this.

The number of tiles of a layer (C) is stored in the "layr" chunk.

Mar 16 '25 15:03 photopea

Wow, did you manage to understand the SAI2 pixel compression?

Could you maybe describe it in a natural language, so that I can write it into https://github.com/photopea/SAI2-specification/ ? Also, did you check my specification? Did you know all this?

Apr 27 '25 05:04 photopea

There are multiple formats used for thumbnails: jssf and dpcm. jssf seems to be used in older versions of Sai2 in particular and seems to be very complex, and involves an intermediate color-space conversion. I am still figuring this out.

dpcm, used by newer versions of sai, is basically figured out but it's due for some more testing. The main function for that is here: https://github.com/Wunkolo/libsai/blob/31c7d4d472b3da69c2d4c8de9a77414683b6f891/samples/Thumbnail-Sai2.cpp#L242-L361 Each tile is compressed, and each row within a tile is delta-compressed using the previous row, very similar to the PNG specification's "up" filter. The delta-compressed data itself is also compressed using a modified Run-Length-Encoding scheme seen here: https://github.com/Wunkolo/libsai/blob/31c7d4d472b3da69c2d4c8de9a77414683b6f891/source/sai2.cpp#L164-L311

I am still working on this in my spare time. Trying to document it all in plain-English for someone else while I am still researching and experimenting would slow me down at the moment. When I do care to document it in plain-English, it would just be within this repo similar to this one.

For now, your best bet would be to check my notes in the actual code to follow along if you wish until I start drafting up a specification.

Apr 27 '25 06:04 Wunkolo

Here is a viewer with support for sai and sai2 files. Bugs, lags and crashes are quite likely.

Jul 05 '25 15:07 Mickommic

@Wunkolo I guess that @Mickommic made a Github account today to ask us to download a virus? :D

Jul 05 '25 18:07 photopea

Try it on a virtual machine or in a sandbox. :D

Jul 05 '25 18:07 Mickommic

Hello! I want to thank Wunkolo for his description of sai and sai2. My respect to you. Can you help me with this?

function Sai2ReadThumb(Dst: Pointer; Stream: TStream; Info: PiSai2Info; PixelStep, LineStep: Integer): Boolean;
var
  i, j, w, h: Integer;
  c: Word;
begin
  Result:= False;

  if (Dst = nil) or (Stream = nil)
  or (Info = nil) or (Info^.ThumbOffset = 0) then Exit;

  Stream.Position:= Info^.ThumbOffset;
  Stream.Read(w, 4);//width
  Stream.Read(h, 4);//height
  Stream.Read(i, 4);

  if TiFourChar(i) <> 'jssf' then Exit;

  Stream.Read(i, 2);//width again
  Stream.Read(j, 2);//height again

  if (Word(i) <> Word(w)) or (Word(j) <> Word(h)) then Exit;

  Stream.Read(c, 2);//channels?
(*
then there are 128 bytes of data repeated in each file

02 02 02 02 02 02 03 03 03 03 04 04 04 04 04 04
04 04 04 04 04 06 06 06 06 06 06 06 06 06 06 06
06 06 06 06 08 08 08 08 08 08 08 09 09 09 09 09
09 0A 0A 0A 0A 0A 0C 0C 0C 0C 0E 0E 0E 10 10 10
06 06 06 06 06 06 08 08 08 08 09 09 09 09 09 0C
0C 0C 0C 0C 0C 10 10 10 10 10 10 10 13 13 13 13
13 13 13 13 17 17 17 17 17 17 17 1B 1B 1B 1B 1B
1B 20 20 20 20 20 25 25 25 25 25 25 25 25 25 25
*)


//I don't know how to interpret this



  Exit;


  Result:= True;
end;
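
For readers following along in C++, the header fields read by the Pascal routine above could be parsed roughly as in this sketch. `JssfHeader` and `ParseJssfHeader` are hypothetical names, little-endian storage is assumed, the tag is assumed to appear in the file as the bytes 'j','s','s','f', and the meaning of the 16-bit field after the dimensions is still unconfirmed.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical container for the fields the Pascal code reads.
struct JssfHeader
{
    std::uint32_t width;
    std::uint32_t height;
    std::uint16_t channels;    // unconfirmed, marked "channels?" in the post
    std::uint8_t  tables[128]; // the 128 bytes that repeat in every file
};

// Reads a little-endian unsigned integer of 1 to 4 bytes (assumed byte order).
static std::uint32_t ReadLE(const std::uint8_t* p, int bytes)
{
    std::uint32_t value = 0;
    for (int i = 0; i < bytes; ++i)
        value |= static_cast<std::uint32_t>(p[i]) << (8 * i);
    return value;
}

// Returns false if the buffer is too small, the tag is not "jssf",
// or the repeated 16-bit dimensions disagree with the 32-bit ones.
bool ParseJssfHeader(const std::uint8_t* data, std::size_t size, JssfHeader& out)
{
    const std::size_t headerSize = 4 + 4 + 4 + 2 + 2 + 2 + 128;
    if (data == nullptr || size < headerSize)
        return false;
    out.width  = ReadLE(data + 0, 4);
    out.height = ReadLE(data + 4, 4);
    if (std::memcmp(data + 8, "jssf", 4) != 0)
        return false;
    const std::uint32_t width16  = ReadLE(data + 12, 2);
    const std::uint32_t height16 = ReadLE(data + 14, 2);
    if (width16 != (out.width & 0xFFFFu) || height16 != (out.height & 0xFFFFu))
        return false;
    out.channels = static_cast<std::uint16_t>(ReadLE(data + 16, 2));
    std::memcpy(out.tables, data + 18, sizeof(out.tables));
    return true;
}
```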

Jul 07 '25 06:07 Mickommic

Can you help me with this?

I am still deciphering jssf data in particular at the moment and you can see the state of it here. This makes two people that want me to contribute to their own project's code-base... I'd rather people just follow the code I am writing so I am not bifurcated across several projects.

Jul 07 '25 14:07 Wunkolo

@Wunkolo I think that @Mickommic is a bot.

BTW. I don't want you to contribute to my codebase! I just wanted you to write a description of the SAI2 format in a human language :)

Jul 07 '25 14:07 photopea

Can you help me with this?

I am still deciphering jssf data in particular at the moment and you can see the state of it here. This makes two people that want me to contribute to their own project's code-base... I'd rather people just follow the code I am writing so I am not bifurcated across several projects.

Okay, I got it. Here is the source code of my sai2 file decoder written according to your descriptions. Maybe it will be useful to you. And many thanks to you and everyone who helps you for your work.

Jul 07 '25 17:07 Mickommic

Hello! I believe that jssf is a JPEG stream without any jpeg headers, except for two quantization tables. One table for the luminance channel,

02 02 02 02 02 02 03 03
03 03 04 04 04 04 04 04
04 04 04 04 04 06 06 06
06 06 06 06 06 06 06 06
06 06 06 06 08 08 08 08
08 08 08 09 09 09 09 09
09 0A 0A 0A 0A 0A 0C 0C
0C 0C 0E 0E 0E 10 10 10

the other for the chrominance channels.

06 06 06 06 06 06 08 08
08 08 09 09 09 09 09 0C
0C 0C 0C 0C 0C 10 10 10
10 10 10 10 13 13 13 13
13 13 13 13 17 17 17 17
17 17 17 1B 1B 1B 1B 1B
1B 20 20 20 20 20 25 25
25 25 25 25 25 25 25 25

I tried to decode the bitstream using the typical Huffman tables defined in the JPEG specification. But it didn't work. I believe that SYSTEMAX has predefined its own Huffman tables, which should be contained in the exe in the constants section.

Jul 08 '25 12:07 Mickommic

This is what I suspect as well, even just based on the jssf name alone and some of the subroutines around reading/writing. If the user's system has more than 1 core, then it seems to dispatch threads that work in 32x32-pixel tiles, indicating that these 4096-byte (or less) chunks of data are likely a specialized jfif stream like you describe. It's possible it's storing DCT-coefficients for 32x32 tiles, or a 4x4 grid of 8x8 blocks, more akin to the jpeg standard.

Jul 08 '25 13:07 Wunkolo

It might be a little bit more direct than that, here are the suspected subroutines that each thread uses when decompressing Jssf data, there are some SSE instructions to decipher still to see what's going on here at a higher level. My first suspicion linked earlier was that it was just a compressed stream of YUV data since I seemed to have noticed some matrix-like color conversion arithmetic that used coefficients similar to YUV->RGB conversions, which is the color-space that JPEG uses.

Jssf Token Decompression?

__int64 __fastcall DecompressJssfToken(
        CanvasTileDispatchProc *DispatchInfo,
        void *UserData,
        __int64 Dest,
        int Channel,
        const __m128i *a5)
{
  __m128i *v5; // rbx
  __int64 Channel_1; // rsi

  v5 = (__m128i *)*((_QWORD *)UserData + 8);
  Channel_1 = Channel;
  sub_140206410(a5, *((__m128 **)UserData + 9), v5);
  sub_1402060F4(
    (__m128 *)&DispatchInfo->field_2D0
  + 16 * (unsigned __int64)*((unsigned __int8 *)&DispatchInfo->field_11C0 + 10 * Channel_1 + 5),
    (__m128 *)&DispatchInfo->field_4D0,
    v5,
    v5);
  return sub_1401FF5A0((__int64)DispatchInfo, Dest, Channel_1, v5->m128i_i16);
}

sub_140206410

void __fastcall sub_140206410(const __m128i *a1, __m128 *a2, _QWORD *a3)
{
  __m128 *DestFloat; // rdi
  __m128i si128; // xmm5
  __m128i v6; // xmm6
  __m128i v7; // xmm7
  int v8; // ecx
  __m128i v9; // xmm1
  __m128i v10; // xmm2
  __m128i v11; // xmm4
  __m128i v12; // xmm1
  __m128i v13; // xmm3
  __m128i v14; // xmm1
  __m128i v15; // xmm1
  __m128i v16; // xmm4
  __m128i v17; // xmm2
  __m128i v18; // xmm3
  __m128i v19; // xmm2
  __m128i v20; // xmm4
  __m128i v21; // xmm2
  __m128 *v22; // rsi
  __m128 v24; // xmm8
  int v25; // ecx
  __m128 v26; // xmm2
  __m128 v27; // xmm3
  __m128 v28; // xmm4
  __m128 v29; // xmm5
  __m128 v30; // xmm0
  __m128 v31; // xmm1
  __m128 v32; // xmm2
  __m128 v33; // xmm3
  __m128 v34; // xmm4
  __m128i v35; // xmm0
  __m128i v36; // xmm4
  __m128 v37; // xmm0
  __m128i v38; // xmm2
  __m128i v39; // xmm3
  __m128 v40; // xmm4
  __m128 v41; // xmm5
  __m128 v42; // xmm6
  __m128 v43; // xmm7
  __m128 v44; // xmm9
  __m128 v45; // xmm0
  __m128 v46; // xmm1
  __m128 v47; // xmm2
  __m128 v48; // xmm3
  __m128i v49; // xmm4
  __m128i v50; // xmm5
  __m128i v51; // xmm6
  __m128i v52; // xmm7

  DestFloat = a2;
  si128 = _mm_load_si128((const __m128i *)&xmmword_140206240);
  v6 = _mm_load_si128((const __m128i *)&xmmword_140206250);
  v7 = _mm_load_si128((const __m128i *)&xmmword_140206260);
  v8 = 8;
  do
  {
    --v8;
    v9 = _mm_load_si128(a1++);
    v10 = _mm_srli_si128(_mm_shufflehi_epi16(v9, 27), 8);
    v11 = _mm_sub_epi16(_mm_move_epi64(v9), v10);
    v12 = _mm_add_epi16(v9, v10);
    v13 = _mm_shufflelo_epi16(v12, 187);
    v14 = _mm_unpacklo_epi32(_mm_add_epi16(v12, v13), _mm_sub_epi16(_mm_move_epi64(v12), v13));
    v15 = _mm_unpacklo_epi64(v14, v14);
    v16 = _mm_unpacklo_epi64(v11, v11);
    v17 = _mm_shufflehi_epi16(_mm_shufflelo_epi16(v16, 235), 250);
    v18 = _mm_madd_epi16(_mm_add_epi16(_mm_shufflehi_epi16(_mm_shufflelo_epi16(v16, 80), 65), v17), v6);
    v19 = _mm_add_epi16(v17, v16);
    v20 = _mm_add_epi32(
            _mm_madd_epi16(
              _mm_unpacklo_epi16(v16, _mm_shufflelo_epi16(_mm_add_epi16(v19, _mm_shufflelo_epi16(v19, 85)), 0)),
              v7),
            v18);
    v21 = _mm_madd_epi16(
            _mm_add_epi16(v15, _mm_shufflehi_epi16(_mm_shufflelo_epi16(_mm_srli_epi64(v15, 0x20u), 63), 223)),
            si128);
    *DestFloat = _mm_cvtepi32_ps(_mm_unpacklo_epi32(v21, v20));
    DestFloat[1] = _mm_cvtepi32_ps(_mm_unpackhi_epi32(v21, v20));
    DestFloat += 2;
  }
  while ( v8 );
  v22 = a2;
  v24 = (__m128)_mm_load_si128((const __m128i *)&xmmword_140206270);
  v25 = 2;
  do
  {
    --v25;
    v26 = _mm_add_ps(*v22, v22[14]);
    v27 = _mm_add_ps(v22[2], v22[12]);
    v28 = _mm_add_ps(v22[4], v22[10]);
    v29 = _mm_add_ps(v22[6], v22[8]);
    v30 = _mm_add_ps(v26, v29);
    v31 = _mm_add_ps(v27, v28);
    v32 = _mm_sub_ps(v26, v29);
    v33 = _mm_sub_ps(v27, v28);
    v34 = v30;
    v35 = _mm_cvtps_epi32(_mm_mul_ps(_mm_add_ps(v30, v31), v24));
    *a3 = _mm_packs_epi32(v35, v35).m128i_u64[0];
    v36 = _mm_cvtps_epi32(_mm_mul_ps(_mm_sub_ps(v34, v31), v24));
    a3[8] = _mm_packs_epi32(v36, v36).m128i_u64[0];
    v37 = _mm_mul_ps(_mm_add_ps(v32, v33), (__m128)xmmword_1402062A0);
    v38 = _mm_cvtps_epi32(_mm_add_ps(_mm_mul_ps(v32, (__m128)xmmword_1402062B0), v37));
    a3[4] = _mm_packs_epi32(v38, v38).m128i_u64[0];
    v39 = _mm_cvtps_epi32(_mm_add_ps(_mm_mul_ps(v33, (__m128)xmmword_1402062F0), v37));
    a3[12] = _mm_packs_epi32(v39, v39).m128i_u64[0];
    v40 = _mm_sub_ps(*v22, v22[14]);
    v41 = _mm_sub_ps(v22[2], v22[12]);
    v42 = _mm_sub_ps(v22[4], v22[10]);
    v43 = _mm_sub_ps(v22[6], v22[8]);
    v44 = _mm_mul_ps(_mm_add_ps(_mm_add_ps(_mm_add_ps(v40, v41), v42), v43), (__m128)xmmword_1402062D0);
    v45 = _mm_mul_ps(_mm_add_ps(v40, v43), (__m128)xmmword_1402062C0);
    v46 = _mm_mul_ps(_mm_add_ps(v41, v42), (__m128)xmmword_140206320);
    v47 = _mm_mul_ps(_mm_add_ps(v40, v42), (__m128)xmmword_140206290);
    v48 = _mm_mul_ps(_mm_add_ps(v41, v43), (__m128)xmmword_140206300);
    v49 = _mm_cvtps_epi32(_mm_add_ps(_mm_add_ps(_mm_add_ps(_mm_mul_ps(v40, (__m128)xmmword_1402062E0), v45), v47), v44));
    a3[2] = _mm_packs_epi32(v49, v49).m128i_u64[0];
    v50 = _mm_cvtps_epi32(_mm_add_ps(_mm_add_ps(_mm_add_ps(_mm_mul_ps(v41, (__m128)xmmword_140206330), v46), v48), v44));
    a3[6] = _mm_packs_epi32(v50, v50).m128i_u64[0];
    v51 = _mm_cvtps_epi32(_mm_add_ps(_mm_add_ps(_mm_add_ps(_mm_mul_ps(v42, (__m128)xmmword_140206310), v46), v47), v44));
    a3[10] = _mm_packs_epi32(v51, v51).m128i_u64[0];
    v52 = _mm_cvtps_epi32(_mm_add_ps(_mm_add_ps(_mm_add_ps(_mm_mul_ps(v43, (__m128)xmmword_140206280), v45), v48), v44));
    a3[14] = _mm_packs_epi32(v52, v52).m128i_u64[0];
    ++v22;
    ++a3;
  }
  while ( v25 );
}

sub_1402060F4

__int64 __fastcall sub_1402060F4(__m128 *a1, __m128 *a2, const __m128i *a3, __m128i *a4)
{
  __m128 v7; // xmm4
  __int64 result; // rax
  __m128 v9; // xmm5
  __m128 v10; // xmm5
  int v11; // ecx
  __m128i si128; // xmm2
  __m128 v13; // xmm1
  __m128 v14; // xmm2

  v7 = (__m128)_mm_slli_epi32((__m128i)-1i64, 0x1Fu);
  result = 0x3F800000i64;
  v9 = (__m128)_mm_cvtsi32_si128(0x3F800000u);
  v10 = _mm_shuffle_ps(v9, v9, 0);
  v11 = 8;
  do
  {
    --v11;
    si128 = _mm_load_si128(a3);
    v13 = _mm_mul_ps(_mm_cvtepi32_ps(_mm_srai_epi32(_mm_unpacklo_epi16(si128, si128), 0x10u)), *a1);
    v14 = _mm_mul_ps(_mm_cvtepi32_ps(_mm_srai_epi32(_mm_unpackhi_epi16(si128, si128), 0x10u)), a1[1]);
    *a4 = _mm_packs_epi32(
            _mm_cvttps_epi32(_mm_add_ps(v13, _mm_mul_ps(_mm_or_ps(_mm_and_ps(v13, v7), v10), *a2))),
            _mm_cvttps_epi32(_mm_add_ps(v14, _mm_mul_ps(_mm_or_ps(_mm_and_ps(v14, v7), v10), a2[1]))));
    a1 += 2;
    a2 += 2;
    ++a3;
    ++a4;
  }
  while ( v11 );
  return result;
}

sub_1401FF5A0

__int64 __fastcall sub_1401FF5A0(__int64 a1, __int64 a2, int a3, __int16 *a4)
{
  int v6; // r12d
  __int64 v8; // rsi
  unsigned int v9; // ebx
  __int64 v10; // rcx
  __int64 v11; // r8
  __int64 v12; // rdx
  unsigned __int64 v13; // rax
  int *v14; // r12
  unsigned __int64 v15; // rax
  int v16; // esi
  unsigned __int16 *v17; // rdi
  __int64 v18; // rax
  int v19; // r13d
  unsigned __int64 v20; // rbx
  int v21; // ebx
  unsigned int v23; // [rsp+60h] [rbp+18h]

  v6 = *a4 - *(__int16 *)(a2 + 2i64 * a3);
  v8 = 5i64 * a3;
  v23 = *a4;
  *(_WORD *)(a2 + 2i64 * a3) = *a4;
  if ( v6 )
  {
    _BitScanReverse(&v9, abs32(v6));
    v10 = (int)++v9 + 16i64 * *(unsigned __int8 *)(a1 + 10i64 * a3 + 4550);
    sub_1401FEEB0(a2, *(unsigned __int16 *)(a1 + 4 * v10 + 1488), *(unsigned __int16 *)(a1 + 4 * v10 + 1490));
    v11 = v9;
    v12 = ((1 << v9) - 1) & (unsigned int)(v6 + (v6 >> 31));
  }
  else
  {
    v13 = (unsigned __int64)*(unsigned __int8 *)(a1 + 10i64 * a3 + 4550) << 6;
    v11 = *(unsigned __int16 *)(v13 + a1 + 1490);
    v12 = *(unsigned __int16 *)(v13 + a1 + 1488);
  }
  sub_1401FEEB0(a2, v12, v11);
  v14 = (int *)&unk_1402906C4;
  v15 = (unsigned __int64)*(unsigned __int8 *)(a1 + 2 * v8 + 4551) << 10;
  v16 = 0;
  v17 = (unsigned __int16 *)(v15 + a1 + 1616);
  do
  {
    v18 = *v14;
    v19 = a4[v18];
    if ( a4[v18] )
    {
      if ( v16 >= 16 )
      {
        v20 = (unsigned __int64)(unsigned int)v16 >> 4;
        v16 += -16 * ((unsigned int)v16 >> 4);
        do
        {
          sub_1401FEEB0(a2, v17[480], v17[481]);
          --v20;
        }
        while ( v20 );
      }
      _BitScanReverse((unsigned int *)&v21, abs32(v19));
      ++v21;
      sub_1401FEEB0(
        a2,
        v17[2 * (v21 | (unsigned __int64)(16 * v16))],
        v17[2 * (v21 | (unsigned __int64)(16 * v16)) + 1]);
      sub_1401FEEB0(a2, ((1 << v21) - 1) & (unsigned int)(v19 + (v19 >> 31)), (unsigned int)v21);
      v16 = 0;
    }
    else
    {
      ++v16;
    }
    ++v14;
  }
  while ( (__int64)v14 < (__int64)&unk_1402907C0 );
  if ( !v19 )
    sub_1401FEEB0(a2, *v17, v17[1]);
  return v23;
}

Jul 08 '25 14:07 Wunkolo

It might be a little bit more direct than that, here are the suspected subroutines that each thread uses when decompressing Jssf data, there are some SSE instructions to decipher still to see what's going on here at a higher level. My first suspicion linked earlier was that it was just a compressed stream of YUV data since I seemed to have noticed some matrix-like color conversion arithmetic that used coefficients similar to YUV->RGB conversions, which is the color-space that JPEG uses.

Jssf Token Decompression? sub_140206410 sub_1402060F4 sub_1401FF5A0

Hello! I have to disappoint you, Wunkolo, but these are 8x8 block encoding procedures with Forward DCT, Quantization and Huffman coding. See my comments below.

DecompressJssfToken

__int64 __fastcall DecompressJssfToken(//8x8 sample block encoding procedure
        CanvasTileDispatchProc *DispatchInfo,
        void *UserData,
        __int64 Dest,
        int Channel,
        const __m128i *a5)
{
  __m128i *v5; // rbx
  __int64 Channel_1; // rsi

  v5 = (__m128i *)*((_QWORD *)UserData + 8);
  Channel_1 = Channel;
  sub_140206410(a5, *((__m128 **)UserData + 9), v5);//Forward DCT procedure
  sub_1402060F4(                        //Quantization procedure
    (__m128 *)&DispatchInfo->field_2D0
  + 16 * (unsigned __int64)*((unsigned __int8 *)&DispatchInfo->field_11C0 + 10 * Channel_1 + 5),
    (__m128 *)&DispatchInfo->field_4D0,
    v5,
    v5);
  return sub_1401FF5A0((__int64)DispatchInfo, Dest, Channel_1, v5->m128i_i16);//Huffman coding procedure
}

sub_140206410 - Forward DCT procedure

void __fastcall sub_140206410(const __m128i *a1, __m128 *a2, _QWORD *a3)//Forward DCT procedure
{
  __m128 *DestFloat; // rdi
  __m128i si128; // xmm5
//...
  __m128i v51; // xmm6
  __m128i v52; // xmm7

  DestFloat = a2;
  si128 = _mm_load_si128((const __m128i *)&xmmword_140206240);//si128 = W(o0, o1, o2, o3, o4, o5, o6, o7) multipliers
  v6 = _mm_load_si128((const __m128i *)&xmmword_140206250);//v6 = W(m0, m1, m2, m3, m4, m5, m6, m7) multipliers
  v7 = _mm_load_si128((const __m128i *)&xmmword_140206260);//v7 = W(n0, n1, n2, n3, n4, n5, n6, n7) multipliers
  v8 = 8;
  do
  {//per line cycle in block 8x8
    --v8;
    v9 = _mm_load_si128(a1++);//v9 = W(a0, a1, a2, a3, a4, a5, a6, a7)
    v10 = _mm_srli_si128(_mm_shufflehi_epi16(v9, 27), 8);//v10 = W(a7, a6, a5, a4, zeros)

                                                 //          b0     b1     b2     b3
    v11 = _mm_sub_epi16(_mm_move_epi64(v9), v10);//v11 = W(a0-a7, a1-a6, a2-a5, a3-a4, zeros)

                                 //          c0     c1     c2     c3
    v12 = _mm_add_epi16(v9, v10);//v12 = W(a0+a7, a1+a6, a2+a5, a3+a4, trash)
    v13 = _mm_shufflelo_epi16(v12, 187);//v13 = W(c3, c2, trash)
                                                                                               //          d0     d1     d2     d3
    v14 = _mm_unpacklo_epi32(_mm_add_epi16(v12, v13), _mm_sub_epi16(_mm_move_epi64(v12), v13));//v14 = W(c0+c3, c1+c2, c0-c3, c1-c2, trash) (the 32-bit unpack keeps word pairs together)
    v15 = _mm_unpacklo_epi64(v14, v14);//v15 = W(d0, d1, d2, d3, d0, d1, d2, d3)
    v16 = _mm_unpacklo_epi64(v11, v11);//v16 = W(b0, b1, b2, b3, b0, b1, b2, b3)
    v17 = _mm_shufflehi_epi16(_mm_shufflelo_epi16(v16, 235), 250);//v17 = W(b3, b2, b2, b3, b2, b2, b3, b3)

    v18 = _mm_madd_epi16(_mm_add_epi16(_mm_shufflehi_epi16(_mm_shufflelo_epi16(v16, 80), 65), v17), v6);
//                   e0                     e1                     e2                     e3
//v18 = DW((b0+b3)*m0+(b0+b2)*m1, (b1+b2)*m2+(b1+b3)*m3, (b1+b2)*m4+(b0+b2)*m5, (b0+b3)*m6+(b1+b3)*m7)

                                  //          f0     f1
    v19 = _mm_add_epi16(v17, v16);//v19 = W(b0+b3, b1+b2, trash)
    v20 = _mm_add_epi32(
            _mm_madd_epi16(
              _mm_unpacklo_epi16(v16, _mm_shufflelo_epi16(_mm_add_epi16(v19, _mm_shufflelo_epi16(v19, 85)), 0)),
              v7),
            v18);
//                  g0                   g1                   g2                   g3
//v20 = DW(b0*n0+(f0+f1)*n1+e0, b1*n2+(f0+f1)*n3+e1, b2*n4+(f0+f1)*n5+e2, b3*n6+(f0+f1)*n7+e3)

    v21 = _mm_madd_epi16(
            _mm_add_epi16(v15, _mm_shufflehi_epi16(_mm_shufflelo_epi16(_mm_srli_epi64(v15, 0x20u), 63), 223)),
            si128);
//              h0               h1            h2             h3
//v21 = DW(d0*o0+d1*o1, d2*o2+(d2+d3)*o3, d0*o4+d1*o5, (d2+d3)*o6+d3*o7)

    *DestFloat = _mm_cvtepi32_ps(_mm_unpacklo_epi32(v21, v20));//S(h0, g0, h1, g1)
    DestFloat[1] = _mm_cvtepi32_ps(_mm_unpackhi_epi32(v21, v20));//S(h2, g2, h3, g3)
    DestFloat += 2;
  }
  while ( v8 );
  v22 = a2;
//       first iteration   second iteration of cycle
//v22 = [a0, b0, c0, d0,   a0, b0, c0, d0
//       a1, b1, c1, d1,   a1, b1, c1, d1
//       a2, b2, c2, d2,   a2, b2, c2, d2
//       a3, b3, c3, d3,   a3, b3, c3, d3
//       a4, b4, c4, d4,   a4, b4, c4, d4
//       a5, b5, c5, d5,   a5, b5, c5, d5
//       a6, b6, c6, d6,   a6, b6, c6, d6
//       a7, b7, c7, d7,   a7, b7, c7, d7]
  v24 = (__m128)_mm_load_si128((const __m128i *)&xmmword_140206270);
  v25 = 2;
  do
  {
    --v25;
    v26 = _mm_add_ps(*v22, v22[14]);
//          e0     e1     e2     e3
//v26 = S(a0+a7, b0+b7, c0+c7, d0+d7)

    v27 = _mm_add_ps(v22[2], v22[12]);
//          f0     f1     f2     f3
//v27 = S(a1+a6, b1+b6, c1+c6, d1+d6)

    v28 = _mm_add_ps(v22[4], v22[10]);
//          g0     g1     g2     g3
//v28 = S(a2+a5, b2+b5, c2+c5, d2+d5)

    v29 = _mm_add_ps(v22[6], v22[8]);
//          h0     h1     h2     h3
//v29 = S(a3+a4, b3+b4, c3+c4, d3+d4)

    v30 = _mm_add_ps(v26, v29);
//          i0     i1     i2     i3
//v30 = S(e0+h0, e1+h1, e2+h2, e3+h3)

    v31 = _mm_add_ps(v27, v28);
//          j0     j1     j2     j3
//v31 = S(f0+g0, f1+g1, f2+g2, f3+g3)

    v32 = _mm_sub_ps(v26, v29);
//          k0     k1     k2     k3
//v32 = S(e0-h0, e1-h1, e2-h2, e3-h3)

    v33 = _mm_sub_ps(v27, v28);
//          l0     l1     l2     l3
//v33 = S(f0-g0, f1-g1, f2-g2, f3-g3)

    v34 = v30;
    v35 = _mm_cvtps_epi32(_mm_mul_ps(_mm_add_ps(v30, v31), v24));
    *a3 = _mm_packs_epi32(v35, v35).m128i_u64[0];
    v36 = _mm_cvtps_epi32(_mm_mul_ps(_mm_sub_ps(v34, v31), v24));
    a3[8] = _mm_packs_epi32(v36, v36).m128i_u64[0];
    v37 = _mm_mul_ps(_mm_add_ps(v32, v33), (__m128)xmmword_1402062A0);
    v38 = _mm_cvtps_epi32(_mm_add_ps(_mm_mul_ps(v32, (__m128)xmmword_1402062B0), v37));
    a3[4] = _mm_packs_epi32(v38, v38).m128i_u64[0];
    v39 = _mm_cvtps_epi32(_mm_add_ps(_mm_mul_ps(v33, (__m128)xmmword_1402062F0), v37));
    a3[12] = _mm_packs_epi32(v39, v39).m128i_u64[0];
    v40 = _mm_sub_ps(*v22, v22[14]);
    v41 = _mm_sub_ps(v22[2], v22[12]);
    v42 = _mm_sub_ps(v22[4], v22[10]);
    v43 = _mm_sub_ps(v22[6], v22[8]);
    v44 = _mm_mul_ps(_mm_add_ps(_mm_add_ps(_mm_add_ps(v40, v41), v42), v43), (__m128)xmmword_1402062D0);
    v45 = _mm_mul_ps(_mm_add_ps(v40, v43), (__m128)xmmword_1402062C0);
    v46 = _mm_mul_ps(_mm_add_ps(v41, v42), (__m128)xmmword_140206320);
    v47 = _mm_mul_ps(_mm_add_ps(v40, v42), (__m128)xmmword_140206290);
    v48 = _mm_mul_ps(_mm_add_ps(v41, v43), (__m128)xmmword_140206300);
    v49 = _mm_cvtps_epi32(_mm_add_ps(_mm_add_ps(_mm_add_ps(_mm_mul_ps(v40, (__m128)xmmword_1402062E0), v45), v47), v44));
    a3[2] = _mm_packs_epi32(v49, v49).m128i_u64[0];
    v50 = _mm_cvtps_epi32(_mm_add_ps(_mm_add_ps(_mm_add_ps(_mm_mul_ps(v41, (__m128)xmmword_140206330), v46), v48), v44));
    a3[6] = _mm_packs_epi32(v50, v50).m128i_u64[0];
    v51 = _mm_cvtps_epi32(_mm_add_ps(_mm_add_ps(_mm_add_ps(_mm_mul_ps(v42, (__m128)xmmword_140206310), v46), v47), v44));
    a3[10] = _mm_packs_epi32(v51, v51).m128i_u64[0];
    v52 = _mm_cvtps_epi32(_mm_add_ps(_mm_add_ps(_mm_add_ps(_mm_mul_ps(v43, (__m128)xmmword_140206280), v45), v48), v44));
    a3[14] = _mm_packs_epi32(v52, v52).m128i_u64[0];
    ++v22;
    ++a3;
  }
  while ( v25 );
}

sub_1402060F4 - Quantization procedure

__int64 __fastcall sub_1402060F4(__m128 *a1, __m128 *a2, const __m128i *a3, __m128i *a4)//Quantization procedure
{
  __m128 v7; // xmm4
  __int64 result; // rax
  __m128 v9; // xmm5
  __m128 v10; // xmm5
  int v11; // ecx
  __m128i si128; // xmm2
  __m128 v13; // xmm1
  __m128 v14; // xmm2

  v7 = (__m128)_mm_slli_epi32((__m128i)-1i64, 0x1Fu);
  result = 0x3F800000i64;
  v9 = (__m128)_mm_cvtsi32_si128(0x3F800000u);
  v10 = _mm_shuffle_ps(v9, v9, 0);
  v11 = 8;
  do
  {
    --v11;
    si128 = _mm_load_si128(a3);
    v13 = _mm_mul_ps(_mm_cvtepi32_ps(_mm_srai_epi32(_mm_unpacklo_epi16(si128, si128), 0x10u)), *a1);
    v14 = _mm_mul_ps(_mm_cvtepi32_ps(_mm_srai_epi32(_mm_unpackhi_epi16(si128, si128), 0x10u)), a1[1]);//multiplication with quantization table values
    *a4 = _mm_packs_epi32(
            _mm_cvttps_epi32(_mm_add_ps(v13, _mm_mul_ps(_mm_or_ps(_mm_and_ps(v13, v7), v10), *a2))),
            _mm_cvttps_epi32(_mm_add_ps(v14, _mm_mul_ps(_mm_or_ps(_mm_and_ps(v14, v7), v10), a2[1]))));//adds a bias (from a2) carrying the sign of the value, then truncates to integers; this rounds to nearest if the bias is 0.5, I guess
    a1 += 2;
    a2 += 2;
    ++a3;
    ++a4;
  }
  while ( v11 );
  return result;
}
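
A scalar reading of the quantization loop above, as a minimal sketch: `QuantizeCoefficient` is a hypothetical name, `scale` stands in for the per-coefficient value loaded from `a1` (presumably a reciprocal of the quantization table entry), and `bias` for the value loaded from `a2` (assumed to be non-negative, possibly 0.5). Neither table has been dumped in this thread, so both are assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Scalar equivalent of one lane of the SSE loop:
//   scaled = coefficient * scale
//   result = trunc(scaled + sign(scaled) * bias), saturated to int16
std::int16_t QuantizeCoefficient(std::int16_t coefficient, float scale, float bias)
{
    const float scaled = static_cast<float>(coefficient) * scale;
    // The SSE code builds +1.0f or -1.0f from the sign bit and multiplies it by
    // the bias; std::copysign gives the same result for a non-negative bias.
    const float biased = scaled + std::copysign(bias, scaled);
    // _mm_packs_epi32 saturates to the signed 16-bit range.
    const float clamped = std::min(32767.0f, std::max(-32768.0f, biased));
    // The float to integer cast truncates toward zero, like _mm_cvttps_epi32.
    return static_cast<std::int16_t>(clamped);
}
```

With a bias of 0.5 this behaves as round-half-away-from-zero, for example 2.5 becomes 3 and -2.5 becomes -3.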

sub_1401FF5A0 - Huffman coding procedure

__int64 __fastcall sub_1401FF5A0(__int64 a1, __int64 a2, int a3, __int16 *a4)//Huffman coding procedure
{
  int v6; // r12d
  __int64 v8; // rsi
  unsigned int v9; // ebx
  __int64 v10; // rcx
  __int64 v11; // r8
  __int64 v12; // rdx
  unsigned __int64 v13; // rax
  int *v14; // r12
  unsigned __int64 v15; // rax
  int v16; // esi
  unsigned __int16 *v17; // rdi
  __int64 v18; // rax
  int v19; // r13d
  unsigned __int64 v20; // rbx
  int v21; // ebx
  unsigned int v23; // [rsp+60h] [rbp+18h]

  v6 = *a4 - *(__int16 *)(a2 + 2i64 * a3);
  v8 = 5i64 * a3;
  v23 = *a4;
  *(_WORD *)(a2 + 2i64 * a3) = *a4;
  if ( v6 )
  {
    _BitScanReverse(&v9, abs32(v6));
    v10 = (int)++v9 + 16i64 * *(unsigned __int8 *)(a1 + 10i64 * a3 + 4550);
    sub_1401FEEB0(a2, *(unsigned __int16 *)(a1 + 4 * v10 + 1488), *(unsigned __int16 *)(a1 + 4 * v10 + 1490));//output to stream code of number of bits of non-zero DC difference (first coefficient minus the previous block's one)
    v11 = v9;//number of bits of DC difference
    v12 = ((1 << v9) - 1) & (unsigned int)(v6 + (v6 >> 31));//DC difference bits
  }
  else
  {
    v13 = (unsigned __int64)*(unsigned __int8 *)(a1 + 10i64 * a3 + 4550) << 6;//(DC Huffman table number depending on channel) * 64
    v11 = *(unsigned __int16 *)(v13 + a1 + 1490);//number of bits of code to encode zero DC difference from DC Huffman table
    v12 = *(unsigned __int16 *)(v13 + a1 + 1488);//code to encode zero DC difference from DC Huffman table
  }
  sub_1401FEEB0(a2, v12, v11);//output to stream
  v14 = (int *)&unk_1402906C4;
  v15 = (unsigned __int64)*(unsigned __int8 *)(a1 + 2 * v8 + 4551) << 10;//(AC Huffman table number depending on channel) * 1024
  v16 = 0;
  v17 = (unsigned __int16 *)(v15 + a1 + 1616);//AC Huffman table by channel
  do
  {
    v18 = *v14;
    v19 = a4[v18];
    if ( a4[v18] )
    {
      if ( v16 >= 16 )
      {
        v20 = (unsigned __int64)(unsigned int)v16 >> 4;
        v16 += -16 * ((unsigned int)v16 >> 4);
        do
        {
          sub_1401FEEB0(a2, v17[480], v17[481]);//output to stream serial code 0xF0 (16 zero coefficients)
          --v20;
        }
        while ( v20 );
      }
      _BitScanReverse((unsigned int *)&v21, abs32(v19));
      ++v21;
      sub_1401FEEB0(
        a2,
        v17[2 * (v21 | (unsigned __int64)(16 * v16))],
        v17[2 * (v21 | (unsigned __int64)(16 * v16)) + 1]);//output to stream pair code of number of bits of non-zero coefficient and number of zero coefficients before it
      sub_1401FEEB0(a2, ((1 << v21) - 1) & (unsigned int)(v19 + (v19 >> 31)), (unsigned int)v21);//output to stream AC coefficient bits
      v16 = 0;
    }
    else
    {
      ++v16;
    }
    ++v14;
  }
  while ( (__int64)v14 < (__int64)&unk_1402907C0 );
  if ( !v19 )
    sub_1401FEEB0(a2, *v17, v17[1]);//output to stream code 0x00 (EOB - End of block)
  return v23;
}
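
The `_BitScanReverse` and `((1 << n) - 1) & (v + (v >> 31))` pattern in the procedure above matches the usual JPEG way of storing a coefficient as a bit count plus that many extra bits. A small sketch of that mapping and its inverse follows; `MagnitudeBits`, `EncodeMagnitude` and `DecodeMagnitude` are hypothetical names, and values are assumed to fit in 16 bits.

```cpp
#include <cstdint>
#include <cstdlib>

struct MagnitudeBits
{
    unsigned      size; // number of extra bits (0 for a zero value)
    std::uint32_t bits; // the extra bits themselves
};

// Encoder side: same result as BitScanReverse(abs(v)) + 1 and
// ((1 << size) - 1) & (v + (v >> 31)) in the decompiled code.
MagnitudeBits EncodeMagnitude(int value)
{
    unsigned size = 0;
    for (unsigned m = static_cast<unsigned>(std::abs(value)); m != 0; m >>= 1)
        ++size;
    // Negative values are stored as (value - 1), masked to `size` bits.
    const int adjusted = value < 0 ? value - 1 : value;
    const std::uint32_t mask = (1u << size) - 1u;
    return {size, static_cast<std::uint32_t>(adjusted) & mask};
}

// Decoder side: a leading 1 bit means the value is positive,
// a leading 0 bit means it was stored as (value - 1).
int DecodeMagnitude(unsigned size, std::uint32_t bits)
{
    if (size == 0)
        return 0;
    const std::uint32_t half = 1u << (size - 1);
    const int full = static_cast<int>((1u << size) - 1u);
    return bits >= half ? static_cast<int>(bits) : static_cast<int>(bits) - full;
}
```

For instance, -5 is written as 3 bits with the pattern 010, and reading 010 back with a size of 3 yields -5 again.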

Jul 09 '25 10:07 Mickommic

Excuse me Wunkolo, but could you get the data from DispatchInfo in the DecompressJssfToken procedure, from offset 1488 to 3664? That is 2176 bytes in total. There should be Huffman tables there. Maybe that's what we need.

Jul 09 '25 13:07 Mickommic

These are 8x8 block encoding procedures with ForwardDCT, Quantization and Huffman coding

Thanks for that. I'm finding more implementation details now: it is a jpeg-like implementation, with the nuance that almost all arithmetic is kept in 16 bits, including the look-up tables and constants, which diverges from the usual JPEG implementations you would find online. The implementation writes SYSTEMAX JFIF Encoder Ver.2.0 when writing to a regular 8-bit jpeg file and shares a lot of common functions and tables before doing a final 8-bit conversion. The main jssf decoding thread dispatch seems to be at 0x140204570 for me (Preview.2024.08.14).

Jul 09 '25 14:07 Wunkolo

Excuse me Wunkolo, but could you get the data from DispatchInfo from the DecompressJssfToken procedure from offset 1488 to 3664. 2176 bytes in total. There should be Huffman tables there. Maybe that's what we need.

There are four huffman tables, it seems. They alternate in size between 32 bytes and 179 bytes.

The ones found in DispatchInfo seem to be written to by a function that reads from static program memory and then "expands" it.

Here is the raw data from program memory

unsigned char HuffmanLut0[32] =
{
  0x00, 0x00, 0x01, 0x05, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x02, 
  0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x00, 
  0x00, 0x00
};

unsigned char HuffmanLut2[32] =
{
  0x01, 0x00, 0x03, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 
  0x01, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x02, 
  0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x00, 
  0x00, 0x00
};


unsigned char HuffmanLut1[179] =
{
  0x10, 0x00, 0x02, 0x01, 0x03, 0x03, 0x02, 0x04, 0x03, 0x05, 
  0x05, 0x04, 0x04, 0x00, 0x00, 0x01, 0x7D, 0x01, 0x02, 0x03, 
  0x00, 0x04, 0x11, 0x05, 0x12, 0x21, 0x31, 0x41, 0x06, 0x13, 
  0x51, 0x61, 0x07, 0x22, 0x71, 0x14, 0x32, 0x81, 0x91, 0xA1, 
  0x08, 0x23, 0x42, 0xB1, 0xC1, 0x15, 0x52, 0xD1, 0xF0, 0x24, 
  0x33, 0x62, 0x72, 0x82, 0x09, 0x0A, 0x16, 0x17, 0x18, 0x19, 
  0x1A, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2A, 0x34, 0x35, 0x36, 
  0x37, 0x38, 0x39, 0x3A, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 
  0x49, 0x4A, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5A, 
  0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6A, 0x73, 0x74, 
  0x75, 0x76, 0x77, 0x78, 0x79, 0x7A, 0x83, 0x84, 0x85, 0x86, 
  0x87, 0x88, 0x89, 0x8A, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 
  0x98, 0x99, 0x9A, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0xA7, 0xA8, 
  0xA9, 0xAA, 0xB2, 0xB3, 0xB4, 0xB5, 0xB6, 0xB7, 0xB8, 0xB9, 
  0xBA, 0xC2, 0xC3, 0xC4, 0xC5, 0xC6, 0xC7, 0xC8, 0xC9, 0xCA, 
  0xD2, 0xD3, 0xD4, 0xD5, 0xD6, 0xD7, 0xD8, 0xD9, 0xDA, 0xE1, 
  0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xF1, 
  0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA
};

unsigned char HuffmanLut3[179] =
{
  0x11, 0x00, 0x02, 0x01, 0x02, 0x04, 0x04, 0x03, 0x04, 0x07, 
  0x05, 0x04, 0x04, 0x00, 0x01, 0x02, 0x77, 0x00, 0x01, 0x02, 
  0x03, 0x11, 0x04, 0x05, 0x21, 0x31, 0x06, 0x12, 0x41, 0x51, 
  0x07, 0x61, 0x71, 0x13, 0x22, 0x32, 0x81, 0x08, 0x14, 0x42, 
  0x91, 0xA1, 0xB1, 0xC1, 0x09, 0x23, 0x33, 0x52, 0xF0, 0x15, 
  0x62, 0x72, 0xD1, 0x0A, 0x16, 0x24, 0x34, 0xE1, 0x25, 0xF1, 
  0x17, 0x18, 0x19, 0x1A, 0x26, 0x27, 0x28, 0x29, 0x2A, 0x35, 
  0x36, 0x37, 0x38, 0x39, 0x3A, 0x43, 0x44, 0x45, 0x46, 0x47, 
  0x48, 0x49, 0x4A, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 
  0x5A, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6A, 0x73, 
  0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7A, 0x82, 0x83, 0x84, 
  0x85, 0x86, 0x87, 0x88, 0x89, 0x8A, 0x92, 0x93, 0x94, 0x95, 
  0x96, 0x97, 0x98, 0x99, 0x9A, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 
  0xA7, 0xA8, 0xA9, 0xAA, 0xB2, 0xB3, 0xB4, 0xB5, 0xB6, 0xB7, 
  0xB8, 0xB9, 0xBA, 0xC2, 0xC3, 0xC4, 0xC5, 0xC6, 0xC7, 0xC8, 
  0xC9, 0xCA, 0xD2, 0xD3, 0xD4, 0xD5, 0xD6, 0xD7, 0xD8, 0xD9, 
  0xDA, 0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 
  0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA
};

This data is expanded and written into the dispatch info at the address range you mentioned:

    CopyExpand_i8_i16((__int16 *)&result_1->HuffmanLUT0, HuffmanLut0); // 0x5D0(1488)
    CopyExpand_i8_i16((__int16 *)&result_1->HuffmanLUT1, HuffmanLut1);
    CopyExpand_i8_i16((__int16 *)&result_1->HuffmanLUT2, HuffmanLut2);
    CopyExpand_i8_i16((__int16 *)&result_1->HuffmanLUT3, HuffmanLut3); // 0xA50(2640)

This is the function that expands the data into the actual LUT used when compressing (it is likely the DHT data being expanded):

__int64 __fastcall CopyExpand_i8_i16(__int16 *Dst16, unsigned __int8 *Src8)
{
  int v2; // r8d
  __int64 i; // rbx
  unsigned __int8 *Next; // r11
  __int16 v6; // dx
  __int64 result; // rax
  __int64 v8; // r9
  __int64 v9; // r10

  v2 = 0;
  i = 0i64;
  Next = Src8 + 17;
  v6 = 1;
  do
  {
    result = Src8[i + 1];
    v8 = 0i64;
    v9 = result;
    if ( Src8[i + 1] )
    {
      do
      {
        result = Next[v8++];
        Dst16[2 * result] = v2++;
        Dst16[2 * result + 1] = v6;
      }
      while ( v8 < v9 );
    }
    ++i;
    Next += v9;
    ++v6;
    v2 *= 2;
  }
  while ( i < 16 );
  return result;
}
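
For readability, here is a hedged C++ restatement of what the decompiled routine appears to do: walk the 16 code-length counters and hand out consecutive codes, shifting left by one bit after each length. The names `HuffmanEntry` and `ExpandHuffmanTable` are hypothetical and not part of libsai or SAI2. A 256-entry table is used here for simplicity instead of the interleaved `__int16` pairs written by the original.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One entry per symbol: the Huffman code and its bit length.
// A length of 0 means the symbol does not appear in the table.
struct HuffmanEntry
{
    std::uint16_t Code   = 0;
    std::uint16_t Length = 0;
};

// Hypothetical cleaned-up equivalent of the decompiled CopyExpand_i8_i16.
// `Table` layout: [0] table id, [1..16] number of codes for each bit length
// 1..16, [17..] symbols in order of increasing code.
inline std::array<HuffmanEntry, 256> ExpandHuffmanTable(const std::vector<std::uint8_t>& Table)
{
    assert(Table.size() >= 17);
    std::array<HuffmanEntry, 256> Lut{};
    std::uint16_t Code = 0;
    std::size_t   Next = 17; // index of the next unread symbol
    for (std::uint16_t Length = 1; Length <= 16; ++Length)
    {
        const std::size_t Count = Table[Length];
        for (std::size_t n = 0; n < Count; ++n)
        {
            assert(Next < Table.size());
            const std::uint8_t Symbol = Table[Next++];
            Lut[Symbol].Code   = Code++;
            Lut[Symbol].Length = Length;
        }
        // Codes of the next length are one bit longer.
        Code = static_cast<std::uint16_t>(Code << 1);
    }
    return Lut;
}
```

Because the result is indexed by symbol and yields a code and length, it has the shape of an encoder-side table, which matches the "used when compressing" observation above.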

This is the region of DispatchInfo used by the DecompressJssfToken procedure, from offset 1488 (0x5D0) to 3664 (0xE50).

The actual memory layout of this region:

000005D0 HuffmanLUT0     dw 32 dup(?)
00000610 HuffmanLUT2     dw 32 dup(?)
00000650 HuffmanLUT1     dw 512 dup(?)
00000A50 HuffmanLUT3     dw 512 dup(?)
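
As a quick arithmetic check of the layout above, the four arrays can be mirrored in a struct and the offsets verified at compile time. This is only a sketch: `HuffmanLuts` and `HuffmanRegionBegin` are hypothetical names, and the field types are assumed from the `dw` (16-bit) listing.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the Huffman region inside DispatchInfo.
// Field order follows the listing above (note LUT2 precedes LUT1).
struct HuffmanLuts
{
    std::int16_t HuffmanLUT0[32];
    std::int16_t HuffmanLUT2[32];
    std::int16_t HuffmanLUT1[512];
    std::int16_t HuffmanLUT3[512];
};

// Offset of the region within DispatchInfo, taken from the listing.
constexpr std::size_t HuffmanRegionBegin = 0x5D0; // 1488

static_assert(HuffmanRegionBegin + offsetof(HuffmanLuts, HuffmanLUT2) == 0x610, "LUT2 offset");
static_assert(HuffmanRegionBegin + offsetof(HuffmanLuts, HuffmanLUT1) == 0x650, "LUT1 offset");
static_assert(HuffmanRegionBegin + offsetof(HuffmanLuts, HuffmanLUT3) == 0xA50, "LUT3 offset");
static_assert(HuffmanRegionBegin + sizeof(HuffmanLuts) == 3664, "end of region");
```

The region is 64 + 64 + 1024 + 1024 = 2176 bytes, which is exactly 3664 - 1488.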

Jul 09 '25 14:07 Wunkolo

These tables are exactly the same as those defined in the JPEG specification. I tried decoding the bitstream again and found that only one MCU-row is decoded. I did not notice this yesterday. Apparently, the images are encoded with restart intervals (DRI marker) but without outputting RSTm markers to the bitstream. On the left is a thumbnail from sai, on the right is from sai2. [Image: Snapshot]

Jul 09 '25 17:07 Mickommic

This might relate to how jssf data seems to be broken up into regular 4096-byte blocks like in the current testbed code, if you are not already handling that. I'd have to reach the point of implementing the jpeg stream decoding before I can verify.

Jul 09 '25 18:07 Wunkolo

Hello! Finally figured out JSSF. Before the bitstream of each MCU-row there is its length (2 bytes). That is: length, bitstream of the first MCU-row, length, bitstream of the second MCU-row and so on.
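
A minimal C++ sketch of that layout, for readers who do not read Pascal: it walks the length-prefixed rows and joins them with RSTm markers so that a standard decoder can resynchronize at each MCU-row. `JoinMcuRows` is a hypothetical name. The sketch assumes each length is a 16-bit little-endian count of the bitstream bytes that follow it, which is how the Pascal code below reads it on x86.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: concatenates length-prefixed MCU-row bitstreams and
// places an RSTm marker (FF D0..D7, cycling modulo 8) between consecutive rows.
// No marker is written after the last row; the caller appends EOI (FF D9).
inline std::vector<std::uint8_t> JoinMcuRows(const std::vector<std::uint8_t>& Src, std::size_t RowCount)
{
    std::vector<std::uint8_t> Out;
    std::size_t Pos = 0;
    for (std::size_t Row = 0; Row < RowCount; ++Row)
    {
        assert(Pos <= Src.size());
        if (Src.size() - Pos < 2) break; // truncated length field
        // Assumed: 16-bit little-endian length of this row's bitstream.
        const std::size_t Length = static_cast<std::size_t>(Src[Pos])
                                 | (static_cast<std::size_t>(Src[Pos + 1]) << 8);
        Pos += 2;
        if (Src.size() - Pos < Length) break; // truncated bitstream
        if (Row != 0)
        {
            Out.push_back(0xFF);
            Out.push_back(static_cast<std::uint8_t>(0xD0 + ((Row - 1) & 7)));
        }
        Out.insert(Out.end(),
                   Src.begin() + static_cast<std::ptrdiff_t>(Pos),
                   Src.begin() + static_cast<std::ptrdiff_t>(Pos + Length));
        Pos += Length;
    }
    return Out;
}
```

With 1x1 sampling factors each MCU-row covers 8 scanlines, so `RowCount` would be the height divided by 8, rounded up.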

Sai2ReadThumb

function Sai2ReadThumb(Dst: Pointer; Stream: TStream; Info: PiSai2Info; PixelStep, LineStep: Inteher): Boolean;
label Quit;
const                        //id  hsf/vsf  qt  id  hsf/vsf  qt  id  hsf/vsf  qt
  FFC0: array[0..8] of Byte = (1,   $11,    0,  2,   $11,    1,  3,   $11,    1);

                            //id   ht   id   ht   id   ht
  FFDA: array[0..5] of Byte = (1,   0,   2,  $11,  3,  $11);

(* uncomment this if your jpeg decoder doesn't know about the existence of typical Huffman tables
                            //     DHT     length
  FFC4: array[0..419] of Byte = ($ff,$c4,  1,$a2,
//   id ||          16 code length counters               ||   codes, the number of which is the sum of the length counters
      0,{}0, 1, 5, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, {}0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
      1,{}0, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, {}0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
    $10,{}0, 2, 1, 3, 3, 2, 4, 3, 5, 5, 4, 4, 0, 0, 1,125,{}$01,$02,$03,$00,$04,$11,$05,$12,$21,$31,$41,$06,$13,$51,$61,$07,$22,$71,$14,$32,$81,$91,$A1,$08,$23,$42,$B1,$C1,$15,$52,$D1,$F0,$24,$33,$62,$72,$82,$09,$0A,$16,$17,$18,$19,$1A,$25,$26,$27,$28,$29,$2A,$34,$35,$36,$37,$38,$39,$3A,$43,$44,$45,$46,$47,$48,$49,$4A,$53,$54,$55,$56,$57,$58,$59,$5A,$63,$64,$65,$66,$67,$68,$69,$6A,$73,$74,$75,$76,$77,$78,$79,$7A,$83,$84,$85,$86,$87,$88,$89,$8A,$92,$93,$94,$95,$96,$97,$98,$99,$9A,$A2,$A3,$A4,$A5,$A6,$A7,$A8,$A9,$AA,$B2,$B3,$B4,$B5,$B6,$B7,$B8,$B9,$BA,$C2,$C3,$C4,$C5,$C6,$C7,$C8,$C9,$CA,$D2,$D3,$D4,$D5,$D6,$D7,$D8,$D9,$DA,$E1,$E2,$E3,$E4,$E5,$E6,$E7,$E8,$E9,$EA,$F1,$F2,$F3,$F4,$F5,$F6,$F7,$F8,$F9,$FA,
    $11,{}0, 2, 1, 2, 4, 4, 3, 4, 7, 5, 4, 4, 0, 1, 2,119,{}$00,$01,$02,$03,$11,$04,$05,$21,$31,$06,$12,$41,$51,$07,$61,$71,$13,$22,$32,$81,$08,$14,$42,$91,$A1,$B1,$C1,$09,$23,$33,$52,$F0,$15,$62,$72,$D1,$0A,$16,$24,$34,$E1,$25,$F1,$17,$18,$19,$1A,$26,$27,$28,$29,$2A,$35,$36,$37,$38,$39,$3A,$43,$44,$45,$46,$47,$48,$49,$4A,$53,$54,$55,$56,$57,$58,$59,$5A,$63,$64,$65,$66,$67,$68,$69,$6A,$73,$74,$75,$76,$77,$78,$79,$7A,$82,$83,$84,$85,$86,$87,$88,$89,$8A,$92,$93,$94,$95,$96,$97,$98,$99,$9A,$A2,$A3,$A4,$A5,$A6,$A7,$A8,$A9,$AA,$B2,$B3,$B4,$B5,$B6,$B7,$B8,$B9,$BA,$C2,$C3,$C4,$C5,$C6,$C7,$C8,$C9,$CA,$D2,$D3,$D4,$D5,$D6,$D7,$D8,$D9,$DA,$E2,$E3,$E4,$E5,$E6,$E7,$E8,$E9,$EA,$F2,$F3,$F4,$F5,$F6,$F7,$F8,$F9,$FA
  );
*)
var
  i, j, k: Integer;
  c, h, w: SmallInt;
  buf: PiByteArray;
  pjd: PiJpegDec;
  tid: TiImageData;
begin
  Result:= False;

  if (Dst = nil) or (Stream = nil)
  or (Info = nil) or (Info^.ThumbOffset = 0) then Exit;

  Stream.Position:= Info^.ThumbOffset;
  Stream.Read(i, 4);//width
  Stream.Read(j, 4);//height
  Stream.Read(k, 4);

  if TiFourChar(k) <> 'jssf' then Exit;

  Stream.Read(w, 2);//width again
  Stream.Read(h, 2);//height again

  if (w <> i) or (h <> j) then Exit;

  Stream.Read(c, 2);//channels

  pjd:= nil;
  buf:= nil;
  k:= Info^.ThumbLength - (Ord(c > 1) * 64) - 82;
  i:= k + 180;
(* uncomment this if your jpeg decoder doesn't know about the existence of typical Huffman tables
  Inc(i, SizeOf(FFC4));
*)
  try
    GetMem(buf, i);
  except
    Exit;
  end;

  pjd:= JpegDecodeInit;

  if pjd = nil then goto Quit;

  PWord(buf)^:= $d8ff;//SOI - Start of image
  PWord(@buf^[2])^:= $dbff;//DQT - Define quantization table
  PWord(@buf^[4])^:= WordSwap(Ord(c > 1) * 65 + 67);
  buf^[6]:= 0;//id of first QT for luma

  if Stream.Read(buf^[7], 64) < 64 then goto Quit;//reading first QT

  i:= 71;

  if c > 1 then begin
    buf^[71]:= 1;//id of second QT for chroma

    if Stream.Read(buf^[72], 64) < 64 then goto Quit;//reading second QT

    Inc(i, 65);
  end;
(* uncomment this if your jpeg decoder doesn't know about the existence of typical Huffman tables
  MoveMem(@buf^[i], @FFC4, SizeOf(FFC4));//DHT - Define Huffman table
  Inc(i, SizeOf(FFC4));
*)
  PWord(@buf^[i])^:= $c0ff;//SOFn - Start of frame   (SOF0 - Baseline DCT)
  PWord(@buf^[i + 2])^:= WordSwap(c * 3 + 8);
  buf^[i + 4]:= 8;//bps
  PWord(@buf^[i + 5])^:= WordSwap(h);
  PWord(@buf^[i + 7])^:= WordSwap(w);
  buf^[i + 9]:= c;//channels

  MoveMem(@buf^[i + 10], @FFC0, c * 3);
  Inc(i, c * 3 + 10);

  PInteger(@buf^[i])^:= $400ddff;//DRI - Define restart interval
  PWord(@buf^[i + 4])^:= WordSwap((w + 7) shr 3);
  Inc(i, 6);

  PWord(@buf^[i])^:= $daff;//SOS - Start of Scan
  PWord(@buf^[i + 2])^:= WordSwap(c * 2 + 6);
  buf^[i + 4]:= c;//channels

  MoveMem(@buf^[i + 5], @FFDA, c * 2);
  Inc(i, c * 2 + 5);

  PInteger(@buf^[i])^:= $3f00;//Ss, Se, Ah/Al
  Inc(i, 3);
  w:= 0;

  if Stream.Read(c, 2) < 2 then goto Quit;//length of the bit stream of the first MCU-row

  repeat
    j:= c + 2;

    if Stream.Read(buf^[i], j) < j then Break;//reading jpeg bitstream (entropy)

    Inc(i, j);
    c:= PWord(@buf^[i - 2])^;//length of the bit stream of the next MCU-row
    PWord(@buf^[i - 2])^:= ((w and 7) shl 8) or $d0ff;//RSTm - Restart with modulo 8 count “m”
    Inc(w);
    Dec(h, 8);
  until h <= 0;

  PWord(@buf^[i - 2])^:= $d9ff;//EOI - End of image

//At this point feed buf to your preferred jpeg decoder

  k:= i;

  tid.Component[0]:= Dst;
  tid.Component[1]:= @PiByteArray(Dst)^[1];
  tid.Component[2]:= @PiByteArray(Dst)^[2];
  tid.SampleStep[0]:= PixelStep;
  tid.SampleStep[1]:= PixelStep;
  tid.SampleStep[2]:= PixelStep;
  tid.LineStep[0]:= LineStep;
  tid.LineStep[1]:= LineStep;
  tid.LineStep[2]:= LineStep;
  tid.BitDepth:= 8;

  repeat
    j:= JpegDecode(@buf^[k - i], i, pjd);

    if pjd^.Scan.DecodedLines.Count <> 0 then begin
      JpegDecodeComponents(tid, pjd);

      Inc(Inteher(tid.Component[0]), LineStep shl 3);
      Inc(Inteher(tid.Component[1]), LineStep shl 3);
      Inc(Inteher(tid.Component[2]), LineStep shl 3);
    end;
  until (j < 0) or (j > 4) or (i <= 0);

  Result:= True;
Quit:
  JpegDecodeFree(pjd);
  FreeMem(buf);
end;
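
The DRI step in the code above sets the restart interval to the number of MCUs in one MCU-row, which is why one RSTm marker per row is enough. Below is a small C++ sketch of just that segment. `MakeDriSegment` is a hypothetical name, and the 8-pixel MCU width assumes the 1x1 sampling factors used in the `FFC0` table.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: builds a DRI segment whose restart interval equals the
// number of 8x8 MCUs in one row, i.e. ceil(Width / 8).
inline std::vector<std::uint8_t> MakeDriSegment(std::uint16_t Width)
{
    assert(Width > 0);
    const std::uint16_t Interval = static_cast<std::uint16_t>((Width + 7u) >> 3);
    return {
        0xFF, 0xDD, // DRI marker
        0x00, 0x04, // segment length (big-endian, includes the length field)
        static_cast<std::uint8_t>(Interval >> 8),
        static_cast<std::uint8_t>(Interval & 0xFF),
    };
}
```

For a 100-pixel-wide thumbnail this gives an interval of 13 MCUs.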

Jul 10 '25 16:07 Mickommic

Thanks for this insight, learned a lot about jpeg encoding from this. I was able to extract jssf data directly into a jpeg stream by adding the appropriate markers and metadata around the entropy bytes. The next step would be to decode these thumbnails locally and get a regular RGB8 buffer so that it can be more generally useful. Currently, simpler jpeg decoders like stb_image do not properly support the RSTm marker and won't decode it properly. But it might be viable to implement RSTm support and maintain the changes locally.

Here is an extracted thumbnail from a sai2 document (jssf data restructured into a regular jpeg). [Image: gal-thumbnail] Because it uses RSTm markers, many jpeg decoders are not able to view or decode this image properly (except maybe the compliant jpeg decoder that your web browser uses, if you can see the image). If you try to save this image, you'll find that some image viewers, such as Windows Photos, MSPaint, or XnView, will not be able to open it properly. Still, this is a great starting point now that both of the primary thumbnail formats have been mapped out.

Jul 11 '25 07:07 Wunkolo

Hello! You can restore RSTm markers to the correct positions. See my code above.

Jul 11 '25 08:07 Mickommic

> You can restore RSTm markers to the correct positions. See my code above.

There might be some confusion here. I am already emitting RSTm markers the same way: https://github.com/Wunkolo/libsai/blob/a3ae3e49d9fa315bca33a19f92bbb338002e1856/samples/Thumbnail-Sai2.cpp#L239-L255

The problem is that there are Jpeg decoders that do not support RSTm.

Jul 11 '25 15:07 Wunkolo

> The problem is that there are Jpeg decoders that do not support RSTm.

RSTm is a standard marker. All JFIF compliant decoders should support it. I analyzed your image above: it is missing Huffman tables (DHT marker). Very few decoders know about the typical Huffman tables, so you must insert the tables if you want any decoder to be able to decode these images.
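
A hedged C++ sketch of inserting those tables: each `HuffmanLut` array posted earlier in the thread already appears to be in DHT payload form (table id, 16 length counters, then the symbols), so a DHT segment is just the marker, a big-endian length, and the tables back to back. `MakeDhtSegment` is a hypothetical name, not an existing libsai or stb_image function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: wraps one or more Huffman tables in a single DHT segment.
// Each table is: [0] id (0x00/0x01 = DC, 0x10/0x11 = AC), [1..16] counters,
// [17..] symbols, whose count must equal the sum of the counters.
inline std::vector<std::uint8_t> MakeDhtSegment(const std::vector<std::vector<std::uint8_t>>& Tables)
{
    std::size_t Length = 2; // the length field counts its own two bytes
    for (const auto& Table : Tables)
    {
        assert(Table.size() >= 17);
        std::size_t SymbolCount = 0;
        for (std::size_t i = 1; i <= 16; ++i) SymbolCount += Table[i];
        assert(Table.size() == 17 + SymbolCount);
        Length += Table.size();
    }
    assert(Length <= 0xFFFF);

    std::vector<std::uint8_t> Out = {
        0xFF, 0xC4, // DHT marker
        static_cast<std::uint8_t>(Length >> 8),
        static_cast<std::uint8_t>(Length & 0xFF),
    };
    for (const auto& Table : Tables)
        Out.insert(Out.end(), Table.begin(), Table.end());
    return Out;
}
```

With all four typical tables (29 + 29 + 179 + 179 bytes) the length field comes to 418 = $01A2, matching the `FFC4` constant in the Pascal code above. The segment belongs before the SOS marker.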

Jul 11 '25 17:07 Mickommic

That was it. stb_image is now also able to decode the images successfully after interpreting the DHT tables properly. Thanks! Just need to turn all this into proper library code now.

Jul 12 '25 02:07 Wunkolo

I've now deciphered both the jssf and dpcm formats and can extract both kinds of thumbnail from .sai2 documents. Which format is used seems to depend on the resolution of the original canvas when it was first saved (the threshold seems to be about 512x512).

[Image: explorer_ocfefD1v5s]

With that, the "Initial Sai2 support" scope of this PR is concluded, and I'll move on to other pushes that apply the same insight to extracting layer data. I'll be merging this PR soon so that I can make a new build of SaiThumbs.

Nov 22 '25 06:11 Wunkolo