JPEG parser

Open akvadrako opened this issue 13 years ago • 11 comments

This parses EXIF and JFIF files. This is my first construct, and while making it I noticed I was missing a couple of things:

  1. a PascalString that includes the size bytes in the length - this is fairly common in protocols. That is what the "- 2" below is for.
  2. Embed() on a Switch() doesn't work very well - because it discards the non-struct substructs.
  3. FastReader() below improves speed by about 10X.

class FastReader(Construct): def _parse(self, stream, context): return stream.read()

def _build(self, obj, stream, context):
    stream.write(obj)

SegBody = Struct(None, UBInt16('size'), Field('data', lambda ctx: ctx['size'] - 2), )

Seg = Struct('seg', Literal('\xff'), Byte('kind'), Switch('body', lambda c: c['kind'], { SOS: FastReader('data'), },
default = Embed(SegBody), ) )

JPEG = Struct('jpeg', Literal('\xff\xd8'), GreedyRange(Seg), )

akvadrako, May 22 '11 12:05

and formatted more nicely:

class FastReader(Construct):
    def _parse(self, stream, context):
        return stream.read()

    def _build(self, obj, stream, context):
        stream.write(obj)

SegBody = Struct(None,
        UBInt16('size'),
        Field('data', lambda ctx: ctx['size'] - 2),
    )   

Seg = Struct('seg',
        Literal('\xff'),
        Byte('kind'),
        Switch('body', lambda c: c['kind'],
            {
                SOS: FastReader('data'),
            },  
            default = Embed(SegBody),
            )
        )   

JPEG = Struct('jpeg',
        Literal('\xff\xd8'),
        GreedyRange(Seg),
        )
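
For readers without construct at hand, the same layout can be sketched in plain Python 3. This is only an illustration of the structure described above, not the construct API: `parse_segments` is a hypothetical helper, and `SOS` is taken to be the standard start-of-scan marker `0xDA`, which the post uses but does not define.

```python
import io
import struct

SOI = b"\xff\xd8"   # start-of-image marker
SOS = 0xDA          # start-of-scan segment kind (assumed; not defined in the post)

def parse_segments(data):
    """Yield (kind, payload) pairs from a JPEG byte string."""
    stream = io.BytesIO(data)
    if stream.read(2) != SOI:
        raise ValueError("missing SOI marker")
    while True:
        marker = stream.read(2)
        if len(marker) < 2:
            return                      # ran out of data
        if marker[0] != 0xFF:
            raise ValueError("expected 0xFF at segment start")
        kind = marker[1]
        if kind == SOS:
            # Like FastReader: take everything up to EOF in one read.
            yield kind, stream.read()
            return
        (size,) = struct.unpack(">H", stream.read(2))
        # The size field counts its own two bytes, hence the "- 2".
        yield kind, stream.read(size - 2)
```

Usage would look like `for kind, payload in parse_segments(open(path, "rb").read()): ...`, where `path` is any JPEG file.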

akvadrako, May 22 '11 12:05

Hi,

I'm not sure about the FastReader, as I don't grok that section of Construct yet.

There is a PascalString, in construct.macros, which takes a length_field as a kwarg. An example usage:

>>> from construct import PascalString, UBInt16
>>> s = PascalString("hurp", length_field=UBInt16("length"))
>>> s.parse("\x00\x05Hello")
'Hello'

Thanks for your comments. Let me know if you have any patches you wish to contribute.

MostAwesomeDude, May 23 '11 21:05

Hi - the issue with PascalString is that its length field doesn't count the bytes that make up the length field itself. In several protocols we get fields like 0x0004babe, where the length (4) includes the first 2 bytes.
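
To make the difference concrete, here is a minimal plain-Python sketch of a self-inclusive length prefix. It is not construct code; `parse_inclusive` and `build_inclusive` are hypothetical names, and a 2-byte big-endian length is assumed to match the 0x0004babe example.

```python
import struct

def parse_inclusive(data):
    """Split data into (payload, rest) for a 2-byte big-endian length
    prefix whose value counts the prefix itself."""
    (length,) = struct.unpack(">H", data[:2])
    if length < 2 or length > len(data):
        raise ValueError("bad length field")
    return data[2:length], data[length:]

def build_inclusive(payload):
    """Prefix payload with its length plus the 2 bytes of the prefix."""
    return struct.pack(">H", len(payload) + 2) + payload
```

With these, `parse_inclusive(b"\x00\x04\xba\xbe")` yields the two payload bytes `b"\xba\xbe"` and an empty remainder, and `build_inclusive(b"\xba\xbe")` reproduces the original four bytes.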

akvadrako, May 24 '11 07:05

@akvadrako: this could be done like so

>>> s=PascalString("data", ExprAdapter(ULInt16("length"), 
...    lambda val, ctx: val + 2, lambda val, ctx: val - 2))
>>> s.parse("\x05\x00helloxxxx")
'hel'
>>> s.build("foo")
'\x05\x00foo'

on the other hand, your straightforward solution is better.

as for your FastReader class -- i would consider it bad design. i understand you simply wanted to read everything in, but it's not predictable (you can't tell how much it will read or write) and thus not symmetric. for instance, the following construct would work only in one direction:

Struct("a", 
    FastReader("blob"),
    UBInt32("x"),
)

you would be able to build anything you want, but you'll never be able to parse it back.
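
The asymmetry can be shown without construct at all. The following is a plain-Python sketch, with hypothetical `build` and `parse` helpers standing in for the struct above (a greedy blob followed by a 32-bit big-endian integer).

```python
import io
import struct

def build(blob, x):
    # Building is unambiguous: write the blob, then the 4-byte integer.
    return blob + struct.pack(">I", x)

def parse(data):
    stream = io.BytesIO(data)
    blob = stream.read()     # greedy read swallows the integer too
    tail = stream.read(4)    # nothing is left for "x"
    return blob, tail

data = build(b"abc", 7)
blob, tail = parse(data)
```

After this runs, `blob` holds all seven bytes and `tail` is empty, so the integer can no longer be recovered as a separate field.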

tomerfiliba, May 27 '11 19:05

I suggested a variant to PascalString because length+data is common in network protocols and apparently JPEG too.

FastReader is the best we can do with construct's internals. Your example wouldn't work with RepeatUntil and Range either. I'm not sure it should, since constructs would need to know about the constructs that follow them, and you'd get ambiguity:

Struct("a", 
    GreedyRange("b"),
    GreedyRange("c"),
)

Probably better to make a FastReadUntil('BOUNDARY').
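
FastReadUntil does not exist in the thread beyond this suggestion. As a rough plain-Python sketch of what such a reader might do, assuming a seekable stream, the hypothetical `read_until` below reads up to a boundary and leaves the stream positioned at it, so the following field can still be parsed.

```python
import io

def read_until(stream, boundary, chunk_size=4096):
    """Return the bytes before `boundary` and leave the stream positioned
    at the boundary. If it never occurs, return everything up to EOF."""
    start = stream.tell()
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return buf       # boundary never found; stream is at EOF
        # Re-scan the tail of the old buffer in case the boundary
        # straddles two chunks.
        search_from = max(0, len(buf) - len(boundary) + 1)
        buf += chunk
        idx = buf.find(boundary, search_from)
        if idx != -1:
            stream.seek(start + idx)
            return buf[:idx]
```

Because the stream ends up at the boundary rather than at EOF, two such readers with different boundaries could follow each other without the ambiguity of two greedy ranges.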

akvadrako, May 28 '11 10:05

Length + data is perfectly serviced by PascalString; the case where the length of the length is included in the length is actually rather uncommon though. Maybe a new String subclass is needed for it.

As far as "fast" reading, why not examine other optimizations first? There are optimization opportunities in Construct core, I think.

MostAwesomeDude, May 28 '11 18:05

@MostAwesomeDude: no need to subclass, it would be much simpler to just define an InclusivePascalString "macro" that takes care of subtracting/adding the size of the length field from the length.

@akvadrako: your "fast" reader isn't any faster than the plain old Field except that it doesn't check the length. since this greedy construct can only appear once at the end of a data structure, i don't suppose it would make much difference in terms of speed. also, my tests back in the day showed that psyco can speed up parsing tenfold.

on the other hand, as you said, it poses a problem of breaking the symmetry between parsing and building... but i think it's inherent to the pattern and there isn't any real solution.

tomerfiliba, May 28 '11 18:05

it's much faster - construct is unusable for parsing JPEG images without it - where 99% of the data is an unbounded blob at the end of the file.

akvadrako, May 28 '11 21:05

if you're using GreedyRange, then yes, it would be much faster. i was talking about Field. on the other hand, Field must have a predetermined length, so it's not suitable for your purpose.

what do you mean, though, that 99% of the file is a blob? doesn't it have an internal structure? if so, i assume you have no real interest in it, so you may want to use OnDemand, so it will actually be read only when asked for.

tomerfiliba, May 29 '11 13:05

Yes, you are correct. OnDemand doesn't help though, because it requires a known length.

akvadrako, May 29 '11 14:05

well, i just had an idea: assuming you're working on a file/StringIO, you can write a construct that simply returns the remaining length till EOF. e.g.

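# body of the construct's _parse(self, stream, context):
p = stream.tell()      # remember the current position
stream.seek(0, 2)      # jump to EOF (whence=2)
p2 = stream.tell()     # absolute offset of EOF
stream.seek(p)         # restore the original position
return p2 - p          # number of bytes remaining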

and then you could combine it with Field and OnDemand.
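
A plain-Python sketch of that combination, for illustration only: `remaining_length` wraps the snippet above, and `LazyBlob` is a hypothetical stand-in for what OnDemand would provide, remembering where the blob lives and reading it only when asked. Neither is construct API.

```python
import io

def remaining_length(stream):
    """Number of bytes between the current position and EOF."""
    pos = stream.tell()
    stream.seek(0, io.SEEK_END)
    end = stream.tell()
    stream.seek(pos)
    return end - pos

class LazyBlob:
    """Remember where a blob lives; read it only when .value is requested."""
    def __init__(self, stream, length):
        self.stream = stream
        self.offset = stream.tell()
        self.length = length
        stream.seek(self.offset + length)   # skip the blob without reading it

    @property
    def value(self):
        saved = self.stream.tell()
        self.stream.seek(self.offset)
        data = self.stream.read(self.length)
        self.stream.seek(saved)             # leave the stream where it was
        return data
```

Parsing then costs only a few seeks for the trailing blob, and the bytes are copied into memory only if `blob.value` is actually used.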

tomerfiliba, May 30 '11 09:05