Try to compress strings

GoogleCodeExporter opened this issue 10 years ago • 36 comments

We can try to create a separate cluster for strings and try to compress them...

Original issue reported on code.google.com by marianopeck on 29 May 2012 at 1:54

I'd like that very much. We have a graph in which we also store XML data. Not much, but it accumulates. There's a lot of whitespace, for instance (visible when looking at the .fuel file in an editor), so there's a lot of room for compression. Probably even a very fast, low-ratio compression could reduce file sizes greatly if there is a lot of text to be serialized.

Original comment by [email protected] on 20 Feb 2013 at 10:38

I took that idea from last night and implemented some quick deflate/inflate logic, so that now all that XML data is stored in a byte array instead. That gives me fuel files of 12.8MB to 35.9MB and an image size of 47.3MB to 70.4MB.

Granted, if we were to integrate something like this into Fuel, we'd have to use some mechanism like a threshold, because
1. we would want as few deflate operations as possible, and
2. fragments below a certain size actually grow when deflated (see the snippet below).
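
Point 2 is easy to check interactively with Pharo's built-in #zipped (which wraps zlib); a tiny illustration:

'ab' zipped size.                             "larger than 2: the zlib header and checksum dominate"
(String new: 10000 withAll: $a) zipped size.  "far smaller than 10000"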

So something like (very roughly):

ByteString>>serializeOn: anEncoder
    | stream |
    stream := ZLibWriteStream on: ByteArray new.
    stream
        nextPutAll: self asByteArray;
        close.
    "delegate serialization to the deflated ByteArray"
    stream encodedStream contents serializeOn: anEncoder
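
The materialization side would then have to inflate those bytes back into a string. A rough inverse (hypothetical; materializedBytes stands for the ByteArray materialized above):

"inflate the zlib payload and convert back to a string"
string := (ZLibReadStream on: materializedBytes from: 1 to: materializedBytes size)
    upToEnd asString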

Original comment by [email protected] on 21 Feb 2013 at 9:19

Nice!

I didn't understand this part:
"12.8MB to 35.9MB and an image size of 47.3MB to 70.4MB."

So the fuel file is reduced from 35MB to 12MB using this?
Why is the image size reduced?

What about the time performance? (#timeToRun is enough for me!)

Original comment by [email protected] on 21 Feb 2013 at 10:27

BTW
Mariano: what happened with your LZ4 work?
http://marianopeck.wordpress.com/2012/11/16/lz4-binding-for-pharo/

Original comment by [email protected] on 21 Feb 2013 at 10:32

Yes. To give you an estimate: the model contains 3068 documents with an average source string length of 10463 characters. The average length of the compressed strings is 2922 bytes.

See below.

Oops, I think you misunderstood me. I didn't implement anything in Fuel for this. I simply compressed all the XML strings in my model and compared the sizes of the fuel files and images with/without compression. But I might try to implement something like this quickly in Fuel (like in the example) and compare the runtimes. I'll let you know.

Original comment by [email protected] on 21 Feb 2013 at 11:52

I will answer quickly (then with more details). This issue was meant to add compression to the whole string cluster, that is, to compress ALL strings together (in one compression). I found this quite complicated and never found time to really implement it.

The other possibility is to compress EACH string... but of course, this gives much smaller ratios. Still, for particular cases, like bioinformatics, it is very useful. And here you don't need anything special from Fuel, just use the substitution hook.

Static way:

ByteString >> fuelAccept: aGeneralMapper
    "Substitute DNA sequences by their zipped version during serialization"
    (BioParser tokenizeFasta: self) second isDNASequence
        ifTrue: [
            aGeneralMapper
                visitSubstitution: self
                by: self zipped
                onRecursionDo: [ super fuelAccept: aGeneralMapper ] ]
        ifFalse: [ super fuelAccept: aGeneralMapper ]



Dynamic way:


objectToSerialize := Array
    with: 'hello'
    with: (FileStream readOnlyFileNamed: 'GGA28.fa') contents.
threshold := 1000.

FileStream forceNewFileNamed: 'demo.fuel' do: [ :aStream |
    aSerializer := FLSerializer newDefault.
    "compress large strings that are not already zipped
    (isZipped is assumed to answer whether the receiver already holds zlib data)"
    aSerializer analyzer
        when: [ :o | o isString and: [ o size > threshold and: [ o isZipped not ] ] ]
        substituteBy: [ :o | o zipped ].
    aSerializer
        serialize: objectToSerialize
        on: aStream binary ].

result := FileStream oldFileNamed: 'demo.fuel' do: [ :aStream |
    aMaterialization := FLMaterializer newDefault
        materializeFrom: aStream binary.
    zippedStrings := aMaterialization objects
        select: [ :o | o isString and: [ o isZipped ] ].
    unzippedStrings := zippedStrings collect: [ :o | o unzipped ].
    "swap the zipped strings for their unzipped versions in place"
    zippedStrings elementsExchangeIdentityWith: unzippedStrings.
    aMaterialization root ].



And yes, I recommend using LZ4 for this since it gives good enough compression in a very, very short time.

Original comment by marianopeck on 21 Feb 2013 at 11:59

Hm, what was the problem, do you recall? Because at first glance it seems pretty straightforward:

stream := ZLibWriteStream on: ByteArray new.
cluster objects do: [ :string |
    stream nextPutAll: string asByteArray ].
stream close.

bytesToSerialize := stream encodedStream contents.


Or something like this...
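
For materialization, the inverse would have to cut the inflated payload back into individual strings, which already hints at the index problem raised in the next comment. A sketch, assuming a hypothetical sizes collection that records each string's byte count:

"inflate the whole cluster payload, then slice it back into strings"
bytes := (ZLibReadStream on: bytesToSerialize from: 1 to: bytesToSerialize size) upToEnd.
position := 1.
strings := sizes collect: [ :size |
    | string |
    string := (bytes copyFrom: position to: position + size - 1) asString.
    position := position + size.
    string ].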

Original comment by [email protected] on 21 Feb 2013 at 12:10

Hi Max. The problem was related to the "indexes"... In other words, while the graph is being visited during analysis/serialization, you record certain offsets/indexes/positions for the visited strings... then you compress, so the cluster gets smaller. Then, during materialization, whenever I needed to materialize a string it was difficult, because the indexes were effectively shifted.

Maybe there is a workaround....

Original comment by marianopeck on 21 Feb 2013 at 12:15

Ah yes, I see. Maybe there's a need for pre-analysis hooks. But as you wrote, 
most of this can be done manually, especially if you know that you have large 
amounts of uniform data.

Original comment by [email protected] on 21 Feb 2013 at 12:19

Hi Max. Needing a pre-analysis hook is not the big problem. The big complexity is how to compress all the strings of the cluster together rather than compressing each string separately (as they do in bioinformatics and as I posted above).

But if you want to give it a try Max, please be my guest. Sometimes new blood 
just works better :)

Original comment by marianopeck on 21 Feb 2013 at 9:16

I've been thinking about this and I'd like to give it a try. Might be a while 
though, since this is really not a pressing issue. 

Original comment by [email protected] on 22 Feb 2013 at 7:48

Please go ahead. And let me know how it goes :) Basically, the idea is to be able to compress/uncompress the strings of the cluster all together. And the same for symbols.
For the first step, don't worry about the compressor, use whatever. Then, if it works, I will give it a try with LZ4 :)
That would be supercool.

Original comment by marianopeck on 23 Feb 2013 at 4:04

[deleted comment]

Hi, I hacked together a very rough version (really a proof of concept only) with an arbitrary encoding strategy. Load all the attachments and try it with:

o := Dictionary new
    add: 1 -> 'bar';
    add: 2 -> { 'foo'. 'baz' };
    yourself.

FLSerializer serialize: o toFileNamed: 'foo'.
FLMaterializer materializeFromFileNamed: 'foo'


Seems to work :)

Note that I simply chose the path of least resistance by subclassing a cluster. There's probably a better way.

Original comment by [email protected] on 24 Feb 2013 at 5:38

Attachments:

Hi Max. I took a look at the code. It looks quite similar to what I did some time ago. But I don't know why yours seems to work while mine didn't :)

What about doing the following (for TRUNK, not 1.9)? A rough sketch of the strategy classes follows after the list.

1) Add the FLByteStringCluster.
2) Make both String and Symbol use FLByteStringCluster.
3) Make FLByteStringCluster delegate to a CompressorStrategy, to which we send the string and which answers the string to really write on the stream.
4) We do a concrete subclass of CompressorStrategy called NoCompressionStrategy and use that by default. It will just answer the same string.
5) We add a class-side setter to FLByteStringCluster to set other compressors, and we write a ZLib compressor subclass.
6) Then we can do an LZ4 compressor subclass :)
7) Create one subclass of FLStreamStrategy per compressor type. This way, we can run all tests for a particular compressor and see if it works. Look at previous versions of FuelCompression; you can take a look at FLGZipStrategy.
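
A minimal sketch of what steps 3) to 5) could look like (class and selector names are illustrative, not actual FuelExperiments code):

FLCompressorStrategy >> compress: aByteArray
    "Answer the bytes to actually write on the stream"
    ^ self subclassResponsibility

FLNoCompressionStrategy >> compress: aByteArray
    "Default strategy: answer the bytes unchanged"
    ^ aByteArray

FLZLibCompressorStrategy >> compress: aByteArray
    "Deflate the bytes with zlib before they are written"
    | stream |
    stream := ZLibWriteStream on: ByteArray new.
    stream
        nextPutAll: aByteArray;
        close.
    ^ stream encodedStream contents

An LZ4 subclass (step 6) would then only need to override #compress: (and a matching #decompress:) with calls into the LZ4 binding.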

What do you think?

Thanks Max, this was pretty coooooool!!!!

Original comment by marianopeck on 24 Feb 2013 at 7:17

I'm just lucky :)

Sounds good to me.

My pleasure!

Original comment by [email protected] on 24 Feb 2013 at 8:00

Excellent! Sounds very good.
Then I will take a look.
For now, I think that everything in "trunk" (i.e. the main repo) can be in 1.9... so can we add this code to trunk after 1.9 is released? Or should we alternatively put it in FuelExperiments?

Original comment by [email protected] on 24 Feb 2013 at 8:24

If I start to implement something I'll put it into the experiments repo.

Original comment by [email protected] on 24 Feb 2013 at 9:00

Be careful, because SqueakSource became read-only, and FuelExperiments is on SS. We should create a FuelExperiments repo on SmalltalkHub and migrate it there...

Original comment by marianopeck on 25 Feb 2013 at 12:19

I already committed to experiments. The only thing you can't do is create new 
projects. 

Original comment by [email protected] on 25 Feb 2013 at 7:10

Cool, I'm anxious to benchmark it!

Original comment by [email protected] on 25 Feb 2013 at 10:12

Max, you are right, I am always confused about that :)
I am also very anxious to test. Maybe we can just use the benchmarks/samples for strings/symbols.
I also want to test with LZ4. It is quite easy to test; in fact, the readme explains everything: http://smalltalkhub.com/#!/~marianopeck/LZ4/


Original comment by marianopeck on 25 Feb 2013 at 10:56

Martin made an interesting suggestion yesterday: the compression could also be made pluggable by passing different streams to FLSerializer (like we already try experimentally with GZip).

Although I like the idea for its simplicity, after having given it some thought I don't think it's flexible enough:

1. The user would have to provide the correct stream if he doesn't use the class-side methods.
2. A compressing stream would compress *all* contents, which would consume a lot of time and slow down Fuel.
3. If we used a stream wrapper to select the objects that we want to compress (like strings) from the objects that we don't, that would be feasible but would put the responsibility in the hands of the wrong objects (in my opinion). Neither streams nor en-/decoders should be concerned with the data they write, only with the writing itself.

I will therefore continue working with Mariano's proposal for the implementation for now.

Original comment by [email protected] on 27 Feb 2013 at 10:08

Well, I also agree with my idea hahahah (thank God hahah). What Martin proposes is already provided out of the box, since it has nothing to do with Fuel itself: just pass around a compression stream and that's all. In fact, that's what FuelCompression used to do :)

But for the reasons you mention above, I think giving our other alternative a try is worthwhile!

Original comment by marianopeck on 27 Feb 2013 at 11:26

I don't disagree! On the contrary, I think it's cool to experiment with these ideas.

Original comment by [email protected] on 6 Mar 2013 at 3:52

OK. I followed Max's idea and I found a few problems and possible improvements.

- There was a bug with strings bigger than 255 characters because we used only 1 byte to store the size. We now use either one byte or four (see the sketch below). If someone could improve this even more, then cool. Most strings will fit in 1 byte, so that's cool.

- The [0000] mark was unnecessary and it meant 4 extra bytes PER string.

- Now it supports both strings and symbols.
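
A minimal sketch of one possible 1-or-4-byte size encoding (illustrative only; not necessarily the exact scheme committed to the repo):

encodeSize: anInteger on: aStream
    "Sizes below 128 fit in one byte; larger sizes use 4 bytes,
    big-endian, with the top bit set as a marker"
    anInteger < 128
        ifTrue: [ aStream nextPut: anInteger ]
        ifFalse: [
            | marked |
            marked := anInteger bitOr: 16r80000000.
            #(24 16 8 0) do: [ :shift |
                aStream nextPut: ((marked bitShift: shift negated) bitAnd: 16rFF) ] ]

decodeSizeFrom: aStream
    | first size |
    first := aStream next.
    first < 128 ifTrue: [ ^ first ].
    "clear the marker bit and read the remaining 3 bytes"
    size := first bitAnd: 16r7F.
    3 timesRepeat: [ size := (size bitShift: 8) + aStream next ].
    ^ size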

Original comment by marianopeck on 2 Oct 2013 at 10:20

BTW, I committed to http://smalltalkhub.com/mc/Fuel/Experiments/main

Original comment by marianopeck on 2 Oct 2013 at 10:21

This issue has been automatically marked as stale because it has not had recent activity. It will remain open but will probably not come into focus. If you still think this should receive some attention, leave a comment. Thank you for your contributions.

Comment by stale[bot] on 18 May 2021 at 22:05

@tinchodias @marianopeck We should totally do this! This can be such a big improvement, and with 4.0.0 we can make it configurable very easily.

Comment by theseion on 30 Oct 2021 at 17:10

Wow, I don't remember much about this feature, but it's great to have the discussion from 2013. I re-read it now. Would you recover the old code, or implement it from scratch?

Comment by tinchodias on 5 Nov 2021 at 20:11