Fuel
Try to compress strings
We could create a separate cluster for strings and compress them...
Original issue reported on code.google.com by marianopeck on 29 May 2012 at 1:54
I'd like that very much. We have a graph in which we also store xml data. Not
much but it accumulates. There's a lot of whitespace for instance (which is
visible when looking at the .fuel file in an editor) so there's a lot of room
for compression. Probably even a very fast / low-rate compression could reduce
file sizes greatly if there is a lot of text to be serialized.
Original comment by [email protected] on 20 Feb 2013 at 10:38
I took that idea from last night and implemented a quick deflate / inflate
logic, so that now all that xml data is stored in a byte array instead. That
gives me fuel files of 12.8MB to 35.9MB and an image size of 47.3MB to 70.4MB.
Granted, if we were to integrate something like this into Fuel, we'd have to
use some mechanism like a threshold, because
1. we would want as few deflate operations as possible
2. fragments below a certain size actually grow when deflated
So something like (very roughly):
ByteString >> serializeOn: anEncoder
	| stream |
	stream := ZLibWriteStream on: ByteArray new.
	stream
		nextPutAll: self asByteArray;
		close.
	"delegate serialization of the compressed bytes to ByteArray"
	stream encodedStream contents serializeOn: anEncoder
Original comment by [email protected] on 21 Feb 2013 at 9:19
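The threshold rationale above (point 2: tiny fragments grow under deflate) is easy to check outside of Pharo. A minimal Python sketch using zlib and made-up data; the helper name and threshold value are illustrative, not from the thread:

```python
import zlib

def deflate_if_worthwhile(data: bytes, threshold: int = 64) -> bytes:
    """Compress only when the payload is large enough to plausibly shrink."""
    if len(data) < threshold:
        return data
    compressed = zlib.compress(data)
    # Fall back to the original if deflate actually grew the fragment.
    return compressed if len(compressed) < len(data) else data

small = b"id"  # tiny fragment: zlib header + checksum overhead dominates
big = b"<xml>  lots of whitespace  </xml>" * 200

print(len(zlib.compress(small)) > len(small))      # True: small fragments grow
print(len(deflate_if_worthwhile(big)) < len(big))  # True: large text shrinks
```

The fixed per-stream overhead (2-byte zlib header plus 4-byte Adler-32 trailer) is what makes a size threshold worthwhile.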
Nice!
I didn't understand this part:
"12.8MB to 35.9MB and an image size of 47.3MB to 70.4MB."
So the fuel file is reduced from 35MB to 12MB using this?
Why is the image size reduced?
What about the time performance? (#timeToRun is enough for me!)
Original comment by [email protected] on 21 Feb 2013 at 10:27
BTW
Mariano: what happened with your LZ4 work?
http://marianopeck.wordpress.com/2012/11/16/lz4-binding-for-pharo/
Original comment by [email protected] on 21 Feb 2013 at 10:32
Yes. To give you an estimate: the model contains 3068 documents with an average
source string length of 10463 characters. The average length of the compressed
strings is 2922 bytes.
See below.
Oops, I think you misunderstood me. I didn't implement anything in Fuel for
this. I simply compressed all the xml strings in my model and compared the
sizes of the fuel files and images with / without compression. But I might try
to implement something like this quickly in Fuel (like in the example) and
compare the runtimes. I'll let you know.
Original comment by [email protected] on 21 Feb 2013 at 11:52
I will answer quickly (then with more details). This issue was meant to add
compression to the whole string cluster, that is, to compress ALL strings
together (in one compression). I found this quite complicated and never found
time to really implement it.
The other possibility is to compress EACH string... but of course, this gives
much smaller ratios. Still, for particular cases like bioinformatics, it is
very useful. And here you don't need anything special from Fuel, just use
the substitution hook.
Static way:
ByteString >> fuelAccept: aGeneralMapper
	(BioParser tokenizeFasta: self) second isDNASequence
		ifTrue: [
			aGeneralMapper
				visitSubstitution: self
				by: self zipped
				onRecursionDo: [ super fuelAccept: aGeneralMapper ] ]
		ifFalse: [ super fuelAccept: aGeneralMapper ]
Dynamic way:
objectToSerialize := Array
	with: 'hello'
	with: (FileStream readOnlyFileNamed: 'GGA28.fa') contents.
threshold := 1000.
FileStream forceNewFileNamed: 'demo.fuel' do: [ :aStream |
	aSerializer := FLSerializer newDefault.
	aSerializer analyzer
		when: [ :o | o isString and: [ o size > threshold and: [ o isZipped not ] ] ]
		substituteBy: [ :o | o zipped ].
	aSerializer
		serialize: objectToSerialize
		on: aStream binary ].
result := FileStream oldFileNamed: 'demo.fuel' do: [ :aStream |
	aMaterialization := FLMaterializer newDefault materializeFrom: aStream binary.
	zippedStrings := aMaterialization objects
		select: [ :o | o isString and: [ o isZipped ] ].
	unzippedStrings := zippedStrings collect: [ :o | o unzipped ].
	zippedStrings elementsExchangeIdentityWith: unzippedStrings.
	aMaterialization root ].
And yes, I recommend using LZ4 for this since it gives good enough
compression in a very, very short time.
Original comment by marianopeck on 21 Feb 2013 at 11:59
Hm, what was the problem, do you recall? Because at first glance it seems
pretty straightforward:
stream := ZLibWriteStream on: ByteArray new.
cluster objects do: [ :string |
	stream nextPutAll: string asByteArray ].
stream close.
bytesToSerialize := stream encodedStream contents.
Or something like this...
Original comment by [email protected] on 21 Feb 2013 at 12:10
Hi Max. The problem was related to the "indexes"... In other words, while the
graph is visited during analysis/serialization, you record certain
offsets/indexes/positions for the visited strings... then you compress, so the
cluster is smaller. Then, during materialization, when I needed to materialize
a string it was difficult, because the recorded indexes were effectively
shifted.
Maybe there is a workaround....
Original comment by marianopeck on 21 Feb 2013 at 12:15
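One workaround along the lines hinted at here: interpret the recorded positions against the decompressed cluster payload rather than against the file, so compression changes the file layout but not the indexes. A small Python sketch of that idea (hypothetical data, not Fuel's actual format):

```python
import zlib

# Record each string's offset inside the *uncompressed* cluster payload...
strings = [b"alpha", b"beta", b"gamma"]
offsets, payload = [], b""
for s in strings:
    offsets.append(len(payload))
    payload += s

compressed = zlib.compress(payload)
# ...file-level positions shift because the compressed blob has a different
# size, but the offsets stay valid when applied to the decompressed payload.
restored = zlib.decompress(compressed)
assert [restored[o:o + len(s)] for o, s in zip(offsets, strings)] == strings
```

The cost is that the whole cluster must be decompressed before any single string can be materialized.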
Ah yes, I see. Maybe there's a need for pre-analysis hooks. But as you wrote,
most of this can be done manually, especially if you know that you have large
amounts of uniform data.
Original comment by [email protected] on 21 Feb 2013 at 12:19
Hi Max. Needing a pre-analysis hook is not the big problem. The big complexity
is being able to compress all the strings of the cluster together rather
than compressing each string separately (as they do in bioinformatics and as I
posted above).
But if you want to give it a try Max, please be my guest. Sometimes new blood
just works better :)
Original comment by marianopeck on 21 Feb 2013 at 9:16
I've been thinking about this and I'd like to give it a try. Might be a while
though, since this is really not a pressing issue.
Original comment by [email protected] on 22 Feb 2013 at 7:48
Please go ahead. And let me know how it goes :) Basically the idea is to be
able to compress/uncompress the strings of the cluster all together. And the
same for symbols.
For the first step, don't worry about which compressor to use; use whatever
works. Then, if it works, I will give it a try with LZ4 :)
That would be supercool.
Original comment by marianopeck on 23 Feb 2013 at 4:04
[deleted comment]
I hacked together a very rough version (really a proof of concept only) with
an arbitrary encoding strategy. Load all the attachments and try it with:
o := Dictionary new
	add: 1 -> 'bar';
	add: 2 -> { 'foo'. 'baz' };
	yourself.
FLSerializer serialize: o toFileNamed: 'foo'.
FLMaterializer materializeFromFileNamed: 'foo'
Seems to work :)
Note that I simply chose the way of least resistance by subclassing a cluster.
There's probably a better way.
Original comment by [email protected] on 24 Feb 2013 at 5:38
Hi Max. I took a look at the code. It looks quite similar to what I did some
time ago. But I don't know why yours seems to work while mine didn't :)
What about doing the following (for TRUNK, not 1.9):
1) Add the FLByteStringCluster.
2) Make both string and symbol use ByteStringCluster.
3) Make ByteStringCluster delegate to a CompressorStrategy, to which we send
the string and which answers the string to actually write on the stream.
4) Create a concrete subclass of CompressorStrategy called NoCompressionStrategy
and use it by default. It will just answer the same string.
5) Add a class-side setter to ByteStringCluster to set other compressors,
and write a ZLib compressor subclass.
6) Then we can do an LZ4 compressor subclass :)
7) Create one subclass of FLStreamStrategy per compressor type. This way, we
can run all tests for a particular compressor and see if it works. Look at
previous versions of FuelCompression, and take a look at FLGZipStrategy.
What do you think?
Thanks Max, this was pretty coooooool!!!!
Original comment by marianopeck on 24 Feb 2013 at 7:17
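Steps 3–5 of the proposal amount to a classic strategy pattern. A compact Python sketch of the shape; the class names mirror the proposal, but the interface and zlib stand-in are assumptions, not the actual Fuel design:

```python
import zlib

class CompressorStrategy:
    """What a (hypothetical) byte-string cluster would delegate to."""
    def compress(self, data: bytes) -> bytes:
        return data
    def decompress(self, data: bytes) -> bytes:
        return data

class NoCompressionStrategy(CompressorStrategy):
    """Default: write strings through unchanged."""

class ZLibCompressionStrategy(CompressorStrategy):
    def compress(self, data: bytes) -> bytes:
        return zlib.compress(data)
    def decompress(self, data: bytes) -> bytes:
        return zlib.decompress(data)

# Any strategy must round-trip the cluster's bytes unchanged.
payload = b"the same bytes round-trip through any strategy" * 10
for strategy in (NoCompressionStrategy(), ZLibCompressionStrategy()):
    assert strategy.decompress(strategy.compress(payload)) == payload
```

The round-trip invariant at the end is exactly what a per-compressor test suite (step 7) would exercise.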
I'm just lucky :)
Sounds good to me.
My pleasure!
Original comment by [email protected] on 24 Feb 2013 at 8:00
Excellent! Sounds very good.
Then I will take a look.
For now, I think that everything in "trunk" (i.e. the main repo) can go into
1.9... so can we add this code to trunk after 1.9 is released? Or alternatively
put it in FuelExperiments?
Original comment by [email protected] on 24 Feb 2013 at 8:24
If I start to implement something I'll put it into the experiments repo.
Original comment by [email protected] on 24 Feb 2013 at 9:00
Be careful, because SqueakSource became read-only, and FuelExperiments is on
SS. We should create a FuelExperiments repo on SmalltalkHub and migrate it
there...
Original comment by marianopeck on 25 Feb 2013 at 12:19
I already committed to experiments. The only thing you can't do is create new
projects.
Original comment by [email protected] on 25 Feb 2013 at 7:10
Max, you are right, I am always confused about that :)
I am also very anxious to test. Maybe we can just use the benchs/samples for
strings/symbols.
I also want to test with LZ4. It is quite easy to test; in fact, the readme
explains everything: http://smalltalkhub.com/#!/~marianopeck/LZ4/
Original comment by marianopeck on 25 Feb 2013 at 10:56
Martin made an interesting suggestion yesterday. The compression could also be
made pluggable by passing different streams to FLSerializer (like we already
try experimentally with GZip).
Although I like the idea for its simplicity, after having given it some thought
I don't think it's flexible enough.
1. the user would have to provide the correct stream if they don't use the
class-side methods
2. a compressing stream would compress *all* contents which would consume a lot
of time and slow down Fuel
3. if we'd use a stream wrapper to make a selection of objects that we want to
compress (like strings) and objects that we don't want to compress, that would
be feasible but put the responsibility in the hands of the wrong objects (in my
opinion). Neither streams nor en- / decoders should be concerned about the data
they write but only with the writing itself.
I therefore will continue working with Mariano's proposal for the
implementation for now.
Original comment by [email protected] on 27 Feb 2013 at 10:08
Well, I also agree with my idea hahahah (thank God hahah). What Martin
proposes is already provided out of the box, since it has nothing to do with
Fuel itself: just pass around a compression stream and that's all. In fact,
that's what FuelCompression used to do :)
But for the reasons you mention above, I think giving our other alternative a
try is worthwhile!
Original comment by marianopeck on 27 Feb 2013 at 11:26
I don't disagree! On the contrary, I think it's cool to experiment with these ideas.
Original comment by [email protected] on 6 Mar 2013 at 3:52
OK. I followed Max's idea and I found a few problems and possible improvements.
- There was a bug with strings longer than 255 characters because we used only
1 byte to store the size. We now use either one byte or four. If someone could
improve this even more, cool; most strings will fit in 1 byte, so that's fine.
- The [0000] mark was unnecessary and it meant 4 extra bytes PER string.
- It now supports both strings and symbols.
Original comment by marianopeck on 2 Oct 2013 at 10:20
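The thread doesn't show the actual "one byte or four" encoding, but a common scheme is a one-byte size with an escape value for large strings. A hypothetical Python sketch of that scheme (all details assumed, not Fuel's real wire format):

```python
def encode_size(n: int) -> bytes:
    """1 byte for sizes below 255; escape byte 255 + 4 big-endian bytes otherwise."""
    if n < 255:
        return bytes([n])
    return bytes([255]) + n.to_bytes(4, "big")

def decode_size(buf: bytes) -> tuple:
    """Return (size, number of header bytes consumed)."""
    if buf[0] < 255:
        return buf[0], 1
    return int.from_bytes(buf[1:5], "big"), 5

assert decode_size(encode_size(42)) == (42, 1)        # common case: 1 header byte
assert decode_size(encode_size(100000)) == (100000, 5)  # long string: 5 header bytes
```

This matches the observation that most strings fit in one byte, while long strings pay only four extra header bytes.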
btw, I committed to http://smalltalkhub.com/mc/Fuel/Experiments/main
Original comment by marianopeck on 2 Oct 2013 at 10:21
This issue has been automatically marked as stale because it has not had recent activity. It will remain open but will probably not come into focus. If you still think this should receive some attention, leave a comment. Thank you for your contributions.
@tinchodias @marianopeck We should totally do this! This can be such a big improvement, and with 4.0.0 we can make it configurable very easily.
Wow, I don't remember much about this feature, but it's great to have the discussion from 2013. I re-read it now. Would you recover the old code, or implement it from scratch?