rubyXL
rubyXL copied to clipboard
Performance Issue on writing, and trying to fix it
I open this thread related to what I've posted #245
Because initially the problem was on import/update of a excel file, and in my case on export.
See progress here: https://github.com/anykeyh/rubyXL
The problem
- Exporting an excel of around ~100k cells take 25 minutes on sidekiq. Gathering the information is good (few seconds), but rendering using RubyXL is not, and take 99% of the time.
After analyzing the situation, the main problem comes from the reusability of the stylesheets, font and border styles.
All comes from this method in ooxml_object.rb:
def ==(other)
other.is_a?(self.class) &&
obtain_class_variable(:@@ooxml_attributes).all? { |k, v| self.send(v[:accessor]) == other.send(v[:accessor]) } &&
obtain_class_variable(:@@ooxml_child_nodes).all? { |k, v| self.send(v[:accessor]) == other.send(v[:accessor]) }
end
This check if a style/font/border etc... already exists and return the ID of it for linking in the excel file. The main problem is this computation is long, as all fields are tested, then all subnodes and all fields of subnodes are tested also...
Theses tests are present in convenience_methods.rb, for example:
def register_new_font(new_font, old_xf)
new_xf = old_xf.dup
# v--- HERE for fonts
new_xf.font_id = fonts.find_index { |x| x == new_font } # Reuse existing font, if it exists
new_xf.font_id ||= fonts.size # If this font has never existed before, add it to collection.
fonts[new_xf.font_id] = new_font
new_xf.apply_font = true
new_xf
end
And performance keep going down as more style exists in your workbook.
Here is an example of excel which take ~8 seconds to generate, and which should not. Ruby-prof give good insight of the problem:
require 'rubyXL'
require 'benchmark'
begin
require 'ruby-prof'
rescue LoadError
puts "This example case is using ruby-prof gem..."
exit
end
def generate_my_excel
@workbook = RubyXL::Workbook.new
@workbook.worksheets.clear
worksheet = @workbook.add_worksheet("TEST")
x = 0
y = 0
bm = Benchmark.realtime do
200.times do |x|
worksheet.add_cell(x, y, "#{x}")
worksheet.sheet_data[x][y].change_border(:top, 'thin')
worksheet.sheet_data[x][y].change_border(:right, 'thin')
worksheet.sheet_data[x][y].change_border(:left, 'thin')
worksheet.sheet_data[x][y].change_border(:bottom, 'medium')
worksheet.sheet_data[x][y].change_font_bold true
x+=1
end
end
puts "Time for header: #{bm}ms"
x = 0
y = 1
50.times do |row|
200.times do |cols|
x+=1
value = rand*100
worksheet.add_cell(x, y, value)
if value > 0.8
worksheet.sheet_data[x][y].change_border(:top, 'thin')
worksheet[x][y].change_fill('00ff00')
elsif value > 0.5
worksheet.sheet_data[x][y].change_border(:top, 'thin')
worksheet[x][y].change_fill('ffff00')
else
worksheet.sheet_data[x][y].change_border(:top, 'thin')
worksheet[x][y].change_fill('ff0000')
end
end
y+=1
x = 0
end
end
RubyProf.start
puts "Generation 200x50 excel with style..."
generate_my_excel
result = RubyProf.stop
printer = RubyProf::GraphPrinter.new(result)
printer.print(STDOUT)
output = "write_complex_table.xlsx"
@workbook.write(output)
My idea
I want to implement a test by hash, and an auto hash update on modification of the fields and child nodes.
Eg.
def some_accessor= x
@hash ^= @some_accessor.hash #Remove the X-OR hash merge of the current version of the object
@some_accessor = x #Setter
@hash ^= x.hash # Merge the hash of the object
end
# ...
def hash
@hash
end
# ...
def ==(other)
other.is_a?(self.class) && self.hash == other.hash
end
So far I've implemented a naive method which is working more or less, with some bugs on some styles. The performance are astonishing (30 sec against 20 minutes... lol), but not yet ready for production.
The problem of hash collision also is not taken in consideration. Hash are 64 bit long and chance of collision on a small set like stylesheets and fonts is less than 10^-15. In short, you can win billions of time in the lottery before it happens. However, I still need to figure out if the hash function is good enough and spread uniformly. For example, an object with only two nil value will have a hash equals to zero (the two nil.hash canceling each others in the xor operator...)
The second problem is when a field/subnode mutate, it should update the hash of the parent, which is not the case now, leading to issues on finding the styles and wrong styles rendering.
If you have any better idea than this hashing method, I would be pleased to hear from you !
Update
I've fixed the code, I wasn't able to make hash update propagation to parent works properly, but not so much time to give on this project so I just flag the parent as hash_dirty and recompute the hash on purpose.
Because some sub-elements of the hash are already computed, the overload in terms of CPU usage remains low enough. I got a 2900% improvement on my production report generation (25 minutes to 51 seconds...), which is I guess not bad 😄
For avoiding hash collision and negation, the hash is computed by the combination of the ruby#hash method of the name of the parameter and the value
E.g. for the combination color/gray, it computes "color".hash * "gray".hash
nil in different fields give different hashes and will not negate anymore; I'm not a math superstar so someone with a better knowledge in the domain should give a look and do fix if needed.
I'm waiting discussion below to see if I make a pull request, as it can have undiscovered effect for now. You can give a look at my fork for testing it: https://github.com/anykeyh/rubyXL
Right now your question is like "I'm trying to dig a Panama Channel with spoons, and it's way too slow, you should fix your spoons!" While you should rather be using a backhoe.
- If you want to set style on multiple cells in a rectangular array at once, you should look at
ColumnRangeandRowobjects. - If you want to change multiple borders at once, you should not be using convenience methods (they are there for, as the name implies, your convenience only). You should create your own style and apply that style to all cells.
In short, tell me what you are trying to achieve, and I will tell you how to do it efficiently.
Well thank for your message, but I'm a bit confused now.
Right now your question is like "I'm trying to dig a Panama Channel with spoons, and it's way too slow, you should fix your spoons!" While you should rather be using a backhoe.
Because if I read correctly what you say it's more or less like "go use another tool", eg. another gem; using rubyXL managing by myself the styles is like meh. Because if I really like rubyXL, it's for its spoon-like API syntax, making thing fast and effective. This gem is really like how Ruby is made for, simple and easy to use but impressive under the hood.
However, I'm not sure in this case denying the performance problem will resolve it. Indeed the convenience methods are here for convenience, but:
1/ They are thoses who are explained in the README.md.
2/ They works everywhere, which is not the case of Row or ColumnRange (And I'll give a look for the second one, as I didn't explored it yet !).
In short, tell me what you are trying to achieve, and I will tell you how to do it efficiently.
I've a big excel sheet with time ranges, kpis etc etc... of employees (500+ employees) over 30 days, and compound-style cells, related to the value of the content BUT also to the position of the cell in this matrix. Plus some sub-header which groups the lines per projects and per sub-tasks and so on.
Roughly 85'635 cells, and roughly 25 minutes for rendering the Excel, only 20 seconds of SQL & data-juggling, and 8 seconds of zipping, which let 24 minutes and 28 seconds of cells writing.
With this fix, I'm more or less around 1 minute 30, or somewhere like ~60 seconds of cells writing now.
Here is a more complete test-case I written as I think will be more talkative: https://gist.github.com/anykeyh/aa5241ba5b0c5095c8f8d8bf97833cbc
TL;DR; it just render 4 times an excel with more or less style. 5000 rows, only ~8 differents styles. (3 for the header, 1 for sub header, 4 for columns I think). Then do it again after injecting the hashing equality method.
| Styles in my sheet | Vanilla | Hash |
|---|---|---|
| No styles | 0.723s | 1.051s |
| Few styles | 2.359s | 2.523s |
| Half styles | 6.408s | 2.909s |
| All styles | 11.663s | 3.96s |
Each time it will render the excel while enabling more style, from no style at all to the 8 styles activated. The problem is not it getting slower, it's it's slowing down exponentially.
Also editing style manually was indeed my first option, but I thought it would negate the added value of RubyXL, and handling all the differents "compound" styles ( value of content + column type ) would have been an hassle of combinations to manage for me. If I want to go low level, I would go to another gems more focused on performances.
And to answer your comment on #268 about this issue, this is definitely the root of all the performances issues which are opened here, not only by me.
Most of them comes with excel editing, because people thinks they can open a beautiful and full of styles excel with this gem, update the cells in their specifics data then save the full-filled excel; except this will trigger this performance problem at the moment they use any cell.change_xxx convenient method.
And for the code readability with the class macro, I'm agreeing and I would be pleased to rewrite it using define_method to make it more clean, if needed.
Regards,
Yacine
OK, let me reiterate it like this:
RubyXLwas written with the goal to access data from existing spreadsheets first and foremost. Over time, it grew in capabilities, so it's secondary goal is to load an existing spreadsheet, slightly modify it and save without breaking. "Create a complex spreadsheet from scratch", while possible, was never an intended goal of this gem, although admittedly, there's hope to eventually achieve that. (And given how complex OOXML structures are, I am not anticipating that to happen in this century, given that I'm working on it alone in my free time.)- I oftentimes receive requests, like, "can you make the gem do X?". The problem is, everyone wants a different X, while I have to keep the community interests in mind. So if somebody comes up with a way to make X quick, it slows down W, Y and Z. In your case, your patch severely affects the readability of the code, and also increases the memory footprint. (Which is, in case of big spreadsheets that I have to handle reading of, quite important).
- While looking at your submission, I identified the issue with
validate_worksheetusing wrong equality operator. I'm submitting a patch and releasing a new gem version; see if it helps your case.
I identified the issue with validate_worksheet using wrong equality operator. I'm submitting a patch and releasing a new gem version; see if it helps your case.
Just for posterity in case anyone else is wondering whether/where this was fixed, but looks like it was here: https://github.com/weshatheleopard/rubyXL/commit/6252df10613a439d3cf40e212557714969048a3b