hasten
Fails on binary field data
MySQL allows storing "binary" data types. mysqldump outputs these as raw byte strings, which results in invalid UTF-8 sequences that cause the following stack trace when the dump file is piped through hasten:
$ time zcat mysqldump.sql.gz | hasten | mysql test_hasten
/usr/local/rvm/gems/ruby-1.9.3-p484/gems/hasten-1.0.0/lib/hasten/command.rb:6:in `match': invalid byte sequence in UTF-8 (ArgumentError)
from /usr/local/rvm/gems/ruby-1.9.3-p484/gems/hasten-1.0.0/lib/hasten/command.rb:6:in `match'
from /usr/local/rvm/gems/ruby-1.9.3-p484/gems/hasten-1.0.0/lib/hasten/command.rb:6:in `complete?'
from /usr/local/rvm/gems/ruby-1.9.3-p484/gems/hasten-1.0.0/lib/hasten/dump.rb:36:in `parse_command'
from /usr/local/rvm/gems/ruby-1.9.3-p484/gems/hasten-1.0.0/lib/hasten/dump.rb:13:in `execute'
from /usr/local/rvm/gems/ruby-1.9.3-p484/gems/hasten-1.0.0/bin/hasten:16:in `<top (required)>'
from /usr/local/rvm/gems/ruby-1.9.3-p484/bin/hasten:23:in `load'
from /usr/local/rvm/gems/ruby-1.9.3-p484/bin/hasten:23:in `<main>'
from /usr/local/rvm/gems/ruby-1.9.3-p484/bin/ruby_executable_hooks:15:in `eval'
from /usr/local/rvm/gems/ruby-1.9.3-p484/bin/ruby_executable_hooks:15:in `<main>'
I imagine it would be rather difficult to identify these binary strings and treat them separately in order to avoid this. I guess the other alternative would be to load the file as ASCII-8BIT, but then I'm not sure what would happen with valid UTF-8 characters...
Bit of a pickle, not sure if there is a good solution, but wanted to at least get the issue out there. I was trying it out on one of our dumps to see if it would speed things up on restore, which we sorely need, but this is unfortunately a blocker for using hasten.
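For what it's worth, a rough sketch of that ASCII-8BIT idea (completely untested, and not how hasten currently works) would be to read stdin as raw bytes so reading never raises, build a sanitized copy purely for the regex matching, and write the original bytes through untouched so the binary data still reaches mysql intact:

STDIN.set_encoding(Encoding::ASCII_8BIT)
STDIN.each_line do |raw|
  # Build a sanitized copy used only for matching.
  matchable = raw.dup.force_encoding(Encoding::UTF_8)
  unless matchable.valid_encoding?
    # Replace invalid byte sequences via the usual UTF-16 round trip
    # (String#scrub only exists on Ruby >= 2.1).
    matchable = matchable.encode(Encoding::UTF_16, invalid: :replace, replace: '?')
                         .encode(Encoding::UTF_8)
  end
  # ... run hasten's command matching against matchable here ...
  STDOUT.write(raw) # emit the original, unmodified bytes
end

Valid UTF-8 lines skip the round trip entirely, so only genuinely invalid bytes would get replaced, and only in the copy used for matching.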
Thanks for the report! If you could make available an example dump containing such a binary string, I could tackle it. Or if you want to open a PR, I would gladly review it.
Test file available at https://gist.github.com/HerbCSO/2b4213cdc1e81df121e8 (uuencoded). I'd love to do a PR, but as I mentioned I'm not really sure how to approach it.
To reproduce, once you have the file uudecoded, run cat binary_data.sql | hasten | mysql binary_test. It should produce the following error:
/Users/carsten.dreesbach/.rvm/gems/ruby-1.9.3-p551/gems/hasten-1.0.1/lib/hasten/command.rb:6:in `match': invalid byte sequence in UTF-8 (ArgumentError)
from /Users/carsten.dreesbach/.rvm/gems/ruby-1.9.3-p551/gems/hasten-1.0.1/lib/hasten/command.rb:6:in `match'
from /Users/carsten.dreesbach/.rvm/gems/ruby-1.9.3-p551/gems/hasten-1.0.1/lib/hasten/command.rb:6:in `complete?'
from /Users/carsten.dreesbach/.rvm/gems/ruby-1.9.3-p551/gems/hasten-1.0.1/lib/hasten/dump.rb:36:in `parse_command'
from /Users/carsten.dreesbach/.rvm/gems/ruby-1.9.3-p551/gems/hasten-1.0.1/lib/hasten/dump.rb:13:in `execute'
from /Users/carsten.dreesbach/.rvm/gems/ruby-1.9.3-p551/gems/hasten-1.0.1/bin/hasten:16:in `<top (required)>'
from /Users/carsten.dreesbach/.rvm/gems/ruby-1.9.3-p551/bin/hasten:23:in `load'
from /Users/carsten.dreesbach/.rvm/gems/ruby-1.9.3-p551/bin/hasten:23:in `<main>'
+1
+1
+1
If anyone is interested, I've modified it here: https://github.com/thirtysixthspan/hasten/pull/2. It replaces the characters causing issues with '?'.
Adding the mysqldump option --hex-blob seems to be a workaround for this issue at the moment.
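That flag makes mysqldump write binary columns as hexadecimal literals (for example, 'abc' becomes 0x616263) instead of raw bytes, so the dump stays valid UTF-8. Something along these lines, with database and file names as placeholders:

mysqldump --hex-blob binary_test | gzip > mysqldump.sql.gz
zcat mysqldump.sql.gz | hasten | mysql binary_test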
I've made a bash script here: https://github.com/meme-lord/hasten_py/blob/master/hasten.sh that basically does what hasten does. hasten is a lot slower because it parses all the SQL for no real reason. The script also has no problems with encoding or binary data.
Interesting script, although that isn't quite what hasten does.
Hasten removes the key definitions from new tables, then adds them back at the end via ALTER statements.
Your script is interesting, and the autocommit=0 angle is a good one, but for very large databases I'm not sure a single commit at the very end is a great idea.
I am probably going to write my own script that disables autocommit and then adds a COMMIT after a configurable number of INSERT lines, as a happy medium between "COMMIT after every single INSERT" and "COMMIT only once at the end".
I think that should speed up INSERTs into large InnoDB tables considerably, without hasten's somewhat complex approach.
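Something like this (an untested sketch; BATCH is just a made-up knob, and the INSERT detection is deliberately naive):

BATCH = Integer(ENV.fetch('BATCH', 1000)) # INSERT statements per transaction
count = 0
puts 'SET autocommit=0;'
STDIN.each_line do |line|
  print line
  next unless line.start_with?('INSERT')
  count += 1
  puts 'COMMIT;' if (count % BATCH).zero?
end
puts 'COMMIT;' # flush whatever remains in the final batch

You'd use it the same way as hasten itself, e.g. zcat dump.sql.gz | ruby batch_commits.rb | mysql somedb, where batch_commits.rb is whatever you name the file.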
That's an excellent idea. I can try implementing it the next time I need to do a large import. I didn't spot the key definition thing last time I read through the source. Do you think grouping inserts would also give a speedup? Some SQL dumps I've seen have one INSERT statement per row.
Do you think grouping inserts would also give a speedup?
It definitely does, but I think that is best handled by mysqldump's --extended-insert flag:
Write INSERT statements using multiple-row syntax that includes several VALUES lists. This results in a smaller dump file and speeds up inserts when the file is reloaded.
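For illustration (table and values made up), instead of one statement per row:

INSERT INTO t VALUES (1,'a');
INSERT INTO t VALUES (2,'b');

--extended-insert (enabled by default in recent mysqldump versions as part of --opt) produces:

INSERT INTO t VALUES (1,'a'),(2,'b');

so the server parses and commits one statement instead of many.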