cascading-cassandra
cascading-cassandra copied to clipboard
Sink Skip Task Bug & JSON Deserialisation to String
- Changed to prevent killing task on last modified
- added JSON deserialisation to String
- added dependancy for JSON jar
This is my first ever pull request, apologies if it's crap!
hi mat,
i have some issues :
- JSON serializing is client dependent : there are a whole bunch of different JSON libs, and cascading-cassandra shouldn't force any specific one
- getModifiedTime is used differently by sink/source : the value 0 has special meaning for sources (BaseFlow/getSourceModified) where it represents a resource which doesn't exist : i don't think we should return 0 for getModifiedTime when the Tap is sourcing, since it seems quite likely to break or change something else
- there should be a unit test which fails without the patch, and passes with the patch : i still don't understand when this problem occurs or otherwise : if i run a (hadoop) cascalog job which sinks to cascading-cassandra under a debugger, the CassandraTap/getModifiedTime method doesn't seem to get called at all. no doubt i don't understand because i don't know cascalog/cascading's internal mechanics well enough (the effects of local/hadoop java/cascalog etc), but in the absence of that understanding i would like a test which makes some sense of the complexity
tho maybe @ifesdjeen knows the cascading mechanics better than i do, and is happy with the change, in which case i'm happy to defer to his superior knowledge :)
I do think we should add custom pluggable serialisers, but I do not think it's a good idea to bring in another dependency... maybe we could add a configurable fallback that would receive an object and return bytebuffer, that way everyone can roll out his own deserialisation mechanism?
I would really really really appreciate a piece of Java code that reveals the problem though :) I would port it to test myself and we all will be happy :)
Let's make it that way: we roll out custom serialization via config, and @matlarsen shows us Java code that causes the problem in return. Deal? :)
sounds like a good deal to me :)
Hey
Agree with the JSON thing. Should never have put it in there.
I agree it should change depending on whether sink/source. I didn't realise 0 had a special meaning (makes sense it means missing though). I can't think of a sensible educated value to set it to however, 1? :o)
It doesn't get called at all for you? I don't understand what's going on then; it seems to me from my digging around the Cascading code that it must be called (ergo, it is checking if sink age > source age). The only case where it isn't (referring to the source here) is when the Run ID is null - maybe it has something to do with the nature of how cascading jobs are executed? I need to have a look at the planner code to try and find out why - I don't have the time to do this at the moment. When I get time I'll see if I can work my job down to a core path that causes it to fail (sorry atm it's epicly mad and I have to be sloppy) and try to nail that SOB in the planner.
Thanks for your comments and allowing me to contribute!
the special meaning of 0 for sources doesn't matter if we only set the value to 0 when sinking
getSourceModified() does not appear to be called at all for me : i've no idea why : it is possible that i have a problem with my debugging setup, though other breakpoints in code from the same jar are working fine... it's possible that my source version is wrong though : i'll dig deeper when i've got some time