cascalog icon indicating copy to clipboard operation
cascalog copied to clipboard

Lines ending with a delimiter confuse hfs-delimited

Open methylene opened this issue 11 years ago • 19 comments

I've noticed I get an exception when I use something like

((hfs-delimited "hfds:///a.txt") ?x ?y) 

as a generator, and a.txt contains some lines where the second token is empty, so that the line ends in \t\n. In this case, I get a "operation added wrong number of fields" exception.

I'm not sure whether it's a cascading or cascalog problem.

The cascalog version is 2.0.0, the hadoop version is 1.1.2.

methylene avatar Jan 13 '14 22:01 methylene

Acually the lines I used had more tokens than 2. I hope it can be reproduced anyway.

methylene avatar Jan 13 '14 22:01 methylene

Also, is there a way to disable quoting entirely on an hfs-delimited source tap? At the moment I just use some unlikely quote character like :quote "\u0003" but still have to strip that character from the input, along with delim chars

methylene avatar Jan 16 '14 09:01 methylene

Not sure on this one - you might check out the underlying implementation in Cascading? This would be a nice feature to add.

sritchie avatar Jan 16 '14 14:01 sritchie

@methylene you can pass in :strict? false option into hfs-delimited and your missing field at the end would just be null. Regarding quote, I don't think you can disable it.

I'm going to close this issue for now. Feel free to re-open or msg me if this doesn't work for you.

Quantisan avatar Jan 27 '14 02:01 Quantisan

Hi Paul, thanks for the tip, that's definitly helpful. But I think it's only a workaround. Maybe there is a misunderstanding? In my input data there is no "missing" field, but an empty one at the end of some line, right before the line break. In particular, there is no line in my input that doesn't have the correct number of delimiters. Using :strict? false will resolve the error, but it also disables a reasonable sanity check on each line. So yes I think this should be reopened.

methylene avatar Jan 31 '14 20:01 methylene

@methylene delimited expects an ISO legit delimited file. Your best bet is to just read it in with hfs-textline then parse it manually. Does that answer your question?

Quantisan avatar Feb 05 '14 23:02 Quantisan

Please explain, how a csv line with the correct number of tokens, but the last token empty, violates an ISO standard?

methylene avatar Feb 16 '14 08:02 methylene

You mean, adding quote chars would make the problem go away?

methylene avatar Feb 16 '14 09:02 methylene

So yes, I have yet to try

"a""t"b\t"" instead of a\tb\t

as input line, but I'm afraid that's not what you get as an output when using hfs-delimited or hfs-textline as a sink

methylene avatar Feb 16 '14 09:02 methylene

When using hfs-delimited or hfs-textline as a sink tap, empty strings are encoded just the same way as nulll values, that can also be a problem

methylene avatar Feb 16 '14 09:02 methylene

Is this test case corresponding to your problem? https://github.com/Quantisan/cascalog/commit/cac0119730b14035162a6e624f7edd4ad73353dc

Quantisan avatar Feb 26 '14 23:02 Quantisan

I think there's something wrong with that test case. Yes it looks as if it confirms the issue. But the following fails, too:

(deftest delimited-empty-notlast-test
  (fact
    (io/with-fs-tmp [_ tmp]
      (?- (hfs-textline tmp) ;; write line
          [["Proin\t\thendrerit"]])
      (fact "Test hfs-delimited where last element is empty"
        (<- [?a ?b !c]
            ((hfs-delimited tmp :delimiter "\t") ?a ?b !c)) =>
        (produces [["Proin" nil "hendrerit"]])))))

methylene avatar Mar 10 '14 17:03 methylene

Sorry, it should have been !b, not ?b. The test describes the problem well. The following test succeeds, but fails without the (remove (comp empty? last)) line:

(deftest delimited-empty-token-test
  (let [lines (->> [["0" "1" "2"] [""  "1" "2"]
                    ["0" ""  "2"] [""  ""  "2"]
                    ["0" "1" "" ] [""  "1" "" ]
                    ["0" ""  "" ] [""  ""  "" ]]
                   (remove (comp empty? last))
                   (vec))]
    (io/with-fs-tmp [_ tmp]
      (?- (hfs-textline tmp) ;; write line
          lines) 
      (fact (str "Test with nil at various positions") 
            (<- [!a !b !c]
                ((hfs-delimited tmp :delimiter "\t") !a !b !c)) =>
                (produces (mapv (partial mapv not-empty) lines))))))

methylene avatar Mar 10 '14 20:03 methylene

By the way, I haven't used midje before, but I wonder why all tests in cascalog.more-taps-test have two (fact) forms, the outer one having only one argument and no "=>" symbol?

methylene avatar Mar 10 '14 20:03 methylene

I have this problem as well. My file uses commas and quotes, so stuff like

"col","another",""

I don't get any error, just no result at all. It's also not trapped.

This is the code I use to parse it

(defn log-source [path fields]
  (let [cfields (vec (map (partial str "?") fields))
        collect (mapfn [& values]
                  [(zipmap fields values)])
        log (hfs-delimited path
                   :outfields cfields
                   :delimiter ","
                   :compress? true
                   :strict? false)]
    (prn cfields)
    (<- [?map]
        (:trap (stdout))
        (:distinct false)
        (log :>> cfields)
        (collect :<< cfields :> ?map)
      )))

So if I understand correctly, ther is a testcase, but no fix?

I'm not sure what's going on here. It seems to matter which of these two lines I pick.

    (log :> ?field ?another _) ; works
    (log :> ?field ?another ?broken)
    (log :> ?field ?another !works) ; AHA!

pepijndevos avatar Mar 25 '14 10:03 pepijndevos

The cascading guys run their tests against Cascalog, so you should be able to include it in your project.clj and override the cascalog version. If it works, I'd love a pull req to bump the default cascading version.

Thanks, Sam

Pepijn de Vos mailto:[email protected] March 25, 2014 4:21 AM

It seems this is fixed upstream? https://groups.google.com/d/msg/cascading-user/9vy5H0gY8sk/kVM-YRPdycgJ

Can I use that with Cascalog?

— Reply to this email directly or view it on GitHub https://github.com/nathanmarz/cascalog/issues/228#issuecomment-38548752.

Sam Ritchie (@sritchie) Paddleguru Co-Founder 703.863.8561 www.paddleguru.com http://www.paddleguru.com/ Twitter http://twitter.com/paddleguru// Facebook http://facebook.com/paddleguru

sritchie avatar Mar 25 '14 13:03 sritchie

I deleted my comment because I realized they where talking about cascading 2.1, while we're already using 2.5, so the thread is irrelevant.

I also fixed my problem by using a Nullable variable, so the empty last column doesn't cause it to be excluded. Not sure if @methylene still has problesm.

pepijndevos avatar Mar 25 '14 14:03 pepijndevos

@sritchie I've tried putting the [cascading/cascading-hadoop "2.6.0-wip-2"] into the cascalog-more-taps/project.clj in the d76a6b12083c11915b525988fa1ca1ad30063b36 commit, appended the test above to cascalog-more-taps/test/cascalog/more-taps-test.clj, then ran the test using lein test :only cascalog.more-taps-test/delimited-empty-token-test. But the issue seems to be still there.

methylene avatar Apr 08 '14 20:04 methylene

It's not working, but here's a commit to add the test and bump the cascading version: 71d22c1618428d8de18c7112d257f1555e7029ef

methylene avatar Apr 13 '14 20:04 methylene