scalding
Installation/getting started instructions outdated?
Hi,
I've been trying to install & run Scalding, but I've been running into some issues... it looks as though some of the installation directions may be outdated. I could go through the wiki and try to fix things myself, but I'd prefer it if someone more knowledgeable about the project did instead, as I don't want to add any incorrect information.
Working from https://github.com/twitter/scalding/wiki/Getting-Started I ran into this error:
Johns-MacBook-Pro :: ~/tmp » git clone git@github.com:twitter/scalding.git -b develop
Cloning into 'scalding'...
Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
However, git clone https://github.com/twitter/scalding.git did work... I assume GitHub has changed its permissions scheme somehow so that the old clone command no longer works.
It also looks as though the installation instructions for Scala & sbt may be out of date. The Getting Started page mentions both sbt 0.11 and sbt 0.12. This page seems to get me Scala 2.8.1, but my impression from looking at project/Build.scala is that that's not the preferred version (the develop branch seems to want 2.9.3 and the master branch seems to want 2.9.2). In general, the information in "Using Scalding with other versions of Scala" seemed quite outdated; I had a hard time finding the source files/lines of code it was referring to.
The WordCountJob example from https://github.com/twitter/scalding/wiki/Getting-Started doesn't seem to work either... I get this error (with Scala 2.9.3):
WordCountJob.scala:5: error: invalid escape character
.flatMap('line -> 'word) { line : String => line.split("\s+") }
^
one error found
(Bits of code from this example are also used elsewhere on the page... overall, seems likely the entire page could use a revamping.)
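For what it's worth, the "invalid escape character" error above most likely comes from the string literal "\s+": \s is not a valid escape in an ordinary Scala string, so the backslash has to be doubled, or the pattern written as a triple-quoted string. A minimal sketch in plain Scala, with no Scalding dependency (the object and method names are made up for illustration):

```scala
object SplitOnWhitespace {
  // Doubling the backslash makes the regex engine see \s+ (one or more whitespace characters).
  def tokenizeEscaped(line: String): Array[String] =
    line.split("\\s+")

  // A triple-quoted string does not process backslash escapes, so \s+ can be written directly.
  def tokenizeRaw(line: String): Array[String] =
    line.split("""\s+""")
}
```

Either form should work in place of line.split("\s+") in the wiki example.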
A couple other suggestions to make scalding easier to install and increase adoption:
- Maybe the master branch could be the default instead of the develop branch? It doesn't look to me as though the tests are all passing on the development branch, and it's the one I get when I run git clone https://github.com/twitter/scalding.git
- Maybe the readme could make it clear which versions of Scala work well with Scalding? It's awkward for me to have to read the source in project/Build.scala to figure out which version of Scala is preferred.
I've been looking at Scalding for only 2 minutes, quickly scanning the README, and some issues regarding the first code example jumped out at me:
TypedPipe.from(TextLine(args("input")))
  .flatMap { line => tokenize(line) }
  .groupBy { word => word } // use each word for a key
  .size // in each group, get the size
  .write(TypedText.tsv[(String, Long)](args("output")))
- The groupBy would be more idiomatic if written as: groupBy(identity).
- The size does not do what the comment suggests. Clearly this code example never worked because the next line, with the write, would result in a compiler error.
I found a similar method in the examples directory, which looks more reasonable:
class WordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap { line => line.split("\\s+") }
    .map { word => (word, 1L) }
    .sumByKey
    // The compiler will enforce the type coming out of the sumByKey is the same as the type we have for our sink
    .write(TypedTsv[(String, Long)](args("output")))
}
Can you explain why the comment is wrong? (size does work that way, as I understand the comment.) Also, can you post the compiler error?
I don't see the error.
Many people who start with Scalding find it confusing that the methods on grouped things apply to each group. See https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/typed/KeyedList.scala#L45 where almost all the methods on groups are defined.
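As a rough analogy using only the Scala standard library (not Scalding's typed API), "size applies to each group" can be sketched like this. The tokenize helper is an assumption, since the README snippet does not show its definition, and the object name is made up:

```scala
object GroupedSizeSketch {
  // Hypothetical tokenizer: split on runs of whitespace and drop empty tokens.
  def tokenize(line: String): Seq[String] =
    line.trim.split("\\s+").toSeq.filter(_.nonEmpty)

  // Group identical words together, then take the size of each group,
  // which yields one (word, count) pair per distinct word.
  def wordCounts(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size.toLong) }
}
```

In Scalding the per-group step is written as .size directly on the grouped pipe; the explicit map here only stands in for that behavior on ordinary collections.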
@johnynek I probably posted too quickly, after only a brief glance at the docs. Should know better.
I just found Intro to Scalding Jobs on the wiki. Perhaps a link in the README from the WordCountJob source code to the walkthrough might be helpful.
After reading the walkthrough, I realized that TypedPipe.from returns a stream. This size method is unlike the Scala collections method of the same name in that it does not return a single Int. Instead, this size is a combinator that deals with streams, which makes sense given the nature of the library. I think that is what your most recent post was trying to tell me.
I expect that my comment about using identity is probably correct. I'll need to actually run the example Scalding code to know for sure, and that is not something I know how to do yet.
I have not yet found an explanation for why two versions of WordCountJob exist.
Yes, you are correct that we could have written groupBy(identity). About the two ways to write it, it is a minor concern. There are often multiple ways to express a computation. It's actually hard for me to say why you'd do one or the other here. groupBy gives you access to more operations, sumByKey gives you the common idiom of doing some Semigroup reduction for each value.
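To make the "Semigroup reduction for each value" idea concrete, here is a minimal sketch in plain Scala. The Semigroup trait below is a hypothetical stand-in for the one Scalding gets from Algebird, and this sumByKey is an illustrative function over an in-memory Seq, not Scalding's implementation:

```scala
object SumByKeySketch {
  // Hypothetical stand-in for a Semigroup: an associative way to combine two values.
  trait Semigroup[V] { def plus(a: V, b: V): V }

  // Longs combine by addition.
  implicit val longSemigroup: Semigroup[Long] = new Semigroup[Long] {
    def plus(a: Long, b: Long): Long = a + b
  }

  // For each key, combine all of its values using the Semigroup.
  def sumByKey[K, V](pairs: Seq[(K, V)])(implicit sg: Semigroup[V]): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(sg.plus(_, v)).getOrElse(v))
    }

  // Word count: pair every word with 1L, then sum the ones per word.
  def wordCount(words: Seq[String]): Map[String, Long] =
    sumByKey(words.map(w => (w, 1L)))
}
```

For word counting this gives the same result as grouping by the word and taking each group's size; the difference is that the sum form works for any value type that has a Semigroup.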