training ampcamp6: parquet read & tachyon config bugs

Open • tranlm opened this issue 9 years ago • 1 comment

While running through the exercise code, I found the following issues:

Data Exploration using Spark SQL page:

"parquetFile" has been deprecated, so the exercise code should be changed to wikiData = sqlCtx.read.parquet("data/wiki_parquet")
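
If the exercise code needs to run on both older and newer Spark releases, a small wrapper could hide the API change. This is only a sketch: `load_parquet` is a hypothetical helper name, not something in the exercises, and it assumes the context object exposes either the `read` attribute (Spark 1.4 and later) or only the older `parquetFile` method.

```python
def load_parquet(sql_ctx, path):
    """Load a Parquet dataset, preferring the non-deprecated reader API.

    `load_parquet` is a hypothetical helper, not part of the AMP Camp code.
    """
    reader = getattr(sql_ctx, "read", None)
    if reader is not None:
        # Spark 1.4+: SQLContext.read returns a DataFrameReader.
        return reader.parquet(path)
    # Older releases: fall back to the deprecated method.
    return sql_ctx.parquetFile(path)
```

In the pyspark shell this would be called as wikiData = load_parquet(sqlCtx, "data/wiki_parquet").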

Explore In-Memory Data Store Tachyon page:

The "tachyon" folder is now a subfolder of spark
TACHYON_WORKER_MEMORY_SIZE is already set to 1GB
When I try to format the storage with the command "tachyon format", the class tachyon.Format cannot be found. To fix this:

  export TACHYON_JARS="$TACHYON_HOME/../lib/tachyon-assemblies-${VERSION}-jar-with-dependencies.jar"

The command "tachyon runTests" fails all the tests
In the section "Run Spark on Tachyon", the command "./bin/spark-shell" is specific to Scala. It should be generalized for users of other languages, e.g. Python

Querying compressed RDDs with Succinct Spark page:

Correct "articleIds.count" to "articleIdsRDD.count"
"val succinctWikiKV = wikiKV.map(t => (t._1, t._2.getBytes).succinctKV" is missing a closing parenthesis, i.e. ")", to close the call to map.
The following should be combined

val wikiKV2 = sc.textFile("data/succinct/wiki-large.txt")
    .map(_.split('|'))
    .map(t => (t(0).toLong, t(1)))

into one line

val wikiKV2 = sc.textFile("data/succinct/wiki-large.txt").map(_.split('|')).map(t => (t(0).toLong, t(1)))
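
For readers following along in Python, what the chained maps compute for each input line can be sketched in plain Python, with no Spark required. `parse_wiki_line` is a hypothetical name and the sketch assumes each line has the form `<id>|<text>`; it only roughly mirrors the Scala code, since edge cases such as trailing empty fields or whitespace around the id behave differently.

```python
def parse_wiki_line(line):
    """Roughly mirror .map(_.split('|')).map(t => (t(0).toLong, t(1))) for one line.

    Hypothetical helper for illustration only.
    """
    fields = line.split("|")
    # As in the Scala version, only the first two fields are used;
    # anything after a second '|' is dropped.
    return (int(fields[0]), fields[1])
```

Each result is an (id, text) pair, which is the key-value shape the later succinctKV examples work with.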

Change

val wikiSuccinctKV2 = sc.succinctKV[Long]("data/succinct/succinct-wiki-large")
wikiSuccinctKV2.count

to

val succinctWikiKV2 = sc.succinctKV[Long]("data/succinct/succinct-wiki-large")
succinctWikiKV2.count

Change "val articleIdsRDD3= succinctWikiKV3.regexSearch("(stanford|berkeley).edu")" to "val articleIdsRDD3= succinctWikiKV2.regexSearch("(stanford|berkeley).edu")"

Nov 10 '15 02:11 tranlm

The line

export TACHYON_JARS="$TACHYON_HOME/../lib/tachyon-assemblies-${VERSION}-jar-with-dependencies.jar"

is an edit to a line in spark/tachyon/libexec/tachyon-config.sh

Nov 10 '15 03:11 gostevehoward