ampcamp6: parquet read & tachyon config bugs

Open: tranlm opened this issue 9 years ago • 1 comment

While running through the exercise code, I found the following issues:

Data Exploration using Spark SQL page:

  1. "parquetFile" has been deprecated; the exercise code should instead read: wikiData = sqlCtx.read.parquet("data/wiki_parquet")
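
For anyone running the exercise against more than one Spark version, a small helper could pick whichever call the context supports. This is only a minimal sketch and not part of the exercise code: `load_parquet` is a hypothetical name, and it assumes `sqlCtx` is the SQLContext that the pyspark shell provides.

```python
def load_parquet(sql_ctx, path):
    """Load a Parquet dataset, preferring the non-deprecated reader API.

    `load_parquet` is a hypothetical helper, not something Spark provides.
    """
    reader = getattr(sql_ctx, "read", None)
    if reader is not None:
        # Newer contexts expose a reader object as `sql_ctx.read`.
        return reader.parquet(path)
    # Older contexts only have the deprecated method.
    return sql_ctx.parquetFile(path)


# Usage in the pyspark shell (assumes `sqlCtx` already exists there):
#   wikiData = load_parquet(sqlCtx, "data/wiki_parquet")
```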

Explore In-Memory Data Store Tachyon page:

  1. the "tachyon" folder is now a subfolder of spark
  2. TACHYON_WORKER_MEMORY_SIZE is already set at 1GB
  3. Formatting the storage with the command "tachyon format" fails because class tachyon.Format cannot be found. To fix it, set:
  export TACHYON_JARS="$TACHYON_HOME/../lib/tachyon-assemblies-${VERSION}-jar-with-dependencies.jar"
  4. The command "tachyon runTests" fails all the tests
  5. In the section "Run Spark on Tachyon", the command "./bin/spark-shell" is Scala-specific. It should be generalized for users of other languages, e.g. Python

Querying compressed RDDs with Succinct Spark page:

  1. Correct "articleIds.count" to say "articleIdsRDD.count"
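  2. "val succinctWikiKV = wikiKV.map(t => (t._1, t._2.getBytes).succinctKV" is missing a closing parenthesis ")" before ".succinctKV"; it should read "val succinctWikiKV = wikiKV.map(t => (t._1, t._2.getBytes)).succinctKV".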
  3. Should combine
val wikiKV2 = sc.textFile("data/succinct/wiki-large.txt")
    .map(_.split('|'))
    .map(t => (t(0).toLong, t(1)))

into one line

val wikiKV2 = sc.textFile("data/succinct/wiki-large.txt").map(_.split('|')).map(t => (t(0).toLong, t(1)))
  4. Change
val wikiSuccinctKV2 = sc.succinctKV[Long]("data/succinct/succinct-wiki-large")
wikiSuccinctKV2.count

to

val succinctWikiKV2 = sc.succinctKV[Long]("data/succinct/succinct-wiki-large")
succinctWikiKV2.count
  5. Change "val articleIdsRDD3= succinctWikiKV3.regexSearch("(stanford|berkeley).edu")" to "val articleIdsRDD3= succinctWikiKV2.regexSearch("(stanford|berkeley).edu")"
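
For readers following along in Python rather than Scala, the chained transformation in item 3 does roughly the following to each input line. This is a plain-Python sketch under the assumption that every line of wiki-large.txt has the form `id|text`; `parse_wiki_line` is a hypothetical name, not part of Succinct Spark.

```python
def parse_wiki_line(line):
    """Split one 'id|text' line into an (int id, text) pair.

    Mirrors .map(_.split('|')).map(t => (t(0).toLong, t(1))) from item 3:
    only the first two '|'-separated fields are kept.
    """
    fields = line.split("|")
    return (int(fields[0]), fields[1])


# In pyspark this could be applied as (assumes `sc` from the shell):
#   wikiKV2 = sc.textFile("data/succinct/wiki-large.txt").map(parse_wiki_line)
```

As in the Scala version, anything after a second "|" on a line is dropped.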

tranlm, Nov 10 '15 02:11

The line

export TACHYON_JARS="$TACHYON_HOME/../lib/tachyon-assemblies-${VERSION}-jar-with-dependencies.jar"

is an edit to a line in spark/tachyon/libexec/tachyon-config.sh
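
If you would rather script that edit than make it by hand, something like the sketch below could work. It only transforms the file's text; `patch_tachyon_jars` is a hypothetical helper, and it assumes the existing line in tachyon-config.sh begins with `export TACHYON_JARS=`.

```python
import re

# Replacement line quoted from this issue. $TACHYON_HOME and ${VERSION} are
# left unexpanded so the shell resolves them when the config is sourced.
FIXED_LINE = (
    'export TACHYON_JARS="$TACHYON_HOME/../lib/'
    'tachyon-assemblies-${VERSION}-jar-with-dependencies.jar"'
)

# Assumption: the line to replace starts with `export TACHYON_JARS=`.
_JARS_LINE = re.compile(r"^([ \t]*)export[ \t]+TACHYON_JARS=.*$", re.MULTILINE)


def patch_tachyon_jars(config_text):
    """Return config_text with each TACHYON_JARS export line replaced."""
    # A function replacement keeps the new line literal and preserves indentation.
    return _JARS_LINE.sub(lambda m: m.group(1) + FIXED_LINE, config_text)


# Usage (file path taken from the comment above):
#   path = "spark/tachyon/libexec/tachyon-config.sh"
#   with open(path) as f:
#       text = f.read()
#   with open(path, "w") as f:
#       f.write(patch_tachyon_jars(text))
```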

gostevehoward, Nov 10 '15 03:11