training
ampcamp6: parquet read & tachyon config bugs
While running through the exercise code, I found the following issues:
Data Exploration using Spark SQL page:
- "parquetFile" has been deprecated; the code should be changed to wikiData = sqlCtx.read.parquet("data/wiki_parquet")
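If the exercise has to keep working on both older and newer Spark builds, the change could be wrapped roughly like this. It is only a sketch: `load_parquet` is a hypothetical helper name that is not part of the exercise code, and it assumes the context object is the `sqlCtx` the exercise already creates.

```python
def load_parquet(sql_ctx, path):
    """Load a Parquet dataset, preferring the non-deprecated reader API.

    Hypothetical helper for illustration; not part of the exercise code.
    """
    reader = getattr(sql_ctx, "read", None)
    if reader is not None:
        # Newer contexts expose a reader object with a parquet() method.
        return reader.parquet(path)
    # Contexts without .read only offer the deprecated parquetFile().
    return sql_ctx.parquetFile(path)
```

In the exercise this would be called as wikiData = load_parquet(sqlCtx, "data/wiki_parquet").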
Explore In-Memory Data Store Tachyon page:
- the "tachyon" folder is now a subfolder of spark
- TACHYON_WORKER_MEMORY_SIZE is already set at 1GB
- Formatting the storage with the command "tachyon format" fails because class tachyon.Format cannot be found. To fix this, set:
export TACHYON_JARS="$TACHYON_HOME/../lib/tachyon-assemblies-${VERSION}-jar-with-dependencies.jar"
- the command "tachyon runTests" fails all the tests
- In the section "Run Spark on Tachyon", the command "./bin/spark-shell" only applies to Scala. It should be generalized for users of other languages, e.g. "./bin/pyspark" for Python
Querying compressed RDDs with Succinct Spark page:
- Correct "articleIds.count" to say "articleIdsRDD.count"
- "val succinctWikiKV = wikiKV.map(t => (t._1, t._2.getBytes).succinctKV" is missing a closing parenthesis; it should read "val succinctWikiKV = wikiKV.map(t => (t._1, t._2.getBytes)).succinctKV".
- Should combine
val wikiKV2 = sc.textFile("data/succinct/wiki-large.txt")
.map(_.split('|'))
.map(t => (t(0).toLong, t(1)))
into one line
val wikiKV2 = sc.textFile("data/succinct/wiki-large.txt").map(_.split('|')).map(t => (t(0).toLong, t(1)))
- Change
val wikiSuccinctKV2 = sc.succinctKV[Long]("data/succinct/succinct-wiki-large")
wikiSuccinctKV2.count
to
val succinctWikiKV2 = sc.succinctKV[Long]("data/succinct/succinct-wiki-large")
succinctWikiKV2.count
- Change "val articleIdsRDD3= succinctWikiKV3.regexSearch("(stanford|berkeley).edu")" to "val articleIdsRDD3= succinctWikiKV2.regexSearch("(stanford|berkeley).edu")"
To clarify the fix for "tachyon format": the line
export TACHYON_JARS="$TACHYON_HOME/../lib/tachyon-assemblies-${VERSION}-jar-with-dependencies.jar"
is an edit to an existing line in spark/tachyon/libexec/tachyon-config.sh
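To double-check which file that line points at before re-running "tachyon format", the path can be resolved by hand. A minimal Python sketch, assuming TACHYON_HOME and VERSION hold the same values tachyon-config.sh uses; `tachyon_jar_path` is a hypothetical helper, and the fallback directory and version below are placeholders rather than values taken from the exercise.

```python
import os


def tachyon_jar_path(tachyon_home, version):
    """Resolve the assembly jar path that the TACHYON_JARS line refers to."""
    jar_name = "tachyon-assemblies-{}-jar-with-dependencies.jar".format(version)
    # Mirrors $TACHYON_HOME/../lib/<jar>, with the ".." collapsed for readability.
    return os.path.normpath(os.path.join(tachyon_home, "..", "lib", jar_name))


if __name__ == "__main__":
    # Placeholder fallbacks are used when the variables are not set.
    home = os.environ.get("TACHYON_HOME", "spark/tachyon")
    version = os.environ.get("VERSION", "X.Y.Z")
    path = tachyon_jar_path(home, version)
    print(path, "(exists)" if os.path.isfile(path) else "(missing)")
```

If the script reports the jar as missing, the lib directory or the version string is the first thing to check.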