Youngbin Kim
@kismsu we don't yet officially support Koalas with DB Connect (for any version). It seems like this particular issue might be avoided with 7.1, but it could have other issues...
Pull request (https://github.com/lintool/warcbase/pull/230)

**Usage example:**

``` scala
import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.RecordLoader

val recs = RecordLoader.loadArchives("/collections/webarchives/geocities/warcs/", sc)
  .keepUrlPatterns(Set("http://geocities.com/EnchantedForest/.*".r))

val clusters = ExtractClusters(recs, sc)
  .topNWords("GEO_ENCHANTED_FOREST_TOP_N", sc)
  .computeLDA("GEO_ENCHANTED_FOREST_LDA", sc)
  .saveSampleDocs("GEO_ENCHANTED_FOREST_LDA", sc)
```

**APIs:** ``` scala...
The errors were deserialization errors. It's odd that they occurred, because I have seen these errors before and fixed them. I've tried to run it again a couple of times,...
PR: https://github.com/lintool/warcbase/pull/237

UDF for computing the MD5 checksum => ComputeChecksum.get(url: String, timeoutVal: Int = 5000, removeIconImage: Boolean = false, minWidth: Int = 30, minHeight: Int = 30)

```
ComputeChecksum.get("https://avatars1.githubusercontent.com/u/7608739?v=3&s=96")
```

...
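For anyone curious about the checksum step itself, here is a minimal sketch using only the JDK. It is not the code from the PR: `md5Hex` is a hypothetical helper name, and it assumes the image bytes have already been fetched, so it skips the timeout and icon-size filtering that the `ComputeChecksum.get` parameters suggest.

```scala
import java.security.MessageDigest

// Hypothetical helper (not part of warcbase): hex-encoded MD5 digest of raw bytes.
def md5Hex(bytes: Array[Byte]): String =
  MessageDigest.getInstance("MD5")
    .digest(bytes)                       // 16-byte digest
    .map(b => "%02x".format(b & 0xff))   // mask to 0..255, two hex digits per byte
    .mkString

// "abc" is one of the RFC 1321 test vectors.
println(md5Hex("abc".getBytes("UTF-8"))) // prints 900150983cd24fb0d6963f7d28e17f72
```

Two images with identical bytes get the same digest, which is what makes the checksum usable as a key for grouping duplicate images.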
For now, please use

```
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.RecordLoader

val r = RecordLoader.loadArchives("/collections/webarchives/geocities/warcs/*.warc.gz", sc).persist()
val arr = ExtractPopularImages(r, 2000)
sc.parallelize(arr.map(x => x._2._2 + "\t" + x._2._3), 1).saveAsTextFile("2000-Popular-Images-Geocities13")
```

Thanks.
The output type has been changed to an RDD (https://github.com/lintool/warcbase/pull/241). The new script is:

```
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.RecordLoader

val r = RecordLoader.loadArchives("/collections/webarchives/geocities/warcs/*.warc.gz", sc).persist()
ExtractPopularImages(r, 2000, sc).saveAsTextFile("2000-Popular-Images-Geocities14")
```

And for subsets,...