Alex comments

Results: 239 comments of Alex

Run Gemini file-level duplicate detection on PGA

blocked by src-d/backlog#1266

Run Gemini file-level duplicate detection on PGA

- FS are [running](http://127.0.0.1:8001/api/v1/proxy/namespaces/kube-system/services/kubernetes-dashboard/#/pod/feature-extractor/fe-spark-spark-worker-748277027-1fsvk?namespace=feature-extractor) - `pga get` second round, ETA `8h52m56`

Run Gemini file-level duplicate detection on PGA

Full PGA was downloaded to HDFS 🎉 https://github.com/src-d/datasets/issues/53#issuecomment-396528917

```
$ zgrep -o "[0-9a-z]*\.siva" ~/.pga/latest.csv.gz | sort | uniq | wc -l
239807
$ hdfs dfs -ls -R hdfs://hdfs-namenode/pga/siva/latest | grep...
```

Run Gemini file-level duplicate detection on PGA

Plan

1. WIP: run the latest Gemini Hash w/ the latest Engine 0.6.3 on a single shard (~10 GB, ~1000 repos, ~1/250 of the whole) using the current staging pipeline cluster configuration
   - [x] Executor page...

Run Gemini file-level duplicate detection on PGA

Blocked, as all Feature Extractors deployed under https://github.com/src-d/issues-infrastructure/issues/184 are part of a new, separate Apache Spark cluster in a different k8s namespace (`-n feature-extractor`) that does not seem to have...

Run Gemini file-level duplicate detection on PGA

**Hash** has finished successfully; I'm now submitting the PRs to Gemini that enabled it.

**Report** is
- `cc.makeBuckets()` 40min
- `Report.findConnectedComponents()` ~6h

Run Gemini file-level duplicate detection on PGA

1h for hashing ~1/250 of PGA on 3 machines of the pipeline staging cluster

### Configuration
```
--conf "spark.executor.memory=16g" \
--conf "spark.local.dir=/spark-temp-data" \
--conf "spark.executor.extraJavaOptions='-Djava.io.tmpdir=/spark-temp-data -Dlog4j.configuration=log4j.properties'" \
--conf "spark.driver.memory=8g" \...
```

Run Gemini file-level duplicate detection on PGA

> Question: how are we sampling the repos for each of these tests?

Good question. We have always used just a single shard of the PGA dataset - all the repos,...
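To make the single-shard sampling concrete, here is a minimal sketch. It assumes, hypothetically, that a shard is the set of `.siva` files whose names share a two-character prefix; 256 such prefixes would be roughly consistent with the "~1/250 of whole" figure from the plan, but the thread does not confirm this layout. `shard_files` is an illustrative helper, not part of `pga` or Gemini.

```shell
#!/bin/sh
# Hypothetical helper (not part of pga or Gemini): read .siva file names
# from stdin, one per line, and print the unique names in one shard.
# Assumption: a shard is identified by a two-character file name prefix.
shard_files() {
  prefix="$1"
  grep -E "^${prefix}[0-9a-z]*\.siva$" | sort -u
}

# Possible usage with the index listing shown earlier in the thread:
#   zgrep -o "[0-9a-z]*\.siva" ~/.pga/latest.csv.gz | shard_files 00 | wc -l
```

Reading names from stdin keeps the helper independent of where the listing comes from (the local index or an `hdfs dfs -ls -R` listing reduced to file names).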

Run Gemini file-level duplicate detection on PGA

- Local: 1 MB, 30k features
- Cluster: 170 MB, 5.5mil features

### DataFrame
local: 8sec, cluster: 4sec

```scala
val freqDf = features.withColumnRenamed("_1", "feature").withColumnRenamed("_2", "doc")
  .select("feature", "doc")
  .distinct
  .groupBy("feature")
  .agg(count("*").alias("cnt"))
  .map(row => (row.getAs[String]("feature"), row.getAs[Long]("cnt")))...
```

Run Gemini file-level duplicate detection on PGA

There are 141 .siva files bigger than 1 GB, with the remaining 260+k being smaller. Those outliers can be moved to get a shorter tail of task execution time on average.

```
hdfs...
```