distributed-extraction-framework
Add progress logging at the end of the job to print node-wise statistics
Currently, the final lines of the extraction output look like this:
Jul 09, 2014 5:19:20 AM org.dbpedia.extraction.mappings.DistRedirects$ load
INFO: Will extract redirects from source for li wiki, could not load cache file '/home/nilesh/gsoc14/out10/liwiki/20140410/liwiki-20140410-template-redirects.obj': java.io.FileNotFoundException: File /home/nilesh/gsoc14/out10/liwiki/20140410/liwiki-20140410-template-redirects.obj does not exist
Jul 09, 2014 5:19:20 AM org.dbpedia.extraction.mappings.DistRedirects$ loadFromRDD
INFO: Loading redirects from source (li)
14/07/09 05:19:20 INFO DBpediaJobProgressListener: Started job #0
14/07/09 05:19:20 INFO DBpediaJobProgressListener: Stage #0: Starting stage collectAsMap at DistRedirects.scala:149 with 8 tasks at 00:00.000s
14/07/09 05:19:23 INFO DBpediaJobProgressListener: Stage #0: Started task #0 on host archangel-lapi7, executor 1 at 00:05.815s. Total tasks submitted: 1
14/07/09 05:19:23 INFO DBpediaJobProgressListener: Stage #0: Started task #1 on host archangel-lapi7, executor 1 at 00:05.833s. Total tasks submitted: 2
14/07/09 05:19:23 INFO DBpediaJobProgressListener: Stage #0: Started task #2 on host archangel-lapi7, executor 1 at 00:05.834s. Total tasks submitted: 3
14/07/09 05:19:23 INFO DBpediaJobProgressListener: Stage #0: Started task #3 on host archangel-lapi7, executor 1 at 00:05.834s. Total tasks submitted: 4
14/07/09 05:19:23 INFO DBpediaJobProgressListener: Stage #0: Started task #4 on host archangel-lapi7, executor 1 at 00:05.835s. Total tasks submitted: 5
14/07/09 05:19:23 INFO DBpediaJobProgressListener: Stage #0: Started task #5 on host archangel-lapi7, executor 1 at 00:05.836s. Total tasks submitted: 6
14/07/09 05:19:23 INFO DBpediaJobProgressListener: Stage #0: Started task #6 on host archangel-lapi7, executor 1 at 00:05.837s. Total tasks submitted: 7
14/07/09 05:19:23 INFO DBpediaJobProgressListener: Stage #0: Started task #7 on host archangel-lapi7, executor 1 at 00:05.838s. Total tasks submitted: 8
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Stage #0: Finished task #0 at 00:13.276s. Completed: 1/8
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Stage #0: Finished task #6 at 00:13.416s. Completed: 2/8
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Stage #0: Finished task #1 at 00:13.430s. Completed: 3/8
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Stage #0: Finished task #5 at 00:13.556s. Completed: 4/8
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Stage #0: Finished task #3 at 00:13.567s. Completed: 5/8
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Stage #0: Finished task #7 at 00:13.665s. Completed: 6/8
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Stage #0: Finished task #4 at 00:13.683s. Completed: 7/8
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Stage #0: Finished task #2 at 00:13.791s. Completed: 8/8
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Stage #0: Finished stage collectAsMap at DistRedirects.scala:149 at 00:13.800s
14/07/09 05:19:31 INFO DBpediaJobProgressListener: Finished job #0
Jul 09, 2014 5:19:31 AM org.dbpedia.extraction.mappings.DistRedirects$ loadFromRDD
INFO: Redirects loaded from source (li)
Jul 09, 2014 5:19:31 AM org.dbpedia.extraction.mappings.DistRedirects$ load
INFO: 101 redirects written to cache file /home/nilesh/gsoc14/out10/liwiki/20140410/liwiki-20140410-template-redirects.obj
Jul 09, 2014 5:19:32 AM org.dbpedia.extraction.dump.extract.DistExtractionJob run
INFO: li: 14 extractors (ArticleCategoriesExtractor,ArticleTemplatesExtractor,CategoryLabelExtractor,ExternalLinksExtractor,GeoExtractor,InterLanguageLinksExtractor,LabelExtractor,PageIdExtractor,PageLinksExtractor,RedirectExtractor,RevisionIdExtractor,ProvenanceExtractor,SkosCategoriesExtractor,ArticlePageExtractor), 14 datasets (page_links,revision_ids,page_ids,revision_uris,article_categories,skos_categories,labels,wikipedia_links,external_links,redirects,geo_coordinates,article_templates,category_labels,interlanguage_links) started
Jul 09, 2014 5:19:32 AM org.dbpedia.extraction.dump.extract.DistExtractionJob run
INFO: Writing outputs to destination...
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Started job #1
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Stage #1: Starting stage saveAsNewAPIHadoopFile at DistDeduplicatingWriterDestination.scala:35 with 8 tasks at 00:00.000s
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Stage #1: Started task #8 on host archangel-lapi7, executor 1 at 00:14.756s. Total tasks submitted: 1
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Stage #1: Started task #9 on host archangel-lapi7, executor 1 at 00:14.757s. Total tasks submitted: 2
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Stage #1: Started task #10 on host archangel-lapi7, executor 1 at 00:14.758s. Total tasks submitted: 3
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Stage #1: Started task #11 on host archangel-lapi7, executor 1 at 00:14.759s. Total tasks submitted: 4
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Stage #1: Started task #12 on host archangel-lapi7, executor 1 at 00:14.760s. Total tasks submitted: 5
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Stage #1: Started task #13 on host archangel-lapi7, executor 1 at 00:14.761s. Total tasks submitted: 6
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Stage #1: Started task #14 on host archangel-lapi7, executor 1 at 00:14.762s. Total tasks submitted: 7
14/07/09 05:19:32 INFO DBpediaJobProgressListener: Stage #1: Started task #15 on host archangel-lapi7, executor 1 at 00:14.763s. Total tasks submitted: 8
14/07/09 05:19:39 INFO DBpediaJobProgressListener: Stage #1: Finished task #15 at 00:21.974s. Completed: 1/8
14/07/09 05:19:40 INFO DBpediaJobProgressListener: Stage #1: Finished task #12 at 00:22.097s. Completed: 2/8
14/07/09 05:19:40 INFO DBpediaJobProgressListener: Stage #1: Finished task #11 at 00:22.235s. Completed: 3/8
14/07/09 05:19:40 INFO DBpediaJobProgressListener: Stage #1: Finished task #8 at 00:22.296s. Completed: 4/8
14/07/09 05:19:40 INFO DBpediaJobProgressListener: Stage #1: Finished task #9 at 00:22.774s. Completed: 5/8
14/07/09 05:19:40 INFO DBpediaJobProgressListener: Stage #1: Finished task #14 at 00:22.855s. Completed: 6/8
14/07/09 05:19:40 INFO DBpediaJobProgressListener: Stage #1: Finished task #10 at 00:22.862s. Completed: 7/8
14/07/09 05:19:41 INFO DBpediaJobProgressListener: Stage #1: Finished task #13 at 00:23.052s. Completed: 8/8
14/07/09 05:19:41 INFO DBpediaJobProgressListener: Stage #1: Finished stage saveAsNewAPIHadoopFile at DistDeduplicatingWriterDestination.scala:35 at 00:23.058s
14/07/09 05:19:41 INFO DBpediaJobProgressListener: Finished job #1
li: extracted 16556 pages in 00:08.601s (per page: 0.519510 ms; failed pages: 0).
Jul 09, 2014 5:19:41 AM org.dbpedia.extraction.dump.extract.DistExtractionJob run
INFO: li: 14 extractors (ArticleCategoriesExtractor,ArticleTemplatesExtractor,CategoryLabelExtractor,ExternalLinksExtractor,GeoExtractor,InterLanguageLinksExtractor,LabelExtractor,PageIdExtractor,PageLinksExtractor,RedirectExtractor,RevisionIdExtractor,ProvenanceExtractor,SkosCategoriesExtractor,ArticlePageExtractor), 14 datasets (page_links,revision_ids,page_ids,revision_uris,article_categories,skos_categories,labels,wikipedia_links,external_links,redirects,geo_coordinates,article_templates,category_labels,interlanguage_links) finished
It would be good to also print a per-node summary at the end of the job, with lines like "node X: Y pages written".
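One possible shape for this, as a minimal sketch rather than the framework's actual code: a small tracker that is told about each finished task (host name plus pages written) and prints one line per node once the job ends. `NodeStats`, `NodeStatsTracker`, `taskFinished` and `summaryLines` are hypothetical names invented for illustration. How the per-task page count would be obtained from Spark is left open here, and the numbers in the demo are made up.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of the framework: per-node running totals.
final case class NodeStats(tasks: Int = 0, pages: Long = 0L)

class NodeStatsTracker {
  private val stats = mutable.Map.empty[String, NodeStats]

  /** Record one finished task on `host` and the number of pages it wrote. */
  def taskFinished(host: String, pagesWritten: Long): Unit = synchronized {
    val old = stats.getOrElse(host, NodeStats())
    stats(host) = NodeStats(old.tasks + 1, old.pages + pagesWritten)
  }

  /** One summary line per node, sorted by host name. */
  def summaryLines: Seq[String] = synchronized {
    stats.toSeq.sortBy(_._1).map { case (host, s) =>
      s"node $host: ${s.pages} pages written (tasks: ${s.tasks})"
    }
  }
}

object NodeStatsDemo {
  def main(args: Array[String]): Unit = {
    val tracker = new NodeStatsTracker
    // Illustrative values only; a real listener would call this on task end.
    tracker.taskFinished("archangel-lapi7", 2000L)
    tracker.taskFinished("archangel-lapi7", 2100L)
    // prints: node archangel-lapi7: 4100 pages written (tasks: 2)
    tracker.summaryLines.foreach(println)
  }
}
```

The tracker is kept free of Spark types so it can be unit tested on its own. In the framework it would presumably be fed from the existing progress listener's task-end handling and flushed once after the "Finished job" line.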
@jimkont After the 3 currently pending PRs are merged to master, it'd be great if you could test the framework yourself (I'll update the README right now so that everything is in there). Let me know whether the logging is satisfactory and whether we can close this.