Formatting the Wikipedia dataset for GraphX PageRank
I couldn't find a link to the wiki dataset that was used, so I downloaded a dump from Wikimedia. When I run PageRank I get weird page titles, though, so partway through the code I checked which titles were being used. Is this normal?
(Also: where can I find the proper dataset used by the AMPLab?)
scala> vertices.take(50)
res14: Array[(org.apache.spark.graphx.VertexId, String)] = Array((0,""), (0,""), (0,""), (1728454431,* Toby, Marlene. ''A.A. Milne, Author of Winnie-the-Pooh''. Chicago: Childrens Press, 1995. ISBN), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (124,|), (103299066,|honorific-prefix), (117890311,|name), (191986873,|honorific-suffix), (-644639137,|image), (1667503647,|order1), (-188590855,|office1), (1077807430,|term_start1), (-866339411,|term_end1), (-1122312309,|monarch1), (1499463684,|governor-general1), (-1568689980,|predecessor1), (-217292473,|successor1), (1685055658,|birth_date), (708509579,|birth_place), (368907285,|death...
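The IDs above look like plain `String.hashCode` values of whatever string ended up in the title slot: the empty string hashes to 0 and `|` hashes to 124. That would mean each raw wikitext line is being treated as a page title. Below is a minimal sketch to check this without Spark. It assumes the vertex ID is the hash of the lowercased title with spaces removed. The `pageHash` name and its exact normalisation are my assumption, not necessarily what the tutorial code does.

```scala
object TitleHashCheck {
  // Assumed ID scheme (hypothetical helper): lowercase the title,
  // drop spaces, then widen String.hashCode to a Long.
  def pageHash(title: String): Long =
    title.toLowerCase.replace(" ", "").hashCode.toLong

  def main(args: Array[String]): Unit = {
    // Strings taken from the vertices.take(50) output above.
    val samples = Seq("", "|", "|name")
    samples.foreach { s =>
      println(s"${pageHash(s)} <- '$s'")
    }
    // Prints:
    // 0 <- ''
    // 124 <- '|'
    // 117890311 <- '|name'
  }
}
```

These three values match the `(0,"")`, `(124,|)` and `(117890311,|name)` pairs in the REPL output, so the "titles" seem to be individual lines of article markup rather than real page titles.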
Final output:
printing the top 10 ranked pages:
''7.08: 0.15
color:oligocene bar:NAM21 from:: 0.15
|url = http://books.google.ca/books?id=aQ84ViBNkYwC&lpg=PR1&dq=Michael%20Jordan&pg=PR1#v=onepage&q&f=true|publisher=Greenwood Press |isbn=: 0.15
*Twelve Foot Change: 0.15
''(0.02/.08): 0.15
In mammals and birds, sleep is divided into two broad types: [[rapid eye movement sleep|rapid eye movement]](REM sleep) and [[non-rapid eye movement sleep|non-rapid eye movement]](NREM or non-REM sleep). Each type has a distinct set of physiological and neurological features associated with it. REM sleep is associated with the capability of dreaming.<ref name="National">{National Institute of Neurological Disorders and Stroke. (21 May 2007). Brain basics:: 0.15
*2035: 0.15
| commands: 0.15
| 39411: 0.15
QJT 2½: 0.15
printing the most important page within the subgraph of Wikipedia that mentions Berkeley in the title: