                        Not enough heap space with XMLSource
Hi,
I use the DBpedia extraction framework to extract link and category information. I have this:
val source = XMLSource.fromFile(new File("enwiki-latest-pages-articles.xml"), Language.English)
source.toIterable
  .zipWithIndex
  .map { case (page: WikiPage, i: Int) =>
    if (i % 1000 == 0) { println(i) }
    val p = parser(page)
    (Article(p.id, p.title.decoded, p.toPlainText),
     p.children flatMap extractCategories(p) _,
     p.children flatMap extractLinks(p) _)
  }
  .grouped(2000)
  .foreach { batch: Iterable[(Article, List[Category], List[Link])] =>
    // Save each batch in one transaction
    database withTransaction { implicit session =>
      val u = batch.unzip3
      articles ++= u._1
      categories ++= u._2.flatten
      links ++= u._3.flatten
    }
  }
I thought that XMLSource would load/parse the dump lazily. Is that correct? I have -Xmx set to 6G.
Thx, Karsten
I think it is because of zipWithIndex [1]
I would simplify it to something like:
val parser = WikiParser.getInstance()
var i = 0
for (page <- source.map(parser)) {
  ...
  i += 1
}
[1] http://daily-scala.blogspot.gr/2010/05/zipwithindex.html
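For reference, this matches the standard library's behavior: on a strict collection, zipWithIndex materializes a whole new collection of pairs, while on an Iterator it stays lazy.
val strict = List(10, 20, 30).zipWithIndex
// builds List((10,0), (20,1), (30,2)) up front
val lazily = Iterator(10, 20, 30).zipWithIndex
// nothing is materialized until the iterator is consumed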
Hm, but toIterable.zipWithIndex should create just another Iterable, so there should not be much more memory usage.
I need to use grouped etc. to save to the database in batches. I also had source.map(parser), but I ran out of memory even for just a tiny fraction of the dump.
Your suggestion is okay but not really Scala-like ;). There should be a way to use collection operations without allocating intermediate collections.
Okay, I dug a little into the code. So far there is hardly any way to build an iterator: XMLSource calls WikipediaDumpParser, which receives a function and applies it to each page. That is why map, for example, works fine but grouped is not possible. For grouped etc. to work we would need an iterator that yields a page instead of applying a function.
toIterator etc. is provided, but it traverses the complete dump before the result can be handled.
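To make the distinction concrete, here is a minimal sketch of the two styles (the trait names and the WikiPage type parameter are illustrative, not the framework's actual API):
// Push style: the source drives iteration by applying a callback to each
// page; the caller never holds an iterator, so grouped/zipWithIndex cannot
// be applied lazily.
trait PushSource[WikiPage] {
  def foreach(proc: WikiPage => Unit): Unit
}

// Pull style: the caller asks for pages one at a time, which is exactly
// what lazy collection operations need.
trait PullSource[WikiPage] {
  def hasNext: Boolean
  def next(): WikiPage
}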
So there are two options I see:
- Stick with foreach and group by creating a new list and then pushing it to the database. That is okay but not very elegant.
- Refactor XMLSource/WikipediaDumpParser so that it yields pages.
What do you think?
EDIT: The first option works:
var batch = List[(Article, List[Category], List[Link])]()
source foreach { unparsedPage: WikiPage =>
  val p = parser(unparsedPage)
  batch = (Article(p.id, p.title.decoded, p.toPlainText),
           p.children flatMap extractCategories(p) _,
           p.children flatMap extractLinks(p) _) :: batch
  if (batch.length >= 100) {
    database withTransaction { implicit session =>
      println("Save batch to database...")
      val u = batch.unzip3
      articles ++= u._1
      categories ++= u._2.flatten
      links ++= u._3.flatten
    }
    batch = List[(Article, List[Category], List[Link])]()
  }
}
However, this is not as flexible as an iterator would be.
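One small caveat with this workaround, grounded in the snippet above: when the loop finishes, any pages still sitting in batch (fewer than 100) have not been written yet, so a final flush of the leftover partial batch is needed, e.g.:
// Flush the final partial batch (same tables and database as above).
if (batch.nonEmpty) {
  database withTransaction { implicit session =>
    val u = batch.unzip3
    articles ++= u._1
    categories ++= u._2.flatten
    links ++= u._3.flatten
  }
}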
- Refactor XMLSource/WikipediaDumpParser so that it yields pages.
 
What would that mean exactly? How many methods would we have to override in XMLSource?
We could probably make WikipediaDumpParser implement a next() method instead of taking a callback. I don't know how much work that would be. Maybe just a little, maybe quite a lot. (Implementation detail: having a next() method that simply returns null when there is no more data often leads to cleaner code in the implementing class than the hasNext()/next() API, which is nicer for client classes. It's easy to wrap the former in the latter if necessary.)
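As a sketch of that wrapping step (nextPage() is hypothetical here; the thread only proposes such a method), a null-terminated next() turns into Scala's hasNext/next Iterator in one line:
// Hypothetical API: parser.nextPage() returns the next WikiPage,
// or null once the dump is exhausted.
def pages(parser: WikipediaDumpParser): Iterator[WikiPage] =
  Iterator.continually(parser.nextPage()).takeWhile(_ != null)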
If I am not mistaken, Source or XMLSource should extend Iterable and not Traversable:
XMLSource(...) extends Iterable[WikiPage] {
  val wikiParser = new WikipediaDumpParser(....)
  def iterator = new Iterator[WikiPage] {
    def hasNext = ....
    def next: WikiPage = wikiParser.nextPage()
  }
}
Something like this should work. I am new to Scala, so I am not sure if calls such as XMLSource(..).map(parser).zipWithIndex.grouped.unzip will be lazy, but they should be.
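For the record (standard pre-2.13 collection semantics, with a plain List standing in for XMLSource): map on an Iterable is strict, so the chain only stays lazy if it goes through a view or an iterator first:
val pages = List("a", "b", "c")
pages.map(_.toUpperCase)                      // strict: builds a new List immediately
pages.view.map(_.toUpperCase)                 // lazy: nothing computed yet
pages.iterator.map(_.toUpperCase).grouped(2)  // lazy; grouped works on an Iterator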
It can extend both. Can you try this and, if it works, submit a pull request?
This is what I have now.
So far there has to be a cast:
source.asInstanceOf[XMLReaderSource]
  .iterable
  .view
  .flatMap { (p: WikiPage) =>
    try Some(parser(p))
    catch {
      case _: Throwable =>
        println(s"Could not parse ${p.title.decoded}")
        None
    }
  }
  .zipWithIndex
  .map { case (p: PageNode, i: Int) =>
    if (i % 1000 == 0) { println(i) }
    ...
  }
  .grouped(5000)
  .foreach { ... }
Note that view makes the chain lazy, so memory usage is independent of the dump size.