
Not enough heap space with XMLSource

jeschkies opened this issue on Dec 11 '13 · 7 comments

Hi,

I use the dbpedia extraction framework to extract link and category information. I have this:

val source = XMLSource.fromFile(new File("enwiki-latest-pages-articles.xml"), Language.English)
source.toIterable
      .zipWithIndex
      .map { case (page: WikiPage, i: Int) =>
        if (i % 1000 == 0) { println(i) }
        val p = parser(page)
        (Article(p.id, p.title.decoded, p.toPlainText),
         p.children flatMap extractCategories(p) _,
         p.children flatMap extractLinks(p) _)
      }
      .grouped(2000)
      .foreach {
        batch: Iterable[(Article, List[Category], List[Link])] =>
          // Save each batch in one transaction
          database withTransaction {
            implicit session =>
              val u = batch.unzip3
              articles ++= u._1
              categories ++= u._2.flatten
              links ++= u._3.flatten
          }
      }

I thought that XMLSource would load/parse the dump lazily. Is that correct? I have -Xmx set to 6G.

Thx, Karsten

jeschkies · Dec 11 '13, 11:12

I think it's because of zipWithIndex [1].

I would simplify it to something like:

val parser = WikiParser.getInstance()
var i = 0
for (page <- source.map(parser)) {
  ...
  i += 1
}

[1] http://daily-scala.blogspot.gr/2010/05/zipwithindex.html

jimkont · Dec 11 '13, 11:12

Hm, but toIterable.zipWithIndex should just create another Iterable, so there should not be much additional memory usage.

I need to use grouped etc. to save to the database in batches. I also tried source.map(parser), but I ran out of memory even for just a tiny fraction of the dump.

Your suggestion is okay but not really Scala-like ; ). There should be a way to use collection operations without materializing everything in memory.

jeschkies · Dec 11 '13, 12:12

Okay, I dug a little into the code. So far there is hardly any way to build an iterator. XMLSource calls WikipediaDumpParser, which receives a function and applies it to each page. That is why map, for example, works fine but grouped is not available. For grouped etc. to work we would need an iterator that yields a page instead of applying a function.

toIterator etc. are provided, but they traverse the complete dump before it can be processed.
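
For illustration (a minimal standalone sketch with hypothetical stand-ins, not the framework's actual classes, assuming the pre-2.13 collection library the framework uses): a callback-driven source only really supports Traversable's foreach, which is enough for map but offers nothing that grouped could pull from:

// Hypothetical stand-in for a push-based source like XMLSource/WikipediaDumpParser.
// It can only implement foreach: the parser drives the loop and invokes the
// callback once per page, so there is nothing to "pull" from.
class CallbackSource(pages: Seq[String]) extends Traversable[String] {
  override def foreach[U](f: String => U): Unit =
    pages.foreach(f) // stand-in for the dump parser applying f to each page
}

val src = new CallbackSource(Seq("page1", "page2", "page3"))
src.map(_.length) // fine: map is built on top of foreach
// src.grouped(2) // does not compile: grouped needs an Iterable/iterator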

So there are two options I see:

  1. Stick with foreach, batch pages manually by building up a list, and push each batch to the database. That is okay but not very elegant.
  2. Have a refactoring of XMLSource / WikipediaDumpParser that yields pages.

What do you think?

EDIT The first option works:

var batch = List[(Article, List[Category], List[Link])]()
source foreach {
  unparsedPage: WikiPage =>
    val p = parser(unparsedPage)
    batch = (Article(p.id, p.title.decoded, p.toPlainText),
             p.children flatMap extractCategories(p) _,
             p.children flatMap extractLinks(p) _) :: batch

    if (batch.length >= 100) {
      database withTransaction {
        implicit session =>
          println("Save batch to database...")
          val u = batch.unzip3
          articles ++= u._1
          categories ++= u._2.flatten
          links ++= u._3.flatten
      }
      batch = List[(Article, List[Category], List[Link])]()
    }
}

However, this is not as flexible as an iterator would be.

jeschkies · Dec 11 '13, 14:12

  2. Have a refactoring of XMLSource / WikipediaDumpParser that yields pages.

What would that mean exactly? How many methods would we have to override in XMLSource?

We could probably make WikipediaDumpParser implement a next() method instead of taking a callback. I don't know how much work that would be. Maybe just a little, maybe quite a lot. (Implementation detail: having a next() method that simply returns null when there is no more data often leads to cleaner code in the implementing class than the hasNext()/next() API, which is nicer for client classes. It's easy to wrap the former in the latter if necessary.)
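
A minimal sketch of that wrapping (DumpParser and nextPage() are hypothetical stand-ins for whatever the refactored WikipediaDumpParser would expose, not the framework's actual API):

// Hypothetical stand-ins; the real classes live in the extraction framework.
class WikiPage
trait DumpParser { def nextPage(): WikiPage } // returns null when the dump is exhausted

// Wraps the null-returning next() style in Scala's hasNext/next Iterator API.
class PageIterator(parser: DumpParser) extends Iterator[WikiPage] {
  private var lookahead: WikiPage = null

  def hasNext: Boolean = {
    if (lookahead == null) lookahead = parser.nextPage() // pull one page ahead
    lookahead != null
  }

  def next(): WikiPage = {
    if (!hasNext) throw new NoSuchElementException("no more pages")
    val page = lookahead
    lookahead = null
    page
  }
}

That would keep the implementing class simple while giving client code the standard Iterator methods (map, grouped, and so on) for free.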

jcsahnwaldt · Dec 11 '13, 15:12

If I am not mistaken, Source or XMLSource should extend Iterable and not Traversable:

XMLSource(...) extends Iterable[WikiPage] {
  val wikiParser = new WikipediaDumpParser(....)
  def iterator = new Iterator[WikiPage] {
    def hasNext = ....
    def next: WikiPage = wikiParser.nextPage()
  }
}

Something like this should work. I am new to Scala, so I am not sure whether calls such as XMLSource(..).map(parser).zipWithIndex.grouped.unzip will be lazy, but they should be.
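
(For what it's worth, a quick standalone check, unrelated to the framework: map on a plain Iterable is eager, but pulling through .iterator or .view keeps the chain lazy, so grouped batches are only built as they are consumed:)

// Standalone check: .iterator keeps the pipeline lazy, so elements are only
// processed when each batch is consumed, not up front.
val pages: Iterable[Int] = 1 to 10
pages.iterator
  .map { i => println(s"parsing $i"); i } // printed batch by batch
  .zipWithIndex
  .grouped(3)
  .foreach(batch => println(s"batch: $batch"))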

jeschkies · Dec 11 '13, 16:12

It can extend both. Can you try this and, if it works, submit a pull request?

jimkont · Dec 11 '13, 17:12

This is what I have now.

So far there has to be a cast:

source.asInstanceOf[XMLReaderSource]
      .iterable
      .view
      .flatMap { (p: WikiPage) =>
        try Some(parser(p))
        catch {
          case _: Throwable =>
            println(s"Could not parse ${p.title.decoded}")
            None
        }
      }
      .zipWithIndex
      .map { case (p: PageNode, i: Int) =>
        if (i % 1000 == 0) { println(i) }
        ...
      }
      .grouped(5000)
      .foreach { ... }

Note that the view keeps the chain lazy, so memory usage no longer depends on the size of the dump.

jeschkies · Dec 19 '13, 14:12