extraction-framework
Not enough heap space with XMLSource
Hi,
I use the dbpedia extraction framework to extract link and category information. I have this:
val source = XMLSource.fromFile(new File("enwiki-latest-pages-articles.xml"), Language.English)

source.toIterable
  .zipWithIndex
  .map { case (page: WikiPage, i: Int) =>
    if (i % 1000 == 0) { println(i) }
    val p = parser(page)
    (Article(p.id, p.title.decoded, p.toPlainText),
     p.children flatMap extractCategories(p) _,
     p.children flatMap extractLinks(p) _)
  }
  .grouped(2000)
  .foreach { batch: Iterable[(Article, List[Category], List[Link])] =>
    // Save each batch in one transaction
    database withTransaction { implicit session =>
      val u = batch.unzip3
      articles ++= u._1
      categories ++= u._2.flatten
      links ++= u._3.flatten
    }
  }
I thought that XMLSource would load/parse the dump lazily. Is that correct? I have -Xmx set to 6G.
Thx, Karsten
I think it is because of zipWithIndex [1].
I would simplify it to something like:

val parser = WikiParser.getInstance()
var i = 0
for (page <- source.map(parser)) {
  ...
  i += 1
}
[1] http://daily-scala.blogspot.gr/2010/05/zipwithindex.html
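For reference, zipWithIndex on a strict Scala 2.x collection does build a complete second collection of pairs, while the same call on an Iterator is lazy. A quick illustration with plain collections (not framework code):

val xs = (1 to 1000000).toList

// Strict: allocates a second million-element list of (value, index) pairs.
val eager = xs.zipWithIndex

// Lazy: pairs are produced only as the iterator is consumed.
val pairs = xs.iterator.zipWithIndex
pairs.take(3).foreach(println) // computes just the first three pairs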
Hm, but toIterable.zipWithIndex should just create another Iterable, so there should not be much more memory usage.
I need to use grouped etc. to save to the database in batches. I also tried source.map(parser), but I ran out of memory even for just a tiny fraction of the dump.
Your suggestion is okay but not really Scala-like ;). There should be a way to use collection operations without allocating extra memory.
Okay, I dug a little into the code. As it stands there is hardly any way to build an iterator: XMLSource calls WikipediaDumpParser, which receives a function and applies it to each page. That is why map, for example, works fine but grouped is not available. For grouped etc. to work we would need an iterator that yields a page instead of applying a function. toIterator etc. is provided, but it traverses the complete dump before anything is handled.
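To make that concrete, here is a hypothetical sketch of the callback-driven pattern just described; CallbackSource and parseDump are placeholder names, not the framework's actual API:

// The parser owns the loop and pushes each page into the callback, so only
// foreach-based operations can stream; there is no handle to pull pages from.
class CallbackSource[Page](parseDump: (Page => Unit) => Unit)
    extends Traversable[Page] {

  override def foreach[U](f: Page => U): Unit = parseDump(p => f(p))
}

// foreach streams page by page, but toIterator on a Traversable first copies
// the entire traversal into a buffer, which is why it exhausts the heap.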
So there are two options I see:
- Stick with foreach and group by collecting pages into a new list, then push each batch to the database. That is okay but not very elegant.
- Refactor XMLSource/WikipediaDumpParser so that it yields pages.
What do you think?
EDIT The first option works:

var batch = List[(Article, List[Category], List[Link])]()

def saveBatch(): Unit = database withTransaction { implicit session =>
  println("Save batch to database...")
  val u = batch.unzip3
  articles ++= u._1
  categories ++= u._2.flatten
  links ++= u._3.flatten
}

source foreach { unparsedPage: WikiPage =>
  val p = parser(unparsedPage)
  batch = (Article(p.id, p.title.decoded, p.toPlainText),
           p.children flatMap extractCategories(p) _,
           p.children flatMap extractLinks(p) _) :: batch
  if (batch.length >= 100) {
    saveBatch()
    batch = List[(Article, List[Category], List[Link])]()
  }
}
// Flush the last partial batch, which the loop above would otherwise drop.
if (batch.nonEmpty) saveBatch()
However, this is not as flexible as an iterator would be.
- Have a refactoring of XMLSource / WikipediaDumpParser that yields pages.
What would that mean exactly? How many methods would we have to override in XMLSource?
We could probably make WikipediaDumpParser implement a next() method instead of taking a callback. I don't know how much work that would be. Maybe just a little, maybe quite a lot. (Implementation detail: having a next() method that simply returns null when there is no more data often leads to cleaner code in the implementing class than the hasNext()/next() API, which is nicer for client classes. It's easy to wrap the former in the latter if necessary.)
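For illustration, a minimal sketch of that wrapping, assuming a hypothetical nextPage() method on WikipediaDumpParser that returns null once the dump is exhausted:

// nextPage() is the proposed method, not existing framework API.
class PageIterator(dumpParser: WikipediaDumpParser) extends Iterator[WikiPage] {
  // Pre-fetch one page so hasNext can be answered without extra state.
  private var buffered: WikiPage = dumpParser.nextPage()

  def hasNext: Boolean = buffered != null

  def next(): WikiPage = {
    if (!hasNext) throw new NoSuchElementException("dump exhausted")
    val page = buffered
    buffered = dumpParser.nextPage()
    page
  }
}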
If I am not mistaken, Source or XMLSource should extend Iterable and not Traversable:

class XMLSource(...) extends Iterable[WikiPage] {
  val wikiParser = new WikipediaDumpParser(....)

  def iterator = new Iterator[WikiPage] {
    def hasNext = ....
    def next: WikiPage = wikiParser.nextPage()
  }
}

Something like this should work. I am new to Scala, so I am not sure if calls such as XMLSource(..).map(parser).zipWithIndex.grouped.unzip will be lazy, but they should be.
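One caveat: on a plain Iterable, map and zipWithIndex are strict in Scala 2.x, so only a chain started with iterator (or view) actually defers work. A quick check with ordinary collections, not framework code:

val xs: Iterable[Int] = 1 to 10

// Strict: each step materializes an intermediate collection.
val strict = xs.map(_ * 2).zipWithIndex

// Lazy: nothing runs until the groups are consumed.
xs.iterator.map(_ * 2).zipWithIndex
  .grouped(3)        // still lazy: Iterator[Seq[(Int, Int)]]
  .foreach(println)  // work happens only here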
It can extend both. Can you try this and, if it works, submit a pull request?
Kontokostas Dimitris
This is what I have now. So far a cast is needed:

source.asInstanceOf[XMLReaderSource]
  .iterable
  .view
  .flatMap { (p: WikiPage) =>
    try Some(parser(p))
    catch {
      case _: Throwable =>
        println(s"Could not parse ${p.title.decoded}")
        None
    }
  }
  .zipWithIndex
  .map { case (p: PageNode, i: Int) =>
    if (i % 1000 == 0) { println(i) }
    ...
  }
  .grouped(5000)
  .foreach { ... }
Note that view makes the chain lazy, so memory usage stays independent of the dump size.