Parsing very slow on larger files

Open ijdickinson opened this issue 15 years ago • 1 comments

I'm reading in a bunch of RDF files, each into its own RdfContext::Graph. The log below shows the timings I'm getting. Small files load just fine; larger files take disproportionately long. One file takes 8.5 minutes to load 38k triples. I'm running on a quad-core 64-bit Ubuntu system with 8 GB of memory and using Ruby 1.9.1, so I don't think the raw performance of the machine is an issue.

log file output:

loading concept definitions...
Initializing coins_concept with target/def/sector.nt ... parsing complete in 0.1s producing 39 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/sector
Initializing coins_concept with target/def/data-type.nt ... parsing complete in 1.6s producing 487 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/data-type
Initializing coins_concept with target/def/programme-admin.nt ... parsing complete in 0.2s producing 47 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/programme-admin
Initializing coins_concept with target/def/cga-body-type.nt ... parsing complete in 0.2s producing 47 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/cga-body-type
Initializing coins_concept with target/def/resource-capital.nt ... parsing complete in 0.1s producing 39 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/resource-capital
Initializing coins_concept with target/def/pesa-transfer.nt ... parsing complete in 0.3s producing 87 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/pesa-transfer
Initializing coins_concept with target/def/account-code.nt ... parsing complete in 20.2s producing 4711 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/account-code
Initializing coins_concept with target/def/estimate-number.nt ... parsing complete in 2.5s producing 503 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/estimate-number
Initializing coins_concept with target/def/cofog.nt ... parsing complete in 4.5s producing 1271 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/cofog
Initializing coins_concept with target/def/department-code.nt ... parsing complete in 3.3s producing 847 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/department-code
Initializing coins_concept with target/def/budget-capital-current.nt ... parsing complete in 0.3s producing 47 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/budget-capital-current
Initializing coins_concept with target/def/request-for-resources-next-year.nt ... parsing complete in 0.2s producing 63 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/request-for-resources-next-year
Initializing coins_concept with target/def/counterparty-code.nt ... parsing complete in 1.7s producing 431 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/counterparty-code
Initializing coins_concept with target/def/pesa-delivery.nt ... parsing complete in 0.1s producing 31 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/pesa-delivery
Initializing coins_concept with target/def/income-category.nt ... parsing complete in 0.5s producing 111 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/income-category
Initializing coins_concept with target/def/estimate-line.nt ... parsing complete in 2.1s producing 615 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/estimate-line
Initializing coins_concept with target/def/programme-object-group-code.nt ... parsing complete in 125.7s producing 15895 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/programme-object-group-code
Initializing coins_concept with target/def/estimates-aina.nt ... parsing complete in 0.1s producing 39 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/estimates-aina
Initializing coins_concept with target/def/estimates-capital-current.nt ... parsing complete in 2.1s producing 63 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/estimates-capital-current
Initializing coins_concept with target/def/activity-code.nt ... parsing complete in 6.0s producing 1375 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/activity-code
Initializing coins_concept with target/def/estimate-number-next-year.nt ... parsing complete in 2.4s producing 503 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/estimate-number-next-year
Initializing coins_concept with target/def/accounting-authority.nt ... parsing complete in 0.9s producing 159 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/accounting-authority
Initializing coins_concept with target/def/pesa-current-grants.nt ... parsing complete in 1.0s producing 215 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/pesa-current-grants
Initializing coins_concept with target/def/estimate-line-next-year.nt ... parsing complete in 2.8s producing 615 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/estimate-line-next-year
Initializing coins_concept with target/def/request-for-resources.nt ... parsing complete in 0.2s producing 63 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/request-for-resources
Initializing coins_concept with target/def/pesa-services.nt ... parsing complete in 0.4s producing 39 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/pesa-services
Initializing coins_concept with target/def/estimate-line-last-year.nt ... parsing complete in 2.6s producing 575 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/estimate-line-last-year
Initializing coins_concept with target/def/nac.nt ... parsing complete in 4.0s producing 951 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/nac
Initializing coins_concept with target/def/estimate-number-last-year.nt ... parsing complete in 2.5s producing 495 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/estimate-number-last-year
Initializing coins_concept with target/def/budget-boundary.nt ... parsing complete in 0.1s producing 39 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/budget-boundary
Initializing coins_concept with target/def/pesa-1.1.nt ... parsing complete in 0.1s producing 31 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/pesa-1.1
Initializing coins_concept with target/def/esa.nt ... parsing complete in 2.6s producing 543 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/esa
Initializing coins_concept with target/def/territory.nt ... parsing complete in 0.2s producing 71 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/territory
Initializing coins_concept with target/def/data-subtype.nt ... parsing complete in 2.3s producing 471 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/data-subtype
Initializing coins_concept with target/def/department-group.nt ... parsing complete in 2.1s producing 439 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/department-group
Initializing coins_concept with target/def/signage.nt ... parsing complete in 0.1s producing 31 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/signage
Initializing coins_concept with target/def/request-for-resources-last-year.nt ... parsing complete in 0.4s producing 63 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/request-for-resources-last-year
Initializing coins_concept with target/def/programme-object-code.nt ... parsing complete in 513.3s producing 38855 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/programme-object-code
Initializing coins_concept with target/def/sbi.nt ... parsing complete in 8.1s producing 455 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/sbi
Initializing coins_concept with target/def/time.nt ... parsing complete in 0.8s producing 119 triples ... indexed as http://finance.data/gov.uk/def/statistical-concept/time
Total time taken 720.9s

The files are in N-Triples format. I also tried Turtle input but gave up after waiting too long! I've tried both :list_store and :memory_store, and it doesn't make much difference.

My guess is that something in the parser loop is not scaling linearly with the size of the input file, but that's just a guess. I don't think there's anything special about the input files themselves, but am happy to provide copies if that helps with debugging.
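To put a number on that guess, here is a quick sketch that turns four of the timings from the log into triples per second. The figures are copied from the log; the SAMPLES constant and the throughput helper are names made up for this illustration, not part of RdfContext. If parsing scaled linearly the rate would stay roughly flat, but for these four files it drops as the files get larger.

```ruby
# Timings copied from the log output: [file, seconds, triples].
# SAMPLES and throughput are illustrative names, not RdfContext API.
SAMPLES = [
  ["cofog.nt", 4.5, 1271],
  ["account-code.nt", 20.2, 4711],
  ["programme-object-group-code.nt", 125.7, 15895],
  ["programme-object-code.nt", 513.3, 38855],
]

# Triples parsed per second for one file (seconds is a Float).
def throughput(seconds, triples)
  triples / seconds
end

SAMPLES.each do |file, seconds, triples|
  printf("%-32s %6d triples  %7.1f triples/s\n",
         file, triples, throughput(seconds, triples))
end
```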

Ian

ijdickinson avatar Jul 01 '10 00:07 ijdickinson

The SQLite3 store will provide persistent storage, and may scale better for even larger graphs, but it is slower for smaller graphs. That would be :store => SQLite3.new(:path => "store.db"). You may also have found a memory leak within the parser. The NTriples parser is the same as the Turtle/N3 parser, so that could be an issue. Do you have the same problem parsing large files in other serializations?

If you have a script to run through these, I'll check it out.
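For reference, a script along those lines could be as small as the sketch below. It only uses Ruby's standard Benchmark module; time_loads is a hypothetical helper, and the block passed to it is a placeholder that counts lines, to be swapped for the real RdfContext parse call (deliberately not shown here, since the exact call depends on the setup).

```ruby
require "benchmark"

# Times a caller-supplied loader for each path. The loader is a hypothetical
# stand-in: it should parse the file and return the number of triples produced.
def time_loads(paths, &loader)
  paths.map do |path|
    triples = nil
    seconds = Benchmark.realtime { triples = loader.call(path) }
    { :path => path, :seconds => seconds, :triples => triples }
  end
end

if __FILE__ == $0
  results = time_loads(Dir.glob("target/def/*.nt")) do |path|
    # Placeholder: replace with the real parse call. Counting lines only
    # approximates the triple count of an N-Triples file.
    File.foreach(path).count
  end
  results.each do |r|
    printf("%s: %.1fs, %d triples\n", r[:path], r[:seconds], r[:triples])
  end
end
```

Running the same harness against each store and each serialization would show whether the slowdown follows the parser or the store.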

Also, note that the same parsers and serializers in RdfContext are also available through RDF.rb as rdf-rdfa, rdf-n3 and rdf-rdfxml. RDF.rb has a richer infrastructure for graph storage than RdfContext. I've also noticed that RDF/XML parsing is substantially faster, due to some underlying optimizations in that implementation.

gkellogg avatar Jul 01 '10 01:07 gkellogg