
Slow XML datasource (especially when using xpath)

Open patric-r opened this issue 4 years ago • 10 comments

Following up discussion #759, I've created an example project which reproduces the problem: we have one 300kb xml, one xml data source and two data sets (parent and child)

We iterate through the parent dataset and, for each entry in it, we iterate through each of that entry's children (multiple times, just to make the issue clearly visible). As BIRT does not support tree-like datasets, we use XPath and data set parameters to filter the correct children.

Even though it is a simple report with a tiny input file, report generation (PDF) takes ~10 seconds on a high-end workstation. BIRT_Slow_xml_processing_with_filters_and_xpath.zip

patric-r avatar Nov 30 '21 17:11 patric-r

Sampling-based profiling results (for one runAndRender run):

[image: profiling screenshot]

When looking at the invocation counts of the datatools object constructors, you can clearly see that nothing is cached here: the XML is parsed 100 or even 200 times, and a huge number of datatools XML objects are created:

[image: constructor invocation counts]

patric-r avatar Nov 30 '21 17:11 patric-r

Did some further investigating / debugging:

Interestingly,

session.getDataSetCacheManager().doesSaveToCache()

returns false. This getter is used within org.eclipse.birt.data.engine.executor.DataSourceQuery.execute(IEventHandler).

It looks like data set caching is not enabled, which might explain the behavior above. How can it be enabled? Why is it disabled by default?

I see some references to "appContext" parameters like "org.eclipse.birt.data.cache.memory", but I have no clue how to set those parameters in the BIRT designer / preview viewer.
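[Editor's note: for readers running reports programmatically rather than in the designer, the appContext map can be populated through the report engine API. The following is only a sketch; it requires the BIRT runtime on the classpath, and the design path is a placeholder.]

```java
import java.util.HashMap;
import java.util.Map;

import org.eclipse.birt.data.engine.api.DataEngine;
import org.eclipse.birt.report.engine.api.IReportEngine;
import org.eclipse.birt.report.engine.api.IReportRunnable;
import org.eclipse.birt.report.engine.api.IRunAndRenderTask;

public class EnableDataSetCache {
    // Sketch: enables BIRT's memory-based data set cache (value = max rows)
    // by putting the parameter into the task's appContext before running.
    static void runWithCache(IReportEngine engine, String designPath) throws Exception {
        IReportRunnable design = engine.openReportDesign(designPath);
        IRunAndRenderTask task = engine.createRunAndRenderTask(design);

        Map<String, Object> appContext = new HashMap<>();
        appContext.put(DataEngine.MEMORY_DATA_SET_CACHE, "10000");
        task.setAppContext(appContext);

        // ... set render options and call task.run() as usual ...
        task.close();
    }
}
```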

Any help is appreciated.

patric-r avatar Dec 01 '21 12:12 patric-r

I don't know, because I am not using any XML dataset, but maybe it can be set in the advanced properties?

[image: advanced properties dialog]

And maybe the properties can be set as system properties in birt.ini?

hvbtup avatar Dec 01 '21 16:12 hvbtup

Nope, in my example project this "needs cache for data-engine" property is set to true as well, so it looks unrelated.

I found this document: https://www.eclipse.org/birt/release20specs/BPS7_Data_Set_Caching.pdf I'm confused that it describes "design-time caching", i.e. including data set values in the report design, which is not what I (and most people) want/need. It seems this was removed, as the XML data set property window no longer shows the caching properties mentioned in that document. But maybe the source still has some remains of it...

BTW, how can a birt.ini be used within the development environment?

patric-r avatar Dec 02 '21 09:12 patric-r

I think you should try to debug this. Take a look at the DataSetCacheUtil class and its get*DataSetCacheConfig methods.

hvbtup avatar Dec 02 '21 09:12 hvbtup

By temporarily changing the preview viewer, I added context.put(DataEngine.MEMORY_DATA_SET_CACHE, "10000"); to the appContext.

With this change, caching was enabled and session.getDataSetCacheManager().doesSaveToCache() returned true!

Unfortunately, this did not solve my performance issue. The reason: BIRT's data set cache is query-based instead of data-based. While this might fit nicely for database data sets, it does not help in my case, because I use different XPath parameters for each data set access. The query therefore differs each time, and every lookup is a cache miss.

What we would need here is a data cache: keep the data source (in my case the 300 kB XML) in memory so that it only needs to be parsed once, and execute the actual, individual queries against the cached data. Looking further at the architecture, this seems difficult to solve without major changes to the data execution subsystem. Most probably, we would have to avoid datatools entirely for this.

patric-r avatar Dec 02 '21 16:12 patric-r

BIRT's data set cache is query-based instead of data-based.

To be more precise: If the same SQL query with the same combination of DataSet parameter values is accessed a second time from the layout, then the results are fetched from the cache.

While this might fit nicely for database datasets...

Yes. This prevents costly/slow queries from being sent to the database again - it is a very important performance feature.

Unfortunately that doesn't help in your case...

hvbtup avatar Dec 02 '21 17:12 hvbtup

The root cause is that the whole XML really has to be parsed again and again. This is because it is not possible to directly pass an XML subtree as input to a child DataSet (even better would be a parsed XML subtree). Instead, the only possible workaround seems to be to use a unique identifier of the parent row (in your case, the bookId) as a lexical parameter in an XPath expression that selects the detail rows from the whole XML:

/root/book[@bookId="{?bookId?}"]/page
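[Editor's note: this lexical-parameter workaround can be sketched with plain JAXP from the JDK. The tiny XML document, class, and method names below are illustrative assumptions, not BIRT code.]

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathPerParentRow {
    // Tiny stand-in for the 300 kB report input.
    public static final String XML =
        "<root>"
      + "<book bookId=\"b1\"><page no=\"1\"/><page no=\"2\"/></book>"
      + "<book bookId=\"b2\"><page no=\"1\"/></book>"
      + "</root>";

    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Mirrors the workaround: the current parent row's bookId is substituted
    // into the detail XPath, which selects that parent's child rows.
    public static int countPages(Document doc, String bookId) throws Exception {
        String expr = "/root/book[@bookId=\"" + bookId + "\"]/page";
        NodeList pages = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expr, doc, XPathConstants.NODESET);
        return pages.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        System.out.println(countPages(doc, "b1")); // 2
        System.out.println(countPages(doc, "b2")); // 1
    }
}
```

The key difference to BIRT's behavior is that here the Document is parsed once and reused; BIRT's XML ODA re-parses the source for every parameterized query.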

The report could probably be much faster if one used an XML parser to create a Java object structure from the XML once and then used POJO or scripted data sets to further process this Java object structure. But it would probably be much more complicated.
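[Editor's note: the parse-once-into-Java-objects idea could look like the sketch below; the `Page` POJO and `BooksLoader` are hypothetical names, and wiring the result into a BIRT POJO or scripted data set is left out.]

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class BooksLoader {
    // Hypothetical POJO for one detail row.
    public static class Page {
        public final String bookId;
        public final String no;
        Page(String bookId, String no) { this.bookId = bookId; this.no = no; }
    }

    // Parses the XML exactly once and groups the detail rows by bookId,
    // so the child data set becomes a cheap map lookup per parent row.
    public static Map<String, List<Page>> load(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, List<Page>> byBook = new LinkedHashMap<>();
        NodeList books = doc.getElementsByTagName("book");
        for (int i = 0; i < books.getLength(); i++) {
            Element book = (Element) books.item(i);
            String bookId = book.getAttribute("bookId");
            List<Page> pages = new ArrayList<>();
            NodeList pageNodes = book.getElementsByTagName("page");
            for (int j = 0; j < pageNodes.getLength(); j++) {
                Element p = (Element) pageNodes.item(j);
                pages.add(new Page(bookId, p.getAttribute("no")));
            }
            byBook.put(bookId, pages);
        }
        return byBook;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><book bookId=\"b1\"><page no=\"1\"/><page no=\"2\"/></book></root>";
        System.out.println(load(xml).get("b1").size()); // 2
    }
}
```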

hvbtup avatar Mar 30 '22 14:03 hvbtup

Thanks @hvbtup for your comment, with which I mostly agree. We're already assigning unique identifiers to parent rows to mitigate the fact that BIRT does not provide full XPath support.

You're right that a custom "XML to Java object" transformer would solve most of the performance problems. However, this has to be done by every BIRT XML data set user who is interested in performance (and who's not?). Instead, we should consider whether we can do better in BIRT itself, as XML support is a common use case in my view.

Ideally, we should think about an alternative XML data source, e.g. a DOM-based one (of course, users need to be careful with large XML files), and use standard JAXP features (e.g. XPath) executed on the DOM tree instead of SAX-oriented parsing and custom XPath code.
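[Editor's note: a minimal sketch of what such a DOM-based data source could do, assuming only JDK JAXP; `DomXmlSource` and its methods are invented for illustration and are not part of BIRT.]

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical DOM-backed data source: the document is parsed once in
// open(), and every data set query runs against the in-memory tree
// instead of re-parsing the file.
public class DomXmlSource {
    private Document doc;
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    public void open(String xml) throws Exception {
        doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Each (parameterized) child query reuses the same Document.
    public int count(String expr) throws Exception {
        NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        DomXmlSource src = new DomXmlSource();
        src.open("<root><book bookId=\"b1\"><page/><page/></book>"
               + "<book bookId=\"b2\"><page/></book></root>");
        // 200 child queries against the cached tree, zero re-parses:
        int total = 0;
        for (int i = 0; i < 100; i++) {
            total += src.count("/root/book[@bookId=\"b1\"]/page");
            total += src.count("/root/book[@bookId=\"b2\"]/page");
        }
        System.out.println(total); // 300
    }
}
```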

A couple of months ago, when preparing the sample project and collecting profiling data, I did some analysis, but when I realized that the data source code base is quite complicated and full of surprising dependencies, I postponed it. IMHO, the data source/ODA part of BIRT needs a proper refactoring.

patric-r avatar Mar 30 '22 15:03 patric-r

I agree mostly.

I'd like to add that XML is an overkill format for data serialization compared to e.g. JSON. But XML is still used a lot. OTOH, processing really big XML files does not seem like a reasonable use case IMHO - independent of the choice of tools and libraries.

The code base is definitely complicated - and not only the data engine part.

But let's face it: until now, only you and I are participating in this issue. And XML support is not important to me. So if you want to improve it, you should try it yourself.

Your idea with a DOM-based XML data source sounds quite reasonable to me. If I understand correctly, you mean that the XML is parsed once when the data source is opened. The data source then presents a DOM to the data sets. The data set can then use the DOM to navigate to those parts of the data in which it is interested.

It will need more memory to hold the DOM representation, but it should reduce processing time dramatically.

Even DOM is overkill in many cases IMHO. There's also StAX and several other approaches in Java (see https://www.baeldung.com/java-xml). When processing XML with Python, I use ElementTree, which is more lightweight, and I have always tried to avoid dealing with XML in Java, so I can't help much anyway.

hvbtup avatar Mar 31 '22 08:03 hvbtup