Open Multiple Databases for Multiple Clients
I have a fairly large dataset (70GB of XML), and BaseX (version 9.1.2) stops indexing it, reporting that the dataset is too large for a single database. Fine, I worked around that issue by splitting my dataset into subsets and indexing them separately.
Now to my problem. I want to launch a server and make multiple client requests. But every client needs access to the entire dataset, i.e., all databases in BaseX. The problem is that every client connection with the server is separated from other connections. So each client has to open all databases before requesting additional XQuery processing.
So I wonder if it is possible to avoid repeatedly opening all databases for each client connection and instead open all databases once when I start the server. How can I do that? Is this even possible?
I appreciate any help.
PS: I have plenty of RAM (1TB). So that shouldn't be a problem.
Thanks for your feedback. You can access multiple databases via a single XQuery expression. For general questions on BaseX, please write to our basex-talk mailing list: http://basex.org/open-source/. Thanks in advance.
I don't understand the rationale for splitting up the database. 70GB does not sound large to me. Why don't you just keep everything in one database?
The problem is the limit of nodes (2^31). See the statistics: Stats and Limits of BaseX
I don't mind splitting the dataset into multiple databases. However, I don't see any way to load multiple databases globally rather than in the scope of a single XQuery expression. I know it is possible to open multiple databases in an XQuery expression, but these databases are only opened in the scope of that specific XQuery. As a consequence, every request (XQuery) to the server has to open all databases, which creates a huge overhead (and it takes ages).
I was expecting an option like OPEN db, but for multiple databases. It seems this is impossible.
@AndreG-P just take a smaller dataset. I think we should not customize our queries to the specifics of BaseX. Hopefully, future versions of BaseX will support larger files. @ChristianGruen is there hope for that?
Indeed it’s quite common with BaseX to distribute large datasets across multiple databases. Databases are pretty light-weight entities, comparable to collections in other XML databases.
Opening a large database shouldn’t lead to long delays, so we might need to investigate this further.
The following command-line call will output the average time for opening and closing one of your databases 100 times. Could you please run it and report the result back to us?
basex -v -r100 "prof:void(db:open('name-of-your-db'))"
Some additional information might be helpful:
- What is the total number of nodes in your databases?
- How many documents are contained in the databases?
@ChristianGruen thank you for helping us.
I used a smaller dataset now (around 15% of the original dataset). This time BaseX was able to index the entire dataset in one index. So the following results belong to this smaller dataset.
I tried to run your command to measure the time. However, it takes ages. So I just opened the BaseX CLI and entered the query XQUERY prof:void(db:open('name-of-your-db')). Every time I evaluate it, it takes about 51 seconds (+/- 2 seconds), which is pretty long. Usually, I would use the OPEN name-of-my-database command, which evaluates in around 800ms (+/- 200ms). I don't understand why the XQuery version takes so much longer.
And here are the stats of this smaller dataset:
SIZE: 7411 MB
NODES: 407935813
DOCUMENTS: 135918
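To relate these statistics to the 2^31 node limit mentioned earlier, here is a rough back-of-the-envelope sketch. It assumes the "around 15%" share is accurate and that node density is uniform across the full dataset; both are assumptions, so the result is only an estimate.

```python
# Figures reported in this thread; the 15% share is the poster's own estimate.
NODE_LIMIT = 2**31                 # node limit per BaseX database, as cited above
SUBSET_NODES = 407_935_813         # NODES value of the smaller dataset
SUBSET_SHARE = 0.15                # "around 15% of the original dataset"

# Linear extrapolation; assumes node density is uniform across the data.
estimated_total = round(SUBSET_NODES / SUBSET_SHARE)

# Ceiling division: smallest number of databases that keeps each under the limit.
min_databases = -(-estimated_total // NODE_LIMIT)

print(f"estimated total nodes: {estimated_total:,}")
print(f"node limit:            {NODE_LIMIT:,}")
print(f"minimum databases:     {min_databases}")
```

Under these assumptions the full dataset exceeds the limit, which is consistent with the indexing failure reported at the start of the thread.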
My original problem with splitting the database into multiple parts is that I may request specific documents without knowing which database contains them. That's why I wanted to open all databases for all clients. Probably I have to wrap my head around another approach.
There should be no difference indeed between OPEN and db:open. Does the problem persist if you optimize your database via OPTIMIZE ALL?
Your approach is fine, it’s quite a common approach with BaseX to open all databases via a single query.
And I assume that you are getting the same results with the new 9.2 release?
Sorry for closing this issue right at the beginning. GitHub is used to report confirmed bugs or feature requests. Regular discussion on BaseX happens via our mailing list. For future requests, I invite you to register for our list.
I'm not sure if it is really a bug. I have the feeling prof:void(db:open('mydb')) takes a very long time because the result of db:open will be written to an output stream anyway, but prof:void doesn't show it. If OPEN mydb does not write anything to an output, it must be way faster. Could that cause the delay?
OPTIMIZE ALL took 10min and didn't change anything (as far as I can see). The performance is about the same as before. I tested it with the new BaseX 9.2 release, and the results are almost the same. BaseX 9.2 is a bit faster though.
Back to my original question: there is no global option like OPEN db to open multiple databases, right? And if I open multiple databases in the scope of an XQuery, they will be closed once the XQuery has finished. There is no way to keep them open for later XQuery requests, right?
I'm not sure if it is really a bug. I have the feeling prof:void(db:open('mydb')) takes a very long time because the result of db:open will be written to an output stream anyway, but prof:void don't show it.
prof:void does a bit more indeed, but the results won’t be serialized by that function. Instead, they will only be iterated through. In my tests, this took a little longer than the OPEN command, but the difference shouldn’t be that huge (I created a test database with 500,000 dummy documents; prof:void(db:open(...)) took less than a second).
Well, it seems to be huge in your setup. We might eventually need access to your data sets to tell more, as it seems to be pretty hard to reproduce the behavior you reported. Maybe you could dump a listing of the database files? What’s the size of the inf.basex file? If it’s fairly large, it could indicate that your documents have a lot of namespaces (if you don’t need namespaces, you can enable the STRIPNS option before creating the database).
there is no global option like OPEN db to open multiple databases right?
No, there isn’t. One such option could ensure that opened databases will stay opened until BaseX is shut down, or until a database is recreated. My hope would be that we can keep the time for opening databases low.
What’s the size of the inf.basex file?
I divided the dataset into 24 subsets now. I checked inf.basex in the largest subset, and it has a size of 585M. Is that big? tbl.basex has a size of 26GB.
If it’s fairly large, it could indicate that your documents have a lot of namespaces (if you don’t need namespaces, you can enable the STRIPNS option before creating the database).
Well, we have only two different namespaces but they are nested. Here is a small dump of a typical file in the database. As you can see we have two namespaces. Would that be a problem?
<mws:harvest
xmlns:mws="http://search.mathweb.org/ns">
<mws:expr><math xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow></math></mws:expr>
<mws:expr><math xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>x</mi><mo>∈</mo><msubsup><mi>ℝ</mi><mrow><mi/><mo>+</mo><mo>+</mo></mrow><mi>n</mi></msubsup></mrow></math></mws:expr>
...
<mws:expr><math xmlns="http://www.w3.org/1998/Math/MathML"><mrow><msub><mi>p</mi><mi>y</mi></msub><mo>=</mo><mrow><msub><mi>p</mi><mn>1</mn></msub><mo>+</mo><mrow><mn>2</mn><mo></mo><msqrt><mrow><msub><mi>p</mi><mn>2</mn></msub><mo></mo><msub><mi>p</mi><mn>3</mn></msub></mrow></msqrt></mrow></mrow></mrow></math></mws:expr>
</mws:harvest>
And I have 800k of those files in total.
No, there isn’t. One such option could ensure that opened databases will stay opened until BaseX is shut down, or until a database is recreated. My hope would be that we can keep the time for opening databases low.
I see. Unfortunately, this was precisely what I needed for my experiments. :)
I was able to solve this issue by myself, even though my solution is probably very inefficient. For each database, I now start a separate BaseXServer (so I end up running 24 BaseXServers in parallel). Each server opens one database with OPEN mydb. My requests are redirected to the right server by a small piece of logic I implemented. It worked pretty well and was quite fast, even though it consumed a lot of RAM.
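The thread does not describe the routing logic itself. One possible approach, shown here purely as a sketch, is to assign each document to a database by a stable hash of its path when splitting the dataset, and to reuse the same hash when routing requests. The function names, the document names, and the port layout (one server per port, counting up from the BaseX default port 1984) are all hypothetical.

```python
import zlib

NUM_SHARDS = 24      # number of databases/servers reported in the thread
BASE_PORT = 1984     # BaseX default server port; the per-shard offsets are hypothetical


def shard_for(doc_path: str) -> int:
    """Map a document path to a shard index using a stable CRC32 hash."""
    return zlib.crc32(doc_path.encode("utf-8")) % NUM_SHARDS


def port_for(doc_path: str) -> int:
    """Return the (hypothetical) port of the server that holds the document."""
    return BASE_PORT + shard_for(doc_path)


if __name__ == "__main__":
    # Hypothetical document names, for illustration only.
    for path in ["harvest_000001.xml", "harvest_000002.xml"]:
        print(path, "-> shard", shard_for(path), "port", port_for(path))
```

This only works if the documents were distributed with the same hash in the first place; with an arbitrary split, a lookup table from document path to database would be needed instead.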
Using multiple BaseX instances is definitely one way to go. We’ll give some more thought to which changes would be required to keep databases open.
I divided the dataset into 24 subsets now. I checked inf.basex in the largest subset, and it has a size of 585M. Is that big?
That is big indeed. This document is fully parsed if a database is opened; probably that’s also the reason for your high RAM consumption. I am surprised that the OPEN command was processed that fast on your system.
It would be interesting to hear if your inf.basex shrinks if you enable STRIPNS. If not, it could be the database statistics, and you could assign small values (maybe even 0) to the MAXLEN and MAXCATS options.
It might be enough to let the db client declare the org.basex.core.Context.
Example code: https://github.com/axxepta/basex-multitenant/blob/master/src/main/java/de/axxepta/basex/App.java
The main question is whether switching the Context from one query to another has any side effects.
Discarded (out of scope).