
On optimization of access to hard disk: xmipp-extract-particles (and in general any extract-particles protocol)

Open rmarabini opened this issue 7 years ago • 12 comments

Describe the bug: xmipp extract particles stays in the launching state for two hours before starting the first step.

To Reproduce Steps to reproduce the behavior:

  1. project 2018_10_10_rocio-arranz_cct_oct2018 (jaime)
  2. project contains around 6000 movies
  3. execute xmipp extract particles
  4. before creating the first step the protocol imports the coordinates (in this case 1,500,000). This takes 90 minutes. During these 90 minutes the state of the protocol is "launching" and the user does not get any feedback.

Comments

  1. I have modified the protocol so it prints a line every 50 processed movies. Nevertheless, this is a step that should be optimized. I am not sure whether this affects streaming mode.
  2. If I repeat the experiment with 200 movies (and 48000 coordinates) it takes 9 seconds.
  3. Processing the first 200 movies using the 6000-movie set takes more than 2 minutes, so accessing the pos files when there are many of them is very time consuming.

Suggestions

  • Reduce the number of disk accesses -> batch processing: read and process several pos files together
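The batching idea could look roughly like this (a hypothetical sketch: `parse_pos` and `extract_particles` are stand-ins for the real Scipion routines, not actual API calls):

```python
# Hypothetical sketch of batching disk access: instead of one
# round trip per pos file, read and process them in groups.
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of up to `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def process_pos_files(pos_files, parse_pos, extract_particles, batch_size=50):
    """Parse a whole batch of pos files, then hand the batch to extraction.

    parse_pos and extract_particles are stand-in callables for the
    real Scipion code, passed in so the sketch stays self-contained.
    """
    for batch in batched(pos_files, batch_size):
        coords = [parse_pos(f) for f in batch]
        extract_particles(coords)
```

The point is simply that the per-file fixed cost (open, seek, parse) is paid once per batch of 50 instead of once per coordinate lookup.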

rmarabini avatar Oct 17 '18 09:10 rmarabini

yep, relion had the same issue before the batch option was added

azazellochg avatar Oct 17 '18 09:10 azazellochg

Yes, the issue is with any extract... and, unfortunately, the batch option does not completely solve it.

When not in streaming, if the input coordinate set is big, then the checkNewInput call that generates the steps takes quite a long time, and the protocol is not even extracting particles yet (still in launched mode). I think a workaround is needed to allow that function to yield some partial results (which will start extracting) and then continue parsing more input and scheduling more jobs. Certainly the "batch" option helps here anyway.

delarosatrevin avatar Oct 17 '18 09:10 delarosatrevin

I met Jaime yesterday... and he mentioned something related to MPI? Like with 1 MPI everything works?

pconesa avatar Oct 18 '18 07:10 pconesa

Hi Pablo,

This is another problem. The situation is like this: if you use many MPIs (six in Jaime's case) the extraction sometimes crashes. My guess is that many processes are competing for the hard disk (movies, particles and database access) and if the number of MPIs is too high, at some point there is a timeout and the process crashes. Of course this problem only appears if you use more than a single CPU. In addition to this "crash" problem we have the one I described: when processing a full dataset with many coordinates the user needs to wait hours before the protocol starts to execute the first step. I modified the protocol so it now prints a line (every 50 micrographs) informing the user that the protocol is reading the coordinates for the different micrographs. Of course this is a dirty patch, not a solution.

Roberto


rmarabini avatar Oct 18 '18 11:10 rmarabini

Hi,

I think I have an idea about how to speed things up. I have executed the extract particles program starting from the whole data set and from a subset containing 200 micrographs. The result is that processing 200 micrographs takes an order of magnitude longer when we use the whole data set as input rather than the 200-micrograph subset. From this fact I conclude that searching in the coordinates.sqlite file is taking a lot of time (even more than creating the Python objects). I do not know what the query is, but my guess is that the objects table does not have any index at all. It would be nice if the program that creates the coordinates.sqlite file would also create an index for the micId (or the micName, I do not know which one we are using). Does this sound reasonable?

Roberto


rmarabini avatar Oct 18 '18 12:10 rmarabini

It does. Actually, for other methods we had similar problems and some parameters were added to the Sets, like "where", which combined with the last micId processed could help in all the loops. At least this is certainly true for 2D classes in streaming and other streamified protocols that deal with sets of particles.

pconesa avatar Oct 18 '18 12:10 pconesa

@rmarabini I think before working on any workaround, it would be better to do some profiling to detect where the current bottleneck is. I haven't done it either, but my gut feeling is that it has more to do with the construction of many Python objects (from the sqlite rows) than with the query.

Here is where coordinates are iterated, only those belonging to a given mic: https://github.com/I2PC/scipion/blob/master/pyworkflow/em/protocol/protocol_particles.py#L399

micId is already a numeric value, so I don't think we will gain much with an index, because we are not sorting by it.
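A quick way to settle that question is a cProfile run around the coordinate loop. A minimal sketch (here `read_coordinates` is a stand-in workload, not the real Scipion function):

```python
# Sketch: profile the coordinate-reading step to see whether time
# goes to the SQL query or to constructing Python objects.
import cProfile
import io
import pstats

def read_coordinates():
    # Stand-in workload; replace with the real iteration code,
    # e.g. the loop in protocol_particles.py linked above.
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
read_coordinates()
profiler.disable()

# Collect the top 10 entries by cumulative time into a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
```

If most cumulative time lands in sqlite calls, the index is the fix; if it lands in object constructors, batching the Python side matters more.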

delarosatrevin avatar Oct 18 '18 12:10 delarosatrevin

Hi,

I made a little experiment and the index made a huge difference. I describe it:

Using the same data as described earlier (6000 micrographs with 1500000 coordinates) I:

  1. made a backup copy of the file coordinates.sqlite
  2. connect to the database: sqlite3 coordinates.sqlite
  3. create two indexes: create index c03index on objects(c03); create index c04index on objects(c04);

I created two indexes since I do not know if we are using the micrograph index (c03) or the micrograph name (c04). Index creation time: 3 seconds.

  4. copy the protocol xmipp-extract-particles and rerun it.

    • Time to read the coordinates: 0:05:32.039221 (five and a half minutes)
  5. restore the original coordinates.sqlite file

  6. rerun the protocol

    • Time to read the coordinates: It has not finished yet, but in 5 minutes it has read the coordinates of 325 micrographs, so my guess is that it will spend almost two hours reading.

So the improvement is more than one order of magnitude.

I did not understand the argument regarding indexes and sorting. As you said, we do not sort the data, but indexes help with queries, in particular with the query "give me all coordinates for the micrograph with this id". If there is no index, the database needs to read the whole table (that is, the search is proportional to the number of coordinates); if there is an index, the search is proportional to log(number of coordinates).
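The effect is easy to demonstrate with an in-memory sqlite3 database (a small sketch; the `c03` column name just mirrors the objects table layout mentioned in this thread):

```python
# Sketch: EXPLAIN QUERY PLAN reports a full-table SCAN without an
# index, and an index SEARCH once c03index exists.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, c03 INTEGER)")
con.executemany("INSERT INTO objects (c03) VALUES (?)",
                [(i % 100,) for i in range(10000)])

def plan(query, params):
    # The 4th column of an EXPLAIN QUERY PLAN row is the plan detail.
    return con.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()[0][3]

before = plan("SELECT * FROM objects WHERE c03 = ?", (7,))  # full table scan
con.execute("CREATE INDEX c03index ON objects (c03)")
after = plan("SELECT * FROM objects WHERE c03 = ?", (7,))   # index search
```

Without the index every per-micrograph query pays a full scan of all 1,500,000 rows, which multiplied by 6000 micrographs matches the observed hours of waiting.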


Another interesting possibility for queries that repeat very often is to use prepared statements. For a normal query the database needs to (1) parse it, (2) convert the SQL to internal operations and (3) execute it. If we have 6000 micrographs we need to repeat the same query (except for the micId) 6000 times. It is possible to create a prepared statement and reuse it; in this case the database will perform (1) and (2) only once and reuse them for the 6000 almost identical queries. If somebody is interested in implementing such a beast I can test the difference between using and not using prepared statements. Since the testing will require some work, if there are no volunteers for implementing the prepared statements I would rather not test it.
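In Python's sqlite3 that reuse pattern amounts to executing the same parameterized SQL text with different bound values; a hedged sketch with a toy table (column `c03` again standing in for the micrograph id):

```python
# Sketch: one parameterized query reused for many micrograph ids.
# Because the SQL text never changes, sqlite3's statement cache
# compiles it once and only rebinds the parameter each time.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, c03 INTEGER)")
con.executemany("INSERT INTO objects (c03) VALUES (?)",
                [(i % 10,) for i in range(1000)])

QUERY = "SELECT COUNT(*) FROM objects WHERE c03 = ?"
counts = {mic_id: con.execute(QUERY, (mic_id,)).fetchone()[0]
          for mic_id in range(10)}
```

Building the SQL with string formatting instead would defeat the cache, since every query text would be distinct.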

Roberto

rmarabini avatar Oct 18 '18 15:10 rmarabini

Interesting finding @rmarabini , maybe I'm quite wrong about the indexes and the speed-up. Maybe we should discuss how to allow them in a general way in the context of Sets.

I would love to explore the prepared statements (I'm not sure if the Python bindings do something like that in the middle), but I don't have the time now. Still, something to put in the TODO list somewhere.

delarosatrevin avatar Oct 18 '18 15:10 delarosatrevin

OK, I will test the use of prepared statements and see if they improve the speed

Roberto


rmarabini avatar Oct 18 '18 15:10 rmarabini

Hi,

Checking the performance of prepared statements was easier than I thought.

The documentation says:

The sqlite3 module (https://docs.python.org/3/library/sqlite3.html#module-sqlite3) internally uses a statement cache to avoid SQL parsing overhead. If you want to explicitly set the number of statements that are cached for the connection, you can set the cached_statements parameter. The currently implemented default is to cache 100 statements.

So we are already using prepared statements.

The manual also claims that the optimal way to form a query is:

cur.execute("select md5(?)", ("foo",))

rather than

cur.execute("select md5(%s)" % "foo")

hope this helps

Roberto


rmarabini avatar Oct 18 '18 18:10 rmarabini

I think this is already addressed, right @rmarabini ?

pconesa avatar Aug 08 '19 14:08 pconesa