
issues with using more than one mpiproc per node

Open mtremmel opened this issue 7 years ago • 6 comments

When I run a calculation using 50 processes with 1 process per node, things seem to work all right. However, running the same calculation on the same simulation steps/halos with 2 processes per node on 25 nodes (rather than 50), I quickly get a database error. I'm not sure whether this is a problem with this specific filesystem or not. The database error that comes up is below. It happens quickly enough that it seems to occur when the code is simply attempting to read in the existing database properties. This is all done on nobackupp2 on pleiades, using the most up-to-date master branch (except the very latest updates from the past 12 hours or so).

sqlalchemy.exc.DatabaseError: (sqlite3.DatabaseError) database disk image is malformed [SQL: u'SELECT creators.id AS creators_id, creators.command_line AS creators_command_line, creators.dtime AS creators_dtime, creators.host AS creators_host, creators.username AS creators_username, creators.cwd AS creators_cwd \nFROM creators \nWHERE creators.id = ?\n LIMIT ? OFFSET ?'] [parameters: (234, 1, 0)]
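One way to narrow down a "database disk image is malformed" error is to check whether the file on disk is actually damaged or whether only the concurrent reads are failing. A minimal sketch, assuming the database is a plain SQLite file reachable from a login node; the helper name `integrity_report` is hypothetical and not part of tangos:

```python
import sqlite3


def integrity_report(db_path):
    """Return the rows produced by SQLite's PRAGMA integrity_check.

    A healthy database yields the single row "ok"; anything else is a
    list of problems that SQLite found in the file.
    """
    # Open read-only through a URI so the check cannot modify the file.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()
```

If this reports `ok` when run on its own after the job has died, the file itself is probably intact and the error is more likely to come from how the filesystem handles several readers at once.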

mtremmel avatar Nov 21 '18 18:11 mtremmel

Yes, that's almost certainly a filesystem bug (although it goes slightly in the opposite direction to what you'd expect, but I have seen weirder). As ever, we should not be using SQLite at this scale. But maybe post the full traceback in case there is some identifiable lock we can put in.

apontzen avatar Nov 21 '18 19:11 apontzen

The problem is that all of my processes die at once so the traceback is all jumbled... not sure how useful it is. I could just attach the full error file?

mtremmel avatar Nov 21 '18 19:11 mtremmel

I think there's some mpirun / mpiexec variant that will label which line came from which processor?

apontzen avatar Nov 21 '18 19:11 apontzen

Interesting... right now each "normal" (non-traceback or error message) line does show the processor number, but the traceback doesn't...

mtremmel avatar Nov 21 '18 19:11 mtremmel

Yes the 'normal' lines are labelled by tangos itself but it can't do that for the traceback. I think though mpirun can do it... you just need to find the right incantation...

Or alternatively, just do some manual detective work ;-)
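If the launcher cannot label the output, the tracebacks can also be tagged from inside Python. The following is only a sketch, not something tangos does: it replaces `sys.excepthook` so that every traceback line carries a rank prefix. The environment variable names are an assumption about which launcher is in use (Open MPI, an MPICH-style launcher, or Slurm), and all function names are hypothetical.

```python
import os
import sys
import traceback

# Environment variables that some launchers set to the process rank.
# Assumption: which of these (if any) exists depends on the MPI stack.
RANK_VARS = ("OMPI_COMM_WORLD_RANK", "PMI_RANK", "SLURM_PROCID")


def guess_rank(environ=os.environ):
    """Return the rank as a string, or "?" if no known variable is set."""
    for name in RANK_VARS:
        if name in environ:
            return environ[name]
    return "?"


def format_tagged(exc_type, exc, tb, rank):
    """Format a traceback with "[rank] " in front of every line."""
    text = "".join(traceback.format_exception(exc_type, exc, tb))
    return "".join(f"[{rank}] {line}\n" for line in text.splitlines())


def install_tagged_excepthook(rank=None, stream=sys.stderr):
    """Make uncaught exceptions print rank-tagged tracebacks."""
    rank = guess_rank() if rank is None else rank

    def hook(exc_type, exc, tb):
        # Emit the whole traceback in one write call.
        stream.write(format_tagged(exc_type, exc, tb, rank))
        stream.flush()

    sys.excepthook = hook
```

Calling `install_tagged_excepthook()` once at the start of the script is enough. With the prefix in place, the jumbled error file can be split per process with something like `grep '^\[3\] '`.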

apontzen avatar Nov 21 '18 19:11 apontzen

I'll look into it

mtremmel avatar Nov 21 '18 19:11 mtremmel