Unable to Have More Than 1024 File Descriptors at Once

Open heindelj opened this issue 7 years ago • 10 comments

Hi,

For some context, I have hooked up a potential to the driver code which comes with i-PI as this is probably the easiest way to use a personal potential as far as I can see. This works fine, but I need to print a property for each bead (from the extras), and am also using more than 1024 beads in some simulations at very low temperatures.

When this is done, the following error is given:

Exception in thread poll_driver:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/Users/hein071/research/i-pi-dev/ipi/engine/forcefields.py", line 166, in _poll_loop
    self.poll()
  File "/Users/hein071/research/i-pi-dev/ipi/engine/forcefields.py", line 260, in poll
    self.socket.poll()
  File "/Users/hein071/research/i-pi-dev/ipi/interfaces/sockets.py", line 674, in poll
    self.pool_update()
  File "/Users/hein071/research/i-pi-dev/ipi/interfaces/sockets.py", line 514, in pool_update
    readable, writable, errored = select.select([self.server], [], [], searchtimeout)
ValueError: filedescriptor out of range in select()

To the best of my knowledge, this error is independent of whether a unix or inet socket is used, but I have noted that more than 1024 beads are possible if the extra files are not opened. I do not know if this is only a problem when using the driver interface, or if using e.g. LAMMPS for forces would have the same problem.

After doing some googling, this is a known limitation of select.select() (https://docs.python.org/2/library/select.html). I don't think it is mentioned in the Python documentation I just linked, but it is noted in the NOTES section of the Linux man page (http://man7.org/linux/man-pages/man2/select.2.html). Specifically, FD_SETSIZE is 1024 on Linux systems, so select() can only monitor file descriptors numbered below 1024.

That being said, the problem can apparently be fixed with minimal changes by using select.poll() rather than select.select(), but I do not know if I can fix this properly myself, so I thought I would mention the problem here. I believe the only real changes needed are that whenever a new file descriptor is opened, it needs to be registered with the poll object using its register() method, and then the poll object's poll() method needs to be called rather than select.select().

To be clear, this is not a bug in i-PI but a limitation of the python (and hence underlying C) module select(), but there is a solution which can be implemented in i-PI with only minor changes using poll(). Unfortunately, the details have prevented me from being able to fix this myself.
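A minimal sketch of the substitution described above, using only the standard library. It waits on a single socket with select.poll(), which is not subject to the FD_SETSIZE ceiling. The helper name wait_readable is hypothetical and is not part of i-PI; the main difference to watch for is that poll() takes its timeout in milliseconds, whereas select.select() takes seconds.

```python
import select
import socket


def wait_readable(sock, timeout_s):
    """Return True if `sock` becomes readable within `timeout_s` seconds.

    Hypothetical helper (not i-PI code). It plays the role of
    select.select([sock], [], [], timeout_s) but goes through poll().
    """
    poller = select.poll()
    # Every descriptor of interest has to be registered explicitly.
    poller.register(sock.fileno(), select.POLLIN)
    # poll() expects milliseconds, unlike select.select().
    events = poller.poll(int(timeout_s * 1000))
    return any(mask & select.POLLIN for _fd, mask in events)


if __name__ == "__main__":
    a, b = socket.socketpair()
    print(wait_readable(a, 0.0))  # nothing has been sent yet
    b.sendall(b"x")
    print(wait_readable(a, 1.0))  # one byte is now waiting on `a`
    a.close()
    b.close()
```

For a listening server socket, POLLIN is reported when a connection is waiting to be accepted, so the same pattern should apply to the accept loop. Note that select.poll() is not available on Windows.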

heindelj avatar Aug 24 '17 18:08 heindelj

Uhm... I find it puzzling that this is only triggered if you output the extras files -- I don't see how that should relate to the sockets side of the story. Is it possible that you're trying to use 1024 instances of the driver? That would not really be necessary, as i-PI does some scheduling and can use the same driver for more than one bead - which is a good idea unless you have 1024 processors on a single node.

ceriottm avatar Aug 24 '17 19:08 ceriottm

I just checked again with a run in which I attempt to print the extras associated with 1536 beads, but only run 64 instances of the driver (2 nodes with 32 cores), and the same exception is raised. I believe this is because the error is not associated with the number of sockets open, but with the total number of file descriptors open in the process, output files included. Perhaps because all the file descriptors are handled by the i-PI instance, and the drivers never actually do any writing? (This is a guess as to what happens, so sorry if this is incorrect.)

So, from experience I can run as many instances of the driver code as I want, 1 per replica, but I cannot write to an arbitrary number of files. I thought this might be that I just had the ulimit set too low, but that is not the problem sadly. See, for instance, the NOTES documentation I linked above or this SO thread.
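For reference, a small sketch of how the per-process limit can be inspected from Python with the standard resource module (Unix only). The value 1024 for FD_SETSIZE is an assumption taken from the man page quoted above; Python does not expose that constant, and raising the ulimit does not change it.

```python
import resource

# Assumption: FD_SETSIZE is 1024 (the value quoted above for Linux).
# Python does not expose this constant.
ASSUMED_FD_SETSIZE = 1024


def fd_limits():
    """Return the (soft, hard) per-process limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = fd_limits()
    print("open-file limit (soft, hard):", soft, hard)
    if soft == resource.RLIM_INFINITY or soft > ASSUMED_FD_SETSIZE:
        # The process may open descriptors that select() cannot handle.
        print("descriptors >= %d can be opened, but select() will reject them"
              % ASSUMED_FD_SETSIZE)
```

The soft limit is what ulimit -n reports. It controls how many descriptors can be opened, not which descriptor numbers select() accepts.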

heindelj avatar Aug 24 '17 21:08 heindelj

Hi,

You can check the limits defined by your operating system using ulimit. The same command also allows you to change those limits. See if that helps. Most probably this is a problem related to a single process opening more files (unix sockets or properties/trajectories) than the OS allows. ulimit shows and changes those limits (assuming you are using a unix-like OS).

grhawk avatar Aug 28 '17 08:08 grhawk

Hi @grhawk, I could reproduce this and I think that @heindelj is right, this is not ulimit-related. I don't understand why this only gets triggered when printing extras though - that has nothing to do with the socket machinery. Now, @heindelj, honestly I do not see us fixing, in the very near future, a bug that is only triggered above 1024 beads (we're kind of focusing on non-PIMD use cases), but if you think you can substitute the select() call with a poll(), I'd be very happy to review the bugfix and merge it. Also, if you plan to run with such a high number of beads, make sure you get the PyFFTW libraries installed, or the normal-modes transformation will kill you in terms of performance.

ceriottm avatar Aug 29 '17 21:08 ceriottm

@ceriottm That's understandable. Honestly it's not a big deal because I only need to compute averages from what the extras prints so there's really no need to have all the files printed and I can just use fewer beads in the average. I believe I have seen you do this in a paper as well (the one with Felix Uhl).

I will have some free time in the next couple weeks and I'll see if I can fix this, even though it's really not much of an issue.

And thanks for the tip on pyFFTW. I have noticed a deterioration and was unsure of the cause.

heindelj avatar Aug 29 '17 21:08 heindelj

Something that would be hyper-useful, and perhaps not too hard to implement, is to be able to specify a range of beads in the trajectory outputs. I mean, you can already say <trajectory bead="0" ...> but it would be fantastic to specify a range there and have it do the right thing. Fancy some coding :-) ?

ceriottm avatar Aug 29 '17 21:08 ceriottm

I encountered this very thing yesterday! Instead I just used a loop on the command line and printed the same line 256 times with different bead numbers. Not the prettiest input file :)

I'm sure I could find a way to add that functionality.

heindelj avatar Aug 29 '17 21:08 heindelj

Something like a stride for imaginary time, like the one we have for real time, could be nice and very useful. The cost of printing xyz trajectory files also kicks in for a very large number of beads, and this option could alleviate it.

venkatkapil24 avatar Aug 29 '17 21:08 venkatkapil24

Let me summarize the discussion and confirm the problem.

I could reproduce the issue by first increasing the ulimit (ulimit -n 2048) and then running i-PI with the attached input file, input.txt (i-pi input.txt). The output is the following:

Exception in thread poll_driver:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/pjuda/source/i-pi-dev/ipi/engine/forcefields.py", line 191, in _poll_loop
    self.poll()
  File "/home/pjuda/source/i-pi-dev/ipi/engine/forcefields.py", line 285, in poll
    self.socket.poll()
  File "/home/pjuda/source/i-pi-dev/ipi/interfaces/sockets.py", line 678, in poll
    self.pool_update()
  File "/home/pjuda/source/i-pi-dev/ipi/interfaces/sockets.py", line 514, in pool_update
    readable, writable, errored = select.select([self.server], [], [], searchtimeout)
ValueError: filedescriptor out of range in select()

Note that the ulimit has to be greater than 1024 in order to get this error; otherwise a "too many open files" error is thrown instead. This is indeed a limitation of select(), and it should be possible to overcome it using poll(), as suggested for instance here: https://stackoverflow.com/questions/14250751/how-to-increase-filedescriptors-range-in-python-select

So there are two tasks related to the issue:

  1. Overcome the limitation of select() - e.g. replace it with poll() and adapt the code.
  2. Implement a feature to print bead trajectories with some "stride" (e.g. print trajectories of beads 0, 5, 10, 15, ...).
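As an illustration of the second task, here is a minimal sketch of how a bead stride could be expanded into explicit bead indices. The function name bead_indices and its parameters are hypothetical; nothing here reflects an existing i-PI option or input syntax.

```python
def bead_indices(nbeads, stride, start=0):
    """Return the bead indices start, start+stride, ... below `nbeads`.

    Hypothetical helper, not part of i-PI.
    """
    if nbeads < 0:
        raise ValueError("nbeads must be non-negative")
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    if start < 0:
        raise ValueError("start must be non-negative")
    return list(range(start, nbeads, stride))


if __name__ == "__main__":
    # 20 beads with a stride of 5 selects beads 0, 5, 10 and 15.
    print(bead_indices(20, 5))
```

With 1536 beads and a stride of 6, this selects 256 beads, which keeps the number of open output files well below the select() limit.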

pjuda avatar Dec 22 '17 15:12 pjuda

Some more input about the bug (1.).

The error is triggered in ipi/interfaces/sockets.py, line 514: readable, writable, errored = select.select([self.server], [], [], searchtimeout) because the file descriptor of the server socket is 1024 or greater. select() accepts only file descriptors below FD_SETSIZE (1024), a limit that Python inherits from the underlying C call.

The file descriptor is over the limit because this example uses a large number of output files. This results in a large descriptor number for the socket, above the limit that select() accepts.
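The range check can be seen in isolation with a short sketch, assuming CPython on a Unix-like system where FD_SETSIZE is 1024. The helper select_accepts is hypothetical. It relies on select.select() validating the descriptor number before making the system call, so the descriptor does not need to be open for the check to fire.

```python
import select
import socket

# Assumption: FD_SETSIZE is 1024; Python does not expose this constant.
ASSUMED_FD_SETSIZE = 1024


def select_accepts(fd):
    """Return False if select.select() rejects `fd` as out of range.

    Hypothetical helper, not part of i-PI.
    """
    try:
        select.select([fd], [], [], 0)
    except ValueError:
        # Raised for descriptor numbers outside the range select() supports.
        return False
    except OSError:
        # The number is in range, but no such descriptor is open here.
        return True
    return True


if __name__ == "__main__":
    a, b = socket.socketpair()
    print(a.fileno(), select_accepts(a.fileno()))
    print(5000, select_accepts(5000))
    a.close()
    b.close()
```

A process that already holds many output files open hands out higher descriptor numbers to anything opened later, which is how a single server socket can end up out of range.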

Given my limited knowledge about socket machinery, I have not been able to fix the problem in the short time available. As reported, a suggested solution is to use poll() instead of select().

pjuda avatar Jan 30 '18 15:01 pjuda