
ca.clear_channel should be called on PV.disconnect

Open klauer opened this issue 5 years ago • 25 comments

Current master (and all previous released versions) appear to leave a CA channel dangling without reference in Python - until the interpreter exits.

Need to test that adding this causes no new issues or instability.

klauer avatar Apr 30 '19 17:04 klauer

Hello. Is there any reason why this hasn't been added yet?

At ESS we are struggling with memory leaks that we think are caused by this. Adding a call to ca.clear_channel() after PV.disconnect() seems to solve the issue for us.

juanfem avatar Mar 01 '21 10:03 juanfem

Hi again,

I think I understand the issue now.

I assume that ca.clear_channel() is not called because there can be more than one PV object connecting to the same channel and pyepics is caching that connection. Therefore, PV.disconnect() should only call ca.clear_channel() when there are no PV objects using it.

However, here the reference to that channel is removed from ca._cache by calling ca._cache.pop().

I think a solution could be to keep a list of all the PV objects subscribed to the same channel, and only remove the channel from ca._cache and call ca.clear_channel() when the last PV object calls PV.disconnect().

I can volunteer to create a pull request if needed.

Cheers, Juan

juanfem avatar Mar 01 '21 12:03 juanfem

@juanfem Before working on a solution, it would be helpful to understand the problem you are seeing. When you say "memory leaks" do you mean that memory usage keeps growing even though no resources (say, PVs) are explicitly allocated, or do you mean that trying to purge PVs from a program may leave some dangling connections?

Diagnosing memory issues in Python programs can be challenging. Diagnosing PVs left partially in the pyepics caches ought to be easier. For sure, I and many other people run long-running pyepics-heavy applications. I'm skeptical that there are "real" memory leaks, but I'll also admit that most of my long-running pyepics apps connect to a set of PVs (perhaps only 25, perhaps a few thousand) and don't try to disconnect from PVs very often -- a Display Manager may very well want to connect/disconnect from PVs.

So, can you describe the actual problem you are seeing before trying to change the code?

newville avatar Mar 02 '21 00:03 newville

Hi, I work with @juanfem and I can post the code here. The main idea is to have something that returns all the details of a list of PVs in a FastAPI application. The function is this:

import time

import epics
from fastapi import FastAPI

app = FastAPI()

@app.get("/api/v1/caget/{pvlist}")
async def get_pv(pvlist: str):
    pvnames = pvlist.split(",")
    pvs = []
    for pvname in pvnames:
        pvs.append(epics.get_pv(pvname))
    time.sleep(1)  # give the PVs a moment to connect

    result = {}
    for pv in pvs:
        result[pv.pvname] = pv._args
        if pv.connected:
            pv.clear_auto_monitor()
            pv.clear_callbacks()
            pv.disconnect()
    return result

The list is garbage collected, but the PVs remain, and memory continues to grow until it reaches 100%. In other words, I cannot find a safe way to remove an unused PV object from memory, because of the cache mechanism. I understand that caching is useful in most applications, but a way to stop the caching and release the memory for a PV would be an interesting feature.

emanuelelaface avatar Mar 02 '21 07:03 emanuelelaface

Hi @newville, The issue arises when you create a new PV object and then call its disconnect method. Because ca.clear_channel() is not called, the resources are never released by the CA library.

In the snippet below you can see that the memory continuously increases every time a new PV object is created, even if it tries to connect to the same EPICS PV (CA library allows that).

import epics
import os
import psutil

pv_name = "fake"

process = psutil.Process(os.getpid())

print(process.memory_info().rss)

for _ in range(10):
    for _ in range(1000):
        pv = epics.PV(pv_name, auto_monitor=True)
        pv.disconnect()

    print(process.memory_info().rss)

Adding a call to ca.clear_channel() solves the issue.

import epics
import os
import psutil

pv_name = "fake"

process = psutil.Process(os.getpid())

print(process.memory_info().rss)

for _ in range(10):
    for _ in range(1000):
        pv = epics.PV(pv_name, auto_monitor=True)
        cache_item = epics.ca._cache[epics.ca.current_context()].get(pv.pvname, None)
        pv.disconnect()
        epics.ca.clear_channel(cache_item.chid)

    print(process.memory_info().rss)

However, this is not a valid solution, since more than one PV object can be connected to the same channel, and ca.clear_channel will disconnect all of them. That's why I suggest adding a count of the PV objects connected to the same channel and clearing the channel only when all of them have called the disconnect method.
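A rough sketch of what I mean by reference counting (plain Python; ChannelCache and the callable names are hypothetical stand-ins for the pyepics internals, not actual pyepics API):

```python
# Hypothetical sketch of per-channel reference counting; ChannelCache,
# clear_channel, and create_channel stand in for pyepics internals.

class ChannelCache:
    def __init__(self, clear_channel):
        self._clear_channel = clear_channel  # e.g. ca.clear_channel
        self._channels = {}   # pvname -> channel id
        self._refcounts = {}  # pvname -> number of PV objects using it

    def acquire(self, pvname, create_channel):
        # Reuse the cached channel if one exists, otherwise create it.
        if pvname not in self._channels:
            self._channels[pvname] = create_channel(pvname)
            self._refcounts[pvname] = 0
        self._refcounts[pvname] += 1
        return self._channels[pvname]

    def release(self, pvname):
        # Called from PV.disconnect(); only the last user clears the channel.
        self._refcounts[pvname] -= 1
        if self._refcounts[pvname] == 0:
            chid = self._channels.pop(pvname)
            del self._refcounts[pvname]
            self._clear_channel(chid)
```

Here clear_channel is injected as a callable so the logic can be exercised without a running IOC; in pyepics it would be ca.clear_channel and the channel ids would come from ca.create_channel.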

juanfem avatar Mar 02 '21 09:03 juanfem

@emanuelelaface @juanfem A couple of comments.

First, the best advice I can give to make your code and life better is probably "don't do that". In CA (aside from Python), channels/PVs are relatively expensive to create and connect and very cheap to keep around. For Python, automatic garbage collection will throw out connection handles and state information unless kept in a global (to the process and Thread if you are using those). So pyepics needs to maintain a global cache of connection information. If your application is creating and connecting to PVs that you might use again, it is probably not a good idea to disconnect them only to reconnect later.

In both examples, why not just take out the disconnection code and leave the PVs connected? What are you expecting the disconnects to do? For the web app, you might consider creating a separate process that keeps and maintains a cache of connected PVs and have your web code fetch values from that store. I use a relational DB for such a store in this app: https://millenia.cars.aps.anl.gov/pvarch/ so that the code run by the web server process does not create any Epics PVs at all. It happens to use Python+Flask for the web interface, but by using a DB cache of PV values, the values can also be fetched from Javascript or other tools. The web server does maintain a persistent connection to the relational DB per process, but that is a common approach and the libraries for "web+DB" applications are well-engineered.
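To make the separate-process idea concrete, here is a minimal sketch (sqlite3 just for illustration; the table and function names are made up, not taken from the pvarch app): a collector process keeps the PVs connected and writes each update into the store, and the web handlers only read from it.

```python
import sqlite3

def make_store(conn):
    # One row per PV, holding its latest value and timestamp.
    conn.execute("CREATE TABLE IF NOT EXISTS pvcache "
                 "(pvname TEXT PRIMARY KEY, value TEXT, ts REAL)")

def on_value_change(conn, pvname, value, timestamp):
    # In the collector process, this would be registered as a
    # pyepics monitor callback for each connected PV.
    conn.execute(
        "INSERT INTO pvcache (pvname, value, ts) VALUES (?, ?, ?) "
        "ON CONFLICT(pvname) DO UPDATE SET value=excluded.value, ts=excluded.ts",
        (pvname, str(value), timestamp))

def read_value(conn, pvname):
    # In the web process, handlers call this instead of creating PVs.
    return conn.execute("SELECT value, ts FROM pvcache WHERE pvname = ?",
                        (pvname,)).fetchone()
```

In reality the two sides would use separate connections to a shared database; the point is that only the collector process ever touches Channel Access.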

Second, it is not entirely clear to me what epics.PV.disconnect() should do. Again, because of the global cache, and potential for multiple python epics.PV objects pointing to the same channel/PV address, each with potentially different callbacks, it's not at all clear that epics.PV.disconnect() should call epics.ca.clear_channel(). In fact, this is currently in the code but commented out. It might be that this could be improved, but it should be done with care and with complex use cases in mind. So: what do you think epics.PV.disconnect() should do? I think it is not at all obvious.

newville avatar Mar 02 '21 16:03 newville

I am a bit confused by the solution you propose, because I have used exactly that approach in the past, but what I am trying to do now is something different.

If the PVs to monitor and expose are known in advance, I see the use of a DB to cache the values, and I used this idea in a previous version of my code. But then I decided it would be more useful to have a general-purpose web interface to the PVs, one that can query any PV (my interface actually also calls the archiver and the channel finder if needed), in order to develop not only web applications but any kind of application, abstracting the EPICS protocol into something a bit more general purpose.

In this scenario, I don't understand the use of the DB and caching. On the backend I can have millions of PVs, and users can request any list of PVs from the server, so I cannot build a database caching the values of all the PVs. I simply want to serve the value of the requested PV while the user is connected, and stop watching that PV when the user closes the connection (or after a timeout).

In this scenario I need to create and destroy the connection with EPICS, otherwise over time I will run out of resources in terms of memory and bandwidth, because the more queries I receive, the more PVs I cache. That is why I was wondering about a mechanism to stop the caching, even after a timeout. The idea that the connection (and the data structure) remains in the cache indefinitely is, in my opinion, not optimal for applications that need to run for a long time and query the EPICS infrastructure periodically.

emanuelelaface avatar Mar 02 '21 17:03 emanuelelaface

Sorry, I think the discussion is drifting away from the original issue... I believe the discussion on how my colleague should/shouldn't implement his code belongs to other forums (or at least to another issue), so I suggest we come back to the original issue.

You ask what I think the disconnect method should do. It is a reasonable question, because the only documentation I could find says "disconnect a PV, clearing all callbacks." But before answering that question, I'd rather start with what it should NOT do. I copy here the snippet in question:

  def disconnect(self):
      "disconnect PV"
      self.connected = False

      ctx = ca.current_context()
      pvid = (self.pvname, self.form, ctx)
      if pvid in _PVcache_:
          _PVcache_.pop(pvid)

      cache_item = ca._cache[ctx].pop(self.pvname, None)
      if cache_item is not None:
          if self._monref is not None:
              # atexit may have already cleared the subscription
              self._clear_auto_monitor_subscription()

          # TODO: clear channel should be called as well
          # ca.clear_channel(cache_item.chid)

      self._monref = None
      self._monref_mask = None
      self.clear_callbacks(True, True)
      self._args = {}.fromkeys(self._fields)
      ca.poll(evt=1.e-3, iot=1.0)

The method removes the channel from ca._cache, but it does not release the resources. The connection created by the CA library is no longer referenced anywhere in the Python code. This is a memory leak, and my advice is that it should be fixed. It is clearly undesired behavior, because the only way to free those resources is to stop the Python interpreter.

Depending on the interpretation of what the disconnect method should do, I see two options here:

  1. Only clear the monitor subscription and callbacks. In that case, it should NOT remove the channel from ca._cache. Successive creation of PV objects connecting to the same EPICS PV will reuse that connection.
  2. Call ca.clear_channel(), taking the precaution that no other PV object is using this connection. This is what I think it should do, because it is what the method name tells me. Also, since the disconnect() method is called from the __del__ method, I would assume it should release all resources. If that is not the intended behavior, it would be advisable to say so in the documentation.

With the current implementation, the best is to avoid using the disconnect method and instead call _clear_auto_monitor_subscription() and clear_callbacks(). This is what you are suggesting, but then maybe it is a good idea to deprecate this method and add this information in the documentation. Also, adding a method that combines _clear_auto_monitor_subscription() and clear_callbacks() could be helpful.

juanfem avatar Mar 02 '21 20:03 juanfem

@juanfem Well, I think it is fair to understand both why you would want to call PV.disconnect() and what you expect that to actually do. Just to be clear, this library is pretty stable and in production for many people, so we are more concerned about inadvertent breakage of existing code than we are about fixing perceived and theoretical flaws. For sure, there are ways to use this library that will get a response of "don't do that, do this instead".

That said, I don't exactly know why the ca.clear_channel() is commented out of PV.disconnect(). Does un-commenting that line fix your problem? I would also ask @klauer if he has an opinion here. I agree that if the cache entry is removed, calling ca.clear_channel() is probably the right thing to do, and I agree that it is somewhat unclear what unintended consequences that might have.

I guess that if this fixes your issue, that it is reasonable to assume that we should un-comment that line. Anyone else have opinions on this?

newville avatar Mar 02 '21 23:03 newville

@newville I agree with your reasoning and am grateful to you for trying to help my colleague find a solution. However, in this case this seems to be a bug, reported by another person 2 years ago.

Uncommenting the line does not seem like a good idea to me. The reason is that it would affect all PV objects connected to the same channel. In fact, it seems they would raise exceptions in that case.

Given your concerns about the stability of the library and the risk of adding bugs, I would suggest going with the first of the two options I mentioned above, which is the one that implies fewer changes. In fact, that should only require removing lines 1121 and 1122. The method _clear_auto_monitor_subscription() already does this:

self._monref = None
self._monref_mask = None

so it might be safe to remove lines 1130 and 1131 for clean-up reasons, although they are not harmful.

With these changes, the CA connection will be kept open and cached on the Python side, but I assume that should not be a problem for anyone. Creating a new PV object to connect to the same channel won't allocate more resources, which could be an advantage in terms of performance as you described above.

juanfem avatar Mar 03 '21 13:03 juanfem

@juanfem um, @klauer is a long-time contributor and the original report is sort of "we need to figure this out....". So, um, you can call this a bug if you want.... From my point of view, we're trying to make code that works for as many people as possible.

I continue to be confused by what you want or expect to happen here. Three days ago you said "Hello. Is there any reason why this hasn't been added yet?" The "this" in that sentence must be this Issue: that is what you commented on. And, just to be obvious, the issue is, in its entirety:

Current master (and all previous released versions) appear to leave a CA channel dangling without reference in Python - until the interpreter exits.

Need to test that adding this causes no new issues or instability.

So the topic under discussion here is exactly about CA channels being left dangling. And now you say that uncommenting the line to clear the channel seems like a bad idea and that epics.PV.disconnect() should not even remove the entry from the cache. That will definitely leave resources dangling. To me, dropping the entry from the cache seems to be exactly what epics.PV.disconnect() should do. One could turn off the monitor callbacks separately.

As I said earlier, it is a little unclear how Python should handle the concept of a user wanting to disconnect PVs from within a single process. I am sure there is room for interpretation on what should be done.

Hope that helps clarify why we have not reached "obvious resolution" here: I think we have not actually identified "the" problem.

newville avatar Mar 04 '21 00:03 newville

The title of the issue says "ca.clear_channel should be called on PV.disconnect." I fully agree with that statement, and this is what I was suggesting in the first place. However, just uncommenting the line calling ca.clear_channel is not a solution, since it has side effects. It requires more code.

Then you started the philosophical discussion about what disconnect should do. I pointed at two possibilities that come to my mind, and I showed my preference for the second option that calls ca.clear_channel after taking some precautions.

Call ca.clear_channel(), taking the precaution that no other PV object is using this connection. This is what I think it should do, because it is what the method name tells me. Also, since the disconnect() method is called from the del() method, I would assume it should release all resources. If that is not the intended behavior, it would be advisable to add a comment in the documentation.

To be honest, I have no strong preference here, and I would be happy even if you choose the other interpretation. Or any other that comes to your mind, as long as it is documented and bug-free.

I understood you are reluctant to make relatively big changes to the code, worrying that it may introduce new bugs. That is something I completely understand, and that is why in my last comment I suggested a minor modification. If I misunderstood and you agree to go for my preferred solution (clearing the CA channel), I would be happy to contribute to the code and create a pull request, but you'd need to agree to do some testing also on your side.

You say

I think we have not actually identified "the" problem.

And I have to disagree. The problem is clear to me, so let me explain why the solution I suggested above fixes the memory leak without clearing the CA channel. Paraphrasing again @klauer

Current master (and all previous released versions) appear to leave a CA channel dangling without reference in Python - until the interpreter exits.

The key is the reference in Python. If you keep the reference to the CA channel in Python (in _cache in ca.py), even without clearing the channel you make sure you always reuse the same CA channel every time you instantiate a new PV object with the same PV name. Right now you don't, and if you connect and disconnect from the same PV 1 million times, you will have created 1 million CA channels with no Python references to them (only a reference to the last one created).

There is a drawback. With this solution, the first time you create a PV object for a new PV, it allocates resources that are never released. However, this is far less dangerous than the current situation, in my opinion, because you would need to connect to random PVs to cause trouble. If your machine has, say, 1 million PVs, you should never allocate more than 1 million CA channels. You consider this leaving resources dangling, but that is not 100% true: you still have a reference from Python to those resources, and you could eventually release them if there were a method calling ca.clear_channel (which, to me, should be disconnect).

I hope the issue is a bit clearer now. Let me know if you want to fix this, and if you need any support from my side.

juanfem avatar Mar 04 '21 10:03 juanfem

After some testing, I realized that there was another cache keeping references to methods of PV objects, thus preventing them from being garbage collected. Those references are stored in a _CacheItem object, so the fix requires a few extra lines. However, it is still simpler than a fix that calls ca.clear_channel.

The method would look like this:

@_ensure_context
def disconnect(self):
    "disconnect PV"
    self.connected = False

    ctx = ca.current_context()
    pvid = (self.pvname, self.form, ctx)
    if pvid in _PVcache_:
        _PVcache_.pop(pvid)

    cache_item = ca._cache[ctx].get(self.pvname, None)
    if cache_item is not None:
        if cache_item.callbacks.count(self.__on_connect):
            cache_item.callbacks.remove(self.__on_connect)
        if cache_item.access_event_callback.count(self.__on_access_rights_event):
            cache_item.access_event_callback.remove(self.__on_access_rights_event)

        if self._monref is not None:
            # atexit may have already cleared the subscription
            self._clear_auto_monitor_subscription()

    self._monref = None
    self._monref_mask = None
    self.clear_callbacks()
    self._args = {}.fromkeys(self._fields)
    ca.poll(evt=1.e-3, iot=1.0)

I can open a pull request if you wish. I can also work on a solution that calls ca.clear_channel if preferred.

juanfem avatar Mar 10 '21 21:03 juanfem

@juanfem yeah, I'm still confused by what you expect PV.disconnect to do. Earlier you said that you thought that calling ca.clear_channel() was a bad idea.... but that is precisely the original issue. Hm... You also said that "the problem" was clear to you, but let me assure you, neither "the problem" nor "the solution" is clear to me. Given that @klauer also had a similar take when raising this issue, I am left unsure whether you actually understand the issues involved. Sorry, but I actually don't know who you are (real names are very helpful) or what you work on. As far as I can tell, you've never added code to this project before. I don't always stay up-to-date on who's doing what in Epics development, so it is hard for me to tell if you are an "experienced Epics C developer who knows what they are talking about" or a "random person on the internet".

As I said when you first wrote about this issue, it is not at all clear to me what PV.disconnect() should do. I think that until the goal for PV.disconnect() is known, there isn't much point trying to "fix" it. "Fixing" implies it doesn't do what it is supposed to do.... So, what is it supposed to do?

There is a global cache of connection state information for each Channel. When you create a pyepics.PV object, it may create new entries into the cache. It's possible that a complex program will have multiple pyepics.PV objects pointing to the same underlying cache data. That is all by design and very much intended.

In what situation does one call pyepics.PV.disconnect()? I don't know - I can tell you that I rarely, if ever, call this in my apps. When calling this method, it is pretty clear that one means to stop the user-level callbacks for that PV object. But this also leaves the pyepics.PV object in use -- maybe the intention is to re-connect later?

Should the cache(s) be completely emptied? Well, that seems dangerous if there is another pyepics.PV pointing to that cache? What if a user expects pyepics.PV.disconnect() to momentarily suspend callbacks and then wants to "reconnect"? Should the complete set of CA search/connect/subscribe procedures be re-run? I don't know. It is pyepics.PV.disconnect(), not pyepics.ca.clear_channel(), so it does sort of apply to the high-level PV object, without much promise of what goes on under the hood.

newville avatar Mar 10 '21 22:03 newville

Sorry, but I am getting a bit tired of the philosophical discussion. Regardless of what you expect disconnect to do, there is a memory leak in the code. I am trying to help you fix it. Let me know if you want help with that.

I have the feeling you don't want to spend any time understanding the issue nor the fix, so I'll stop here.

Just one thing, I never said that calling ca.clear_channel() was a bad idea. I said uncommenting that line was a bad idea.

juanfem avatar Mar 10 '21 22:03 juanfem

@klauer do you have any comments or suggestions on how to proceed here? I'm afraid I may be missing some of the nuances of what this issue is really about. I can believe the proposed solution by @juanfem is the right approach, but I'm not certain that matches your experience too, and I'm not sure what others actually expect PV.disconnect to do in detail -- or maybe I don't understand the motivation to disconnect from a PV and then reconnect.

Do you have time to look into this, or have you all moved on to other projects?

newville avatar Mar 21 '21 23:03 newville

I'm rather busy with other projects at the moment, sadly. If there's all-around agreement as to what needs to be done here, I may be able to find time to help, though.

The goal of my contributions here has always been to improve pyepics' underlying code without interfering with how the current user base tends to use the library. If I recall correctly (and I may not, so please take this with a grain of salt), I believe I did not include clear_channel because there was no reference counting of the number of higher-level PV instances using the same underlying channel. Tracking that information correctly, and without holding hard references, will likely take some care. In any case, I then created the issue as a reminder and pushed the fix down the line.

klauer avatar Mar 22 '21 16:03 klauer

@klauer OK, thanks for that comment - I'm also mostly focused on other projects these days, but I would hope we can keep this code in working order and try to continue improving it. It does seem like the proposed change from @juanfem might be the sort of reference counting you are talking about. It does make sense to me to remove any connection and access callbacks from the cache, and I guess that I'm (slowly!) coming around to see the point that PV.__del__ calls PV.disconnect and so should try to clean up any cache entries for that PV. Still, I sort of suspect that there will be a lot of room for dangling callbacks unless one is using del PV or disconnecting from PVs all the time....

I guess that is to say that I would be willing to give this a try. It would be nice to have a test to go along with it too. @juanfem you seemed most interested in this, but also sort of impatient with the slowness (or my slowness) here. Are you willing to work on this?

newville avatar Mar 28 '21 15:03 newville

Sorry for the late reply, @newville. I am certainly willing to contribute.

It seems that both you and @klauer agree somehow that the disconnect method should clear the connection and the caches associated with the PV, so I'll work on that implementation, adding the reference counting. I'll prepare a pull request, it will probably take me a couple of weeks.

How confident are you on the current test suite? Can I rely on that or shall I do more extensive testing? I didn't have a look at it yet...

juanfem avatar Apr 07 '21 14:04 juanfem

@juanfem more extensive testing would be very wise and much appreciated. 👍

It's worth noting that there are some users - albeit very much in the minority - that rely on the epics.ca layer and do not use the PV class at all. Whatever changes you make should ensure that both remain viable. It may also be wise to move toward weakref.finalize here instead of relying on the entirely unreliable __del__ methods.
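A rough sketch of the weakref.finalize idea (plain Python; the stripped-down PV class and the cleanup_channel callable are hypothetical, not the real pyepics implementation):

```python
import weakref

class PV:
    # Hypothetical, stripped-down PV that registers a finalizer instead
    # of relying on __del__. The finalizer must not hold a hard
    # reference to self, so it captures only the data needed to clean up.
    def __init__(self, pvname, cleanup_channel):
        self.pvname = pvname
        self._finalizer = weakref.finalize(self, cleanup_channel, pvname)

    def disconnect(self):
        # An explicit disconnect runs the same clean-up; calling the
        # finalizer marks it dead, so the clean-up runs at most once.
        self._finalizer()
```

Whether disconnect() is called explicitly or the object is garbage collected, the clean-up runs exactly once, which is a property __del__ cannot guarantee.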

Unfortunately with these projects, you'll rarely get feedback while developing/fixing, but rather only bug reports once it makes its way into a release.

Anyway, those are some of my distracted/half-baked thoughts on the matter. Good luck with the fixes and thanks in advance for the help 😄

klauer avatar Apr 07 '21 16:04 klauer

@juanfem Yes, that would be great. My suggestion would be to try something like the code you suggested earlier in this thread: does that work for you and your tests? It certainly looks reasonable to do that check of the "count" on the callbacks. If that helps mitigate leaks of no-longer-reachable callbacks, it would probably be useful and is "unlikely" (famous last words) to cause other problems.

The testing framework here is kind of messy. There are plenty of tests mostly demonstrating correct behavior of expected actions, not so much on how problems are handled. The tests also more or less require PVs to be on the network and so an ioc to be running. Yes, that can be arranged and can be automated. And the TravisCI script automates that.

There are some tests checking that disconnecting and then reconnecting works for getting monitors, but I don't think there are any tests of what cleanup PV.disconnect does. Again, I think that the level of cleanup being discussed here is an implementation detail and that we have not really made any promises beyond "callbacks will stop" (but maybe we have?).

Personally, I would say people using the ca interface might need to know about cache clearing and that how to do that could be better documented (say, using the code you have and/or will come up with), but I sort of think that documenting that may be enough.

newville avatar Apr 07 '21 17:04 newville

Good point, @klauer . I didn't think about the epics.ca layer... I take note.

@newville , ok, then I'll start from the code I posted here. Just note that this approach does not disconnect the "low-level" channel (won't call ca.clear_channel). It will only clear monitors and callbacks and make sure there are no dangling references to the PV object, so that it can eventually be garbage collected.

I am fine with that solution, since it will mitigate the memory issues. It might be more performant in case someone wants to connect again to the same PV, because the connection is kept open. Also, it is unlikely that it will interfere with the ca interface. I can make a remark in the documentation of the disconnect method that the low-level channel will be kept open and that the user can call ca.clear_channel if needed (at their own risk).

I'll check the test suite and add some more unit tests before doing any other change.

juanfem avatar Apr 08 '21 08:04 juanfem

I've just submitted a pull request with the proposed fix, including some testing. Sorry for the delay, I've been a bit busy...

Please have a look at the PR #220 and let me know.

juanfem avatar Jun 02 '21 14:06 juanfem

@juanfem thanks! At first glance that looks good, but I'll try to study it more closely over the next week or so.

newville avatar Jun 02 '21 19:06 newville

I think that #220 fixes this issue, at least to first order (maybe see #221).

newville avatar Jun 05 '21 16:06 newville