
inefficient saving of dataset_db.pyon

Open jwu-ions opened this issue 2 years ago • 15 comments

Bug Report

One-Line Summary

ARTIQ hardware signal output and dashboard updates freeze/drop out for about 5 seconds every 30 seconds when RPC is used.

Issue Details

MWE (TTLs and DDSes are initialized separately beforehand)

from artiq.experiment import delay
from artiq.language.units import us
from artiq.language import kernel
from artiq.language.environment import EnvExperiment

class DropOutDebugging(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl29")
        self.setattr_device("scheduler")

    @kernel
    def run(self):
        self.core.reset()

        while not self.scheduler.check_pause():
            self.core.break_realtime()
            delay(100*us)
            self.ttl29.pulse(100000*us)

If "while True" is used above instead the pulses loop successfully.

Otherwise, every 30 seconds the pulses disappear for about 5 seconds, then return. If the TTL is overridden using moninj then the pulses are unaffected. If DDSes are used, they freeze at whatever frequency/amplitude they are set to at the end of the while loop.

During the signal drop out, datasets can be updated/deleted but the update doesn't show up on the dashboard until the dropout is over.

This duty cycle is consistent in other experiments as well (about 35 seconds per cycle), but if a loop without check_pause() is used, the hardware output is maintained indefinitely.

Most relevant hardware has been checked/swapped:

  • Computer
  • PCI Ethernet adapter
  • Ethernet cable and all connectors/feedthroughs
  • KC705

Your System (omit irrelevant parts)

ARTIQ 3.7 on Windows 7, KC705 directly connected to the PC via Ethernet cable.

jwu-ions avatar Aug 09 '21 21:08 jwu-ions

MWE

Does this really need to involve the core device/kernels? I suspect not. What happens with:

    def run(self):  # note: no @kernel; requires "import time" at module level
        i = 0
        while not self.scheduler.check_pause():
            print(i)
            i += 1
            time.sleep(0.1)

If the TTL is overridden using moninj then the pulses are unaffected.

What does that mean? If the TTL is overridden it shouldn't pulse.

Windows 7

Are you running antivirus software on that lovely system?

sbourdeauducq avatar Aug 09 '21 22:08 sbourdeauducq

Does this really need to involve the core device/kernels? I suspect not.

Yes, trying your example yields the same results; every ~300 counts (about 30 seconds at 0.1 s per iteration) we get a dropout.

What does that mean? If the TTL is overridden it shouldn't pulse.

Yup. That was the expected behavior; sorry for not being clear.

Are you running antivirus software on that lovely system?

We have whatever antivirus protection is standard here at NIST. I should add that the computer that we tried swapping to (and later swapped back from) was a brand new Windows 10 computer with barely anything else on it so I would hope that this is not being caused by a virus or anything similar.

jwu-ions avatar Aug 09 '21 23:08 jwu-ions

Does this really need to involve the core device/kernels? I suspect not.

Yes, trying your example yields the same results; every 300 counts we get a dropout.

OK and that happens without a dashboard connected? And with just the minimum number of components i.e. only the master and one submission via artiq_client; no other requests or connections to the master, no controller manager, etc.

Does it also happen with artiq_run?

sbourdeauducq avatar Aug 09 '21 23:08 sbourdeauducq

OK and that happens without a dashboard connected? And with just the minimum number of components i.e. only the master and one submission via artiq_client; no other requests or connections to the master, no controller manager, etc.

Does it also happen with artiq_run?

We tried with no dashboard using artiq_run and the test code works fine; we also tried a more complicated script and that works with no dropouts as well.

jwu-ions avatar Aug 10 '21 17:08 jwu-ions

@sbourdeauducq any ideas for what we can try next?

jwu-ions avatar Aug 17 '21 19:08 jwu-ions

Obviously you should install an operating system.

Short of this, are you able to try with the antivirus disabled? Antivirus software tends to cause this sort of random I/O performance problem, and check_pause does perform I/O (via the IPC named pipe and possibly the log).

Have you tried without the dashboard connected and just a submission via artiq_client as suggested?

Other than that, add timed prints at various places inside the check_pause code to see where the blockage is.
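
For example, something along these lines could be used (a rough sketch only; which functions to wrap, e.g. check_pause itself or the worker pipe read/write helpers, is for you to decide):

    import time

    def timed(label):
        # Rough instrumentation sketch: wrap a suspect function and report
        # any call that takes longer than 50 ms.
        def wrap(f):
            def inner(*args, **kwargs):
                t0 = time.monotonic()
                try:
                    return f(*args, **kwargs)
                finally:
                    dt = time.monotonic() - t0
                    if dt > 0.05:
                        print("{} took {:.3f} s".format(label, dt))
            return inner
        return wrap

Applying it by hand, e.g. Scheduler.check_pause = timed("check_pause")(Scheduler.check_pause), avoids touching the call sites.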

sbourdeauducq avatar Aug 17 '21 23:08 sbourdeauducq

Short of this, are you able to try with the antivirus disabled? They tend to cause this sort of random I/O performance problems and check_pause does perform I/O (via the IPC named pipe and possibly the log).

Unfortunately I do not think we are able to disable the antivirus. We have allowed openocd and python connections through the firewall and this has not helped.

Have you tried without the dashboard connected and just a submission via artiq_client as suggested?

I used the command line with artiq_run and things worked fine, so it seems like connecting the dashboard causes the problem, as you suspected:

OK and that happens without a dashboard connected? And with just the minimum number of components i.e. only the master and one submission via artiq_client; no other requests or connections to the master, no controller manager, etc.

Does it also happen with artiq_run?

We tried with no dashboard using artiq_run and the test code works fine; we also tried a more complicated script and that works with no dropouts as well.

Other than that, add timed prints at various place inside the check_pause code to see where the blockage is.

I've modified the code as follows:

    def check_pause(self, rid):
        """Returns ``True`` if there is a condition that could make ``pause``
        not return immediately (termination requested or higher priority run).

        The typical purpose of this function is to check from a kernel
        whether returning control to the host and pausing would have an effect,
        in order to avoid the cost of switching kernels in the common case
        where ``pause`` does nothing.

        This function does not have side effects, and does not have to be
        followed by a call to ``pause``.
        """
        print("Point 1")
        for pipeline in self._pipelines.values():
            print("Point 2")
            if rid in pipeline.pool.runs:
                print("Point 3")
                run = pipeline.pool.runs[rid]
                print("Point 4")
                if run.status != RunStatus.running:
                    return False
                print("Point 5")
                if run.termination_requested:
                    return True
                print("Point 6")

                prepared_runs = filter(lambda r: r.status == RunStatus.prepare_done,
                                       pipeline.pool.runs.values())
                print("Point 7")
                try:
                    r = max(prepared_runs, key=lambda r: r.priority_key())
                except ValueError:
                    # prepared_runs is an empty sequence
                    return False
                print("Point 8")
                return r.priority_key() > run.priority_key()

            print("Point 9")
        raise KeyError("RID not found")

When there are no queued experiments, it always prints "Point 1" up to "Point 7", even during the dropout. When there is another experiment queued after, it prints "Point 8" as well (as expected). So it appears to complete the whole block before stopping?

jwu-ions avatar Aug 27 '21 17:08 jwu-ions

We have allowed openocd and python connections through the firewall and this has not helped.

OpenOCD is not involved in this issue.

I used the command line with artiq_run and things worked fine, so it seems like connecting the dashboard causes the problem, as you suspected:

artiq_run is a simple single-process program which does not use the master/worker.

When there are no queued experiments, it always prints "Point 1" up to "Point 7", even during the dropout.

OK, put some more timed prints around the IPC code, which is probably where the problem is. Again, this likely would not be an issue if you were using an operating system.

sbourdeauducq avatar Aug 27 '21 23:08 sbourdeauducq

We are now using a Windows 10 computer but the problem has persisted. I have a new MWE which doesn't even involve the scheduler.

The code below runs fine from the Anaconda prompt, but when run from the dashboard it stops outputting every ~30 seconds for ~5 seconds. After the code resumes, the outputs from the last 5 seconds appear all at once.

from artiq.language.environment import EnvExperiment
import time

class DropOutDebugging(EnvExperiment):
    def run(self):
        i = 0
        while 1:
            i += 1
            time.sleep(0.1)
            print(i)

jwu-ions avatar Feb 24 '22 20:02 jwu-ions

Maybe the code is still running in the background, only the log isn't updated in real-time. Try writing to a network socket to confirm. This was supposed to prevent this issue: https://github.com/m-labs/artiq/blob/bdaaf3c1d7fe362b9c741f4e8c13bca1e52d8874/artiq/master/worker.py#L88
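
Something along these lines would do (a rough sketch; the class name, address and port are placeholders, and any UDP listener on the receiving side works). If datagrams keep arriving during the apparent freeze, the experiment is still running and only the log forwarding to the dashboard is delayed:

    import socket
    import time

    from artiq.language.environment import EnvExperiment

    class SocketHeartbeat(EnvExperiment):
        def run(self):
            # Placeholder address/port; point this at any UDP listener.
            dest = ("192.0.2.1", 9999)
            sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            i = 0
            while True:
                i += 1
                time.sleep(0.1)
                # Send a sequence number and timestamp every iteration.
                msg = "{} {:.3f}\n".format(i, time.time())
                sock.sendto(msg.encode(), dest)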

sbourdeauducq avatar Feb 24 '22 23:02 sbourdeauducq

We've found that the issue is related to the file size of dataset_db.pyon. With a blank dataset file, the issue is unnoticeable/gone.

Our original file size was 19 MB, which caused 5-second dropouts every 30 seconds. I eventually cut it down to 3.11 MB, and we still have some dropouts (>10 ms long, every 30 seconds). While we will be more cautious with how and which data are saved going forward, it is conceivable that the files could grow large again.

Is there any way to avoid this issue?

jwu-ions avatar Feb 25 '22 22:02 jwu-ions

Saving changes to dataset_db.pyon isn't done efficiently at the moment. Something like SQLite (plus tracking the changes) would probably solve it, though it's a bit heavy-handed to use SQL for just a key-value store. Do you know of better alternatives?
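
As a rough illustration of the SQLite option (names and schema are purely illustrative, not a proposed ARTIQ API), each change becomes a small incremental write instead of a full rewrite of the store:

    import pickle
    import sqlite3

    class SQLiteKeyValue:
        """Illustrative key-value store backed by SQLite."""
        def __init__(self, path):
            self.db = sqlite3.connect(path)
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")

        def __setitem__(self, key, value):
            # Insert or overwrite a single key; cost is independent of the
            # total database size.
            self.db.execute("REPLACE INTO kv VALUES (?, ?)",
                            (key, pickle.dumps(value)))
            self.db.commit()

        def __getitem__(self, key):
            row = self.db.execute("SELECT v FROM kv WHERE k = ?",
                                  (key,)).fetchone()
            if row is None:
                raise KeyError(key)
            return pickle.loads(row[0])

        def __delitem__(self, key):
            self.db.execute("DELETE FROM kv WHERE k = ?", (key,))
            self.db.commit()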

sbourdeauducq avatar Feb 25 '22 23:02 sbourdeauducq

Or we can make the datasets a full SQL database. If there's a use case for that...

sbourdeauducq avatar Feb 25 '22 23:02 sbourdeauducq

I think we will have to discuss this a bit more within our group to figure out what is appropriate going forward. My last thought about this right now is: what is it about the dashboard that causes the issue to manifest (or increases its impact) compared to when the Anaconda prompt is used?

@dhslichter may follow up later.

jwu-ions avatar Mar 03 '22 21:03 jwu-ions

If by "anaconda prompt" you mean artiq_run, that's because it does not periodically save the dataset db file.

sbourdeauducq avatar Mar 03 '22 22:03 sbourdeauducq