
Multiple Plotters

thomascooper opened this issue • 34 comments

Hey! Love the script, thanks for all the work you've put into it. I have 3 separate dedicated plotting machines, but the plot_manager.py and kill_nc.py scripts (as well as the send/receive plot scripts) are designed for a single plotter. Do you have any suggestions or thoughts around removing this limitation?

Initially I decided to just append the hostname to the remote_checkfile variable so the plotters would not kill each other's netcats, but multiple netcats over a gigabit network slow the transfer down from approx 110 MB/s on a single file to a total of 60-70 MB/s across all files. I'd love to avoid that loss of speed if possible; I am already at the cusp of needing 10GbE.

thomascooper avatar Jun 29 '21 04:06 thomascooper
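
For reference, the hostname-append workaround described above can be a one-line change; a minimal sketch, assuming a hypothetical checkfile path (the real path and variable live in the project's config):

import socket

# Hypothetical path; in practice this comes from the plot_manager config.
remote_checkfile = '/root/plot_manager/remote_transfer_is_active'

# Suffix the plotter's hostname so each plotter only ever checks (and kills)
# its own transfer marker instead of every plotter's.
remote_checkfile = f'{remote_checkfile}_{socket.gethostname()}'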

Thomas - I have thought about the exact same thing and you are on the right track. I was planning on doing something like I do now with multiple harvesters, where I get information from each one and make a choice about where I am going to send a plot.

Are you running just a single harvester or multiple harvesters/plotters?

If you have a single harvester and multiple plotters, we would just need to make sure that only a single plotter was transferring at a time; otherwise you could saturate your network, even if we used different ncat ports, which is how I was thinking of doing it.

Let me know more about your particular configuration and let me see if I can brainstorm an idea as to how to make it work.

rjsears avatar Jul 02 '21 17:07 rjsears
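
One way to enforce the "only one plotter transferring at a time" rule is a shared lock held on the harvester; a rough sketch under the assumption that the plotters can create a file on the harvester (paths and names here are hypothetical, not how the project currently works):

import os
import socket

LOCKFILE = '/root/plot_manager/transfer_in_progress'  # hypothetical shared path on the harvester

def try_acquire_transfer_slot() -> bool:
    """Atomically create the lock file naming this plotter. Only one plotter
    can succeed, so only one transfer runs at a time even if each plotter
    uses its own ncat port."""
    try:
        fd = os.open(LOCKFILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, 'w') as f:
        f.write(socket.gethostname())
    return True

def release_transfer_slot() -> None:
    """Remove the lock once the plot has been sent."""
    if os.path.exists(LOCKFILE):
        os.remove(LOCKFILE)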

@thomascooper I waited for @rjsears to implement it, but gave up and ordered a ready-made Python script that uses socat. I have 9 plotting machines and 1 harvester, and a 10GbE network. I am making 9 plots every 11 minutes (1 on each plotting machine) and it is working perfectly. The only thing is you need 9 disks on the harvester with enough space to handle it, because copying 2 plots to 1 disk will slow things down. One socat connection copies a plot to a 7200 rpm HDD in about 11-12 minutes, so I need to slow the plotters down a little bit. I think the problem will be solved when I bond two 10GbE NICs on the harvester with 802.3ad. @thomascooper how did you solve this?

maxbelz avatar Aug 14 '21 20:08 maxbelz

I have a single harvester and 2 plotters capable of 8 TB/day. The harvester has 4GbE, the plotters 1GbE each, so no need to worry about network saturation.

The current version of the script checks the traffic load on the NIC being used and stops if there is any, so yeah, it is very much designed for a single-plotter scenario. My super simple solution was to create an LXC container on the farming machine and make it my "second" harvester; this way each of my plotters has a dedicated harvester to connect to.

klemens-st avatar Sep 14 '21 07:09 klemens-st

In the end I did it a little bit differently. Using ncat I am able to copy a plot from my plotters to my harvesters in about 6 minutes. Each of my plotters has a list of harvesters that it will talk to (defined in the config file), and it makes a determination once per minute about which harvester to send the plot to, based on the number of available plots on that particular harvester (or the number of old plots to replace). In this way my harvesters all carry roughly the same number of plots most of the time.

Since I now have 13 harvesters and 6 dedicated plotters, each doing 288 plots per day (madmax on quad-CPU Dell servers), I needed a large number of harvesters to receive the plots. No need to slow anything down until I run low on drive space. Additionally, many of my harvesters are also running madmax and plotting internally to themselves (see move_local_plots.py). So as you can see, it is no longer "very much designed for a single plotter": I am now up to 6 dedicated plotters and another 8 not-so-dedicated plotters. Instead, each plotter has a subset of harvesters that it will talk to, and within that subset it decides each time it runs which harvester to send the next plot to. In my opinion this was a far more sustainable solution than having one plotter send multiple plots at the same time, at least in my situation.

I am well over 3 PiB and this setup works fantastically for me. I guess when I get more time I will try to think of a better way to do this, but what I have also found is that when my plotters are running at 100% CPU and memory utilization, I get zero benefit from trying to send multiple plots at the same time; the boxes are just not powerful enough to make plots and saturate my 10G network.

rjsears avatar Sep 14 '21 13:09 rjsears
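
A simplified sketch of that selection logic follows; the dictionary below stands in for the per-harvester JSON exports the script actually reads, and the hostnames and numbers are made up:

# Each entry mimics the kind of information a harvester exports.
harvesters = {
    'chianas01': {'plots_available': 312, 'old_plots_to_replace': 0},
    'chianas02': {'plots_available': 145, 'old_plots_to_replace': 57},
    'chianas03': {'plots_available': 298, 'old_plots_to_replace': 12},
}

def pick_harvester(harvesters: dict) -> str:
    """Send the next plot to the harvester with the most room (free plot
    slots plus old plots eligible for replacement). Run once per minute,
    this keeps plot counts roughly even across all harvesters."""
    return max(
        harvesters,
        key=lambda h: harvesters[h]['plots_available'] + harvesters[h]['old_plots_to_replace'],
    )

print(pick_harvester(harvesters))  # -> 'chianas01'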

Sorry about the delay in my response. For those of us with a single harvester/NAS/farmer but multiple plotters, the current system has a few major issues.

plot_manager.py checks whether ncat is running and whether there is traffic on the local plotter; if there is not, then at the start of processing a plot it calls a remote kill of all netcats. This is problematic with multiple plotters, because every plotter will see that it is not transferring and then kill all remote transfers. This leads all plotters into a loop where they continually kill each other's netcats and never finish a single transfer. I have fixed this problem with a hack: I added a function that remotely checks for ncat on the server and stops plot_manager if it finds one, then added a script to the NAS that checks its network traffic and kills all ncats if there is no traffic. This solution works for the most part, but sometimes the NAS script skips a beat on traffic and kills good transfers, and sometimes the plotters all run at the same time and start 3-4 ncat processes.

I think the best fix would be to have the plot_manager scripts only look at the remote checkfile, and have that be 100% responsible for starting a new plot transfer. Then add a new script to the NAS with 2 functions:

  1. Look for the checkfile. If the checkfile exists (and was created more than 3 minutes ago), check for an ncat process and a transfer rate; if either is missing, killall ncat and delete the checkfile.
  2. Look for ncat. If it exists, monitor traffic for 20 seconds; if there is no traffic, killall ncat and delete any checkfiles.

I'll take a stab at creating this script myself and see where I end up.

thomascooper avatar Sep 16 '21 15:09 thomascooper
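
A rough sketch of the NAS-side watchdog proposed above, meant to be run from cron; the checkfile path, interface name, and traffic threshold are all assumptions, not values from the project:

#!/usr/bin/env python3
"""Watchdog sketch for the NAS: clean up stale checkfiles and hung ncat
processes. Paths, interface, and thresholds are placeholders."""
import os
import subprocess
import time

CHECKFILE = '/root/plot_manager/remote_transfer_is_active'  # assumed path
STALE_AFTER = 180      # rule 1: checkfile created more than 3 minutes ago
SAMPLE_SECONDS = 20    # rule 2: watch traffic for 20 seconds
IFACE = 'eno2'

def ncat_running() -> bool:
    return subprocess.run(['pgrep', '-x', 'ncat'], capture_output=True).returncode == 0

def rx_bytes(iface: str) -> int:
    with open(f'/sys/class/net/{iface}/statistics/rx_bytes') as f:
        return int(f.read())

def traffic_detected(iface: str, seconds: int, threshold: int = 1_000_000) -> bool:
    """True if more than ~1 MB arrived on the interface during the sample window."""
    before = rx_bytes(iface)
    time.sleep(seconds)
    return rx_bytes(iface) - before > threshold

def kill_ncat_and_cleanup() -> None:
    subprocess.run(['pkill', '-x', 'ncat'])
    if os.path.exists(CHECKFILE):
        os.remove(CHECKFILE)

def main() -> None:
    if os.path.exists(CHECKFILE):
        # Rule 1: stale checkfile with no ncat or no transfer traffic -> clean up.
        stale = time.time() - os.path.getmtime(CHECKFILE) > STALE_AFTER
        if stale and (not ncat_running() or not traffic_detected(IFACE, SAMPLE_SECONDS)):
            kill_ncat_and_cleanup()
    elif ncat_running() and not traffic_detected(IFACE, SAMPLE_SECONDS):
        # Rule 2: ncat running but the wire is quiet -> clean up.
        kill_ncat_and_cleanup()

if __name__ == '__main__':
    main()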

No worries. OK, so a couple of questions: are you wanting to send multiple plots from multiple plotters to a single harvester at the same time? What is your network infrastructure? I did set up something similar for a friend by hard-coding different ncat ports and tweaking the checks a little bit. The other option is to do what you are suggesting and not worry about killing the ncat, but rather pass along active-transfer information with the export that already goes from the harvester to the plotter, as the definitive answer to whether a transfer is in progress. But that would really depend on what you are trying to accomplish in the long run.

I am absolutely certain we could figure out a way to make it work, I just really need to understand your specific use case.

rjsears avatar Sep 16 '21 17:09 rjsears

The expectation is to send a single plot at a time to the NAS. I have a network infrastructure very similar to your example: a 10.x.x.x network (eno2 on each machine) for plot_manager transfers, and a 192.168.x.x network for external communication.

1x 48-port switch (2 VLANs), 1GbE
3x standalone plotters (each with dual 1GbE NICs)
1x NAS server (dual 1GbE NICs, HBA, also a local plotter)
3x NetApp DS4243

thomascooper avatar Sep 16 '21 18:09 thomascooper

OK, so just thinking out loud here: once per minute (via cron) I create a JSON export file for each harvester that has a ton of info for reporting, selecting which harvester to send plots to, etc. I could easily add a key to that file that indicates whether a transfer is currently taking place. When you run ./plot_manager.py, it grabs those JSON files from each harvester, or your single harvester in this case, and makes decisions based on the information contained in the file.

I could check whether there is an existing transfer that way (driven by the harvester) and not allow a new transfer in that case. Transfers would then be handled on a first-come, first-served basis.

rjsears avatar Sep 16 '21 18:09 rjsears
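
On the plotter side, the check could be as simple as reading that key out of the harvester's export; a minimal sketch, using the export path and the remote_transfer_active key that appear later in this thread (error handling omitted):

import json

def harvester_busy(export_path: str) -> bool:
    """Return True if the harvester's JSON export reports an active transfer."""
    with open(export_path) as f:
        export = json.load(f)
    return bool(export.get('remote_transfer_active', False))

if harvester_busy('/root/plot_manager/export/nas-1_export.json'):
    print('Remote Transfer in Progress, will try again soon!')
else:
    print('No transfer in progress, OK to send the next plot.')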

That sounds like a good solution in my opinion.

thomascooper avatar Sep 16 '21 19:09 thomascooper

OK, let me play around with it for a bit. I already updated the export to include the info, now I need to figure out how to do it on the plotter side....

rjsears avatar Sep 16 '21 20:09 rjsears

OK, I think V0.97 will work for you now. You need to install it on both your harvester (first) and then your plotters. Make sure to run ./drive_manager.py on your harvester to generate the new export before running ./plot_manager.py on your plotters.

Basically, it will not allow you to send a plot to a harvester if that harvester is reporting an active transfer. I also removed the remote call to kill_nc.sh, since I upgraded from nc to ncat and I don't think it has the same issues. I will have to keep an eye on it to see.

Let me know how it works out for you.

rjsears avatar Sep 16 '21 23:09 rjsears

Everything worked on individual runs, and manually testing plot_manager.py gave the new error message as expected. But at some point the drive_manager.py script must have marked remote_transfer_active as false in the export and let a second upload start. It has happened multiple times since I updated.

thomascooper avatar Sep 17 '21 02:09 thomascooper

Hmm, the only thing I can think of is that drive_manager.py on the harvester does not think there is a transfer going on and hence marked it as false. Are you running drive_manager.py once per minute via cron?

I might suggest manually viewing the export file (located in the export directory) for the harvester while a transfer is in progress and making sure it shows that a transfer is still going. If it does not, we need to troubleshoot a bit further to see why it thinks the transfer has stopped.

rjsears avatar Sep 17 '21 02:09 rjsears

99% of the time when I check, the export shows true as expected, but randomly it reverts to false. I have to assume it's an issue with the transfer speed checking.

Good one:

root@nas-1:~/plot_manager# ./drive_manager.py && cat export/nas-1_export.json
Welcome to drive_manager.py Version: 0.97 (2021-09-16)
check_temp_drive_utilization() started
Temp Drive(s) check complete. All OK!
check_dst_drive_utilization() started
DST Drive(s) check complete. All OK!
check_plots_available() started
Plot check complete. All OK!
nas_report_export() started
check_for_active_remote_transfer() called
Remote Transfer File does not exist, lets check for network traffic to verify....
check_network_activity() called
Network Activity detected on eno2
Network traffic has been detected, a Remote Transfer is in progress.
send_new_plot_notification() Started
update_receive_plot() Started
Replace Plots Set, will call build script for plot replacement!
Checking to see if we need to fill empty drives first......
fill_empty_drives_first flag is set. Checking for empty drive space.....
Found Empty Drive Space!
Low Water Mark: 10 and we have 250 available
Remote Transfer in Progress, will try again soon!
{"server": "nas-1", ..., "remote_transfer_active": true, "trigger": "2"}

Then randomly, a few minutes later (while the transfer is still active and I can still see ncat in iotop and ps), I get the following.

root@nas-1:~/plot_manager# ./drive_manager.py && cat export/nas-1_export.json
Welcome to drive_manager.py Version: 0.97 (2021-09-16)
check_temp_drive_utilization() started
Temp Drive(s) check complete. All OK!
check_dst_drive_utilization() started
DST Drive(s) check complete. All OK!
check_plots_available() started
Plot check complete. All OK!
nas_report_export() started
check_for_active_remote_transfer() called
Remote Transfer File does not exist, lets check for network traffic to verify....
check_network_activity() called
No Network Activity detected on eno2
No Current Remote Transfers are taking place.
send_new_plot_notification() Started
update_receive_plot() Started
Replace Plots Set, will call build script for plot replacement!
Checking to see if we need to fill empty drives first......
fill_empty_drives_first flag is set. Checking for empty drive space.....
Found Empty Drive Space!
Low Water Mark: 10 and we have 250 available
Currently Configured Plot Drive: /mnt/enclosure0/front/row5/drive20
System Selected Plot Drive:      /mnt/enclosure0/front/row5/drive20
Configured and Selected Drives Match!
No changes necessary to /root/plot_manager/receive_plot.sh
Plots left available on configured plotting drive: 106
{"server": "nas-1", ...m "remote_transfer_active": false, "trigger": "2"}

I tested it again right after, and it was true again.

thomascooper avatar Sep 17 '21 02:09 thomascooper

Just as a band-aid I tried running the following: pgrep ncat | awk 'NR >= 2' | xargs -n1 kill

This looks at all ncat processes and, if it finds more than one running, kills all except the first one. This works, but it leaves the first ncat averaging 15-20 MB/s (instead of the 110 MB/s it was doing before the second one started).

thomascooper avatar Sep 17 '21 03:09 thomascooper

To make sure that cron isn't just trying to start all 3 remote plotters at the same exact moment, I applied the following changes to the crontabs of the 3 remote plotters:

0-57/3 * * * * cd /root/plot_manager && /usr/bin/python3 /root/plot_manager/plot_manager.py >/dev/null 2>&1
1-58/3 * * * * cd /root/plot_manager && /usr/bin/python3 /root/plot_manager/plot_manager.py >/dev/null 2>&1
2-59/3 * * * * cd /root/plot_manager && /usr/bin/python3 /root/plot_manager/plot_manager.py >/dev/null 2>&1 

This should run the script every 3 minutes, but offset by 1 minute per server so that their runs do not collide with each other.

thomascooper avatar Sep 17 '21 03:09 thomascooper

Maybe throw a generalized try/except around the entire check_network_activity() call and, if it errors, just try it again. When I was running my manual script to look for duplicate ncats and kill them, every now and again I would get an error trying to read the network_stats.io file, sort of as if the network check silently failed and left no file.

Here is what I did simply to get past that issue:

while True:
  try:
    transfer_isactive = check_for_active_remote_transfer()
    break
  except FileNotFoundError as err:
    pass

This particular version is not elegant at all and is subject to getting stuck in a loop if there is any error other than a FileNotFoundError, but it generally allowed the script a second pass, which never failed twice in a row.

thomascooper avatar Sep 17 '21 03:09 thomascooper
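
A bounded variant of that retry idea, so an error other than a missing stats file cannot leave the script spinning forever; the retry count and delay are arbitrary, and the checker function is passed in rather than assumed:

import time

def transfer_active_with_retry(check, retries: int = 3, delay: float = 2.0) -> bool:
    """Call a check_for_active_remote_transfer()-style function, retrying
    only when the network_stats.io file is missing."""
    for _ in range(retries):
        try:
            return check()
        except FileNotFoundError:
            # The stats file was not there yet; give the shell helper a
            # moment to recreate it and try again.
            time.sleep(delay)
    # Retries exhausted: assume a transfer IS active so we fail safe and
    # never start a second copy on a possibly busy link.
    return True

# Usage sketch (check_for_active_remote_transfer comes from drive_manager.py):
# transfer_isactive = transfer_active_with_retry(check_for_active_remote_transfer)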

I woke up this morning to a hung ncat process with no transfers ongoing and all 3 plotters out of space, so this must have happened early in the night.

thomascooper avatar Sep 17 '21 13:09 thomascooper

OK, so I think I need to add the ncat kill back in so that there are no longer any hung ncat processes. I can also rewrite the check_network_activity() function with better error checking to verify, as you say, that the file exists and, if it does not, run the check again.

I have to think about it a little differently than I was before, so let me think about how I can rewrite it and work on it.

FWIW, all of my plotters and harvesters are running 0.97 and none of them had stuck ncat processes, so I am not sure where it is hanging up!! Very frustrating.

rjsears avatar Sep 17 '21 16:09 rjsears

OK, so my first step was to alter the install script and rewrite the check_network_io.sh script, having it do the heavy lifting to make certain we always have an output file. That should eliminate the need for your change above. It might not be the perfect solution, but I think it will eliminate those weird circumstances where no network_stats.io file is found.

I am still working on the false 'false' when there is a network transfer in progress. I cannot get it to fail on any of my systems, so I am not sure what I am looking at. I am trying to think of a way around it to make sure it reports that variable correctly!

This is the new script (created by running ./install.sh nas):

#! /bin/bash
# Script automatically created by install script on: 2021-09-17 10:48:49"
network_interface=$1

check_network_traffic(){
echo -e "Checking for network traffic on $network_interface"
if [[ -f /root/plot_manager/network_stats.io ]]; then
   echo -e "\nFound old stats file, deleting...."
   rm /root/plot_manager/network_stats.io
   /usr/bin/sar -n DEV 1 3 | egrep $network_interface > /root/plot_manager/network_stats.io
   sleep 1
else
   echo -e "\nDid not find old stats file...."
   /usr/bin/sar -n DEV 1 3 | egrep $network_interface > /root/plot_manager/network_stats.io
   sleep 1
fi

if [[ -f /root/plot_manager/network_stats.io ]]; then
   return 0
else
   return 1
fi
}

until check_network_traffic ; do : ; done

rjsears avatar Sep 17 '21 18:09 rjsears
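
For completeness, here is one way the Python side might read the network_stats.io file that script produces. This is only a sketch: the threshold is a guess, and sar's column layout varies between sysstat versions, so the parser keys off the interface name rather than fixed column positions.

def network_activity(stats_file: str = '/root/plot_manager/network_stats.io',
                     interface: str = 'eno2',
                     threshold: float = 10.0) -> bool:
    """Return True if any sampled rate for the interface exceeds the threshold."""
    try:
        with open(stats_file) as f:
            lines = f.readlines()
    except FileNotFoundError:
        # The shell script loops until the file exists, so a missing file is
        # unexpected; err on the side of an active transfer.
        return True
    for line in lines:
        fields = line.split()
        if interface in fields:
            idx = fields.index(interface)
            rates = [float(x) for x in fields[idx + 1:]
                     if x.replace('.', '', 1).isdigit()]
            if any(rate > threshold for rate in rates):
                return True
    return False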

Just as a band-aid I tried running the following: pgrep ncat | awk 'NR >= 2' | xargs -n1 kill

This looks at all ncat processes and, if it finds more than one running, kills all except the first one. This works, but it leaves the first ncat averaging 15-20 MB/s (instead of the 110 MB/s it was doing before the second one started).

Yeah, I think I have to go back to making sure ncat does not stay running. It is not supposed to keep running after receiving the EOF from the sending side, but obviously that is not the case. I think the way you have your cron set up may help, since in theory there should not be any ncat running after the transfer has taken place, and none of the other plotters should be able to start a transfer if one is already taking place.

rjsears avatar Sep 17 '21 19:09 rjsears

OK, so I did some more reading up on ncat (https://nmap.org/ncat/guide/ncat-file-transfer.html) and added a couple of options to both the send and receive scripts. The big one was the --send-only flag on the plotter: it makes ncat quit immediately on EOF, which should force the listener to quit as well. Give it a shot and tell me if it helps any.

rjsears avatar Sep 18 '21 00:09 rjsears
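
As an illustration of what that flag changes, a send script might invoke ncat roughly like this; the host, port, and plot path below are placeholders, not the project's actual values:

import subprocess

plot_path = '/mnt/ssd/plots/plot-k32-example.plot'   # placeholder plot file
harvester, port = '10.0.0.50', '4040'                # placeholder destination

with open(plot_path, 'rb') as plot:
    # --send-only makes ncat close the connection and exit as soon as it hits
    # EOF on its input, which in turn lets the listening ncat on the harvester
    # exit cleanly instead of hanging around after the transfer.
    subprocess.run(['ncat', '--send-only', harvester, port], stdin=plot, check=True)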

OK, so in looking over the code, I decided to prevent a false 'false' (which leads to a double plot transfer): if we get a false from check_network_activity(), we re-run it a second time just to verify that the false is really correct. I am sure it is not the best method, but I think it may resolve the weird false return when it should be true instead.

rjsears avatar Sep 18 '21 01:09 rjsears
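
The double-check amounts to something like the following sketch; check_network_activity() is the project's function (passed in here), and the pause length is a guess:

import time

def confirmed_no_transfer(check_network_activity, pause: float = 5.0) -> bool:
    """Only trust a 'no transfer' answer if two checks in a row agree."""
    if check_network_activity():
        return False
    # Got a 'false': wait a moment and re-check before believing it, to avoid
    # the false negative that let a second plot transfer start.
    time.sleep(pause)
    return not check_network_activity()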

Sounds good, that is what I had my makeshift addition doing. So far so good with your newest commits; I'll let you know if I see any issues.

thomascooper avatar Sep 18 '21 01:09 thomascooper

Perfect, thanks! I really appreciate working with you on this stuff, helps me find errors and better ways of doing things.

rjsears avatar Sep 18 '21 02:09 rjsears

No problem. At the moment everything is running smoothly with only 2 personal customizations:

  1. Changed column to * in all the globs so that I could name my NetApp DS by "row" rather than column.
  2. Offset the start time of the crontab on each of the plotters so they don't all try to start a job at the same time.

thomascooper avatar Sep 18 '21 04:09 thomascooper

Yeah, I eventually want to redo the globs somehow so that it is easier to use other mount formats; I just haven't had time to figure it out or play with it yet.

rjsears avatar Sep 18 '21 04:09 rjsears

24+ hours and no issues!

thomascooper avatar Sep 19 '21 06:09 thomascooper

Good deal!

rjsears avatar Sep 20 '21 13:09 rjsears

Found an issue as space is running low: I have a single drive left with space, and only 10 plots' worth. I am getting the following error with move_local_plots.py:

root@nas-1:~/plot_manager# ./move_local_plots.py
Welcome to move_local_plots.py: Version 0.97 (2021-09-16)
update_move_local_plot() Started
Traceback (most recent call last):
  File "./move_local_plots.py", line 391, in <module>
    main()
  File "./move_local_plots.py", line 384, in main
    update_move_local_plot()
  File "./move_local_plots.py", line 87, in update_move_local_plot
    internal_plot_drive_to_use = get_internal_plot_drive_to_use()[0]
IndexError: string index out of range

thomascooper avatar Sep 20 '21 14:09 thomascooper

OK, that is very weird. I specifically addressed that exact issue:

def get_internal_plot_drive_to_use():
    """
        Same as above but returns the next drive. This is the drive we will use for internal plots. We do
        this to make sure we are not over saturating a single drive with multiple plot copies. When you run
        out of drives, these scripts will fail.
        """
    available_drives = []
    try:
        for part in psutil.disk_partitions(all=False):
            if part.device.startswith('/dev/sd') \
                    and part.mountpoint.startswith('/mnt/enclosure') \
                    and get_drive_info('space_free_plots_by_mountpoint', part.mountpoint) >= 1 \
                    and get_drive_by_mountpoint(part.mountpoint) not in chianas.offlined_drives:
                drive = get_drive_by_mountpoint(part.mountpoint)
                available_drives.append((part.mountpoint, part.device, drive))
        return (natsorted(available_drives)[1])
    except IndexError: # We will get an IndexError when we run out of drive space
        # If we have no more drive space available on a drive not already being used to store local plots AND we are using pools and have elected to
        # replace non-pool plots, fall back to returning the pool plot internal replacement drive:
        if chianas.pools and chianas.replace_non_pool_plots: # Sanity check, must have pools and replace_non_pool_plots set to true in config file.
            log.debug('CAUTION: No additional internal drives are available for use! Since you have replace_non_pools_plots set,')
            log.debug('we are going to return the next available local plot drives with old plots to replace.')
            return chiaplots.local_plot_drive
        else:
            # If we have no more drives left, fall back to the only drive left on the system with space available
            log.debug('CAUTION: No additional internal drives are available for use! Defaulting to using the next available drive with space available.')
            log.debug('This can cause contention on the drive bus and slow down all transfers, internal and external. It is recommended that you resolve')
            log.debug('this issue if you are able.')
            notify('Drive Overlap!', 'Internal and External plotting drives now overlap! Suggest fixing to prevent drive bus contention and slow transfers. If you have selected plot replacement, we will attempt to convert to replacement now.')
            return get_plot_drive_to_use()

Specifically this:

except IndexError: # We will get an IndexError when we run out of drive space

Can you check get_internal_plot_drive_to_use() in drive_manager.py and make sure that code is in there...?

rjsears avatar Sep 20 '21 15:09 rjsears

Sorry about the delay; yes, that code is there. The error went away the moment I added a new hard drive and the space adjusted. Once I get near low space again in the next 10 or 15 days, I will test and reach out. I am also going to start a different thread to discuss an odd speed issue I am seeing.

thomascooper avatar Sep 22 '21 16:09 thomascooper

Once or twice a day I still catch two NCATs running, even with the changes to remote_transfer.

thomascooper avatar Sep 23 '21 14:09 thomascooper

OK, thanks for the update. I am just not certain what is causing that, but let me rebuild the kill script for ncat; I will call it immediately after the transfer is complete to try to prevent it from whacking a transfer in progress.

rjsears avatar Sep 24 '21 00:09 rjsears