flux-core icon indicating copy to clipboard operation
flux-core copied to clipboard

libhostlist: hang w/ hostlist_sort()

Open chu11 opened this issue 9 months ago • 1 comments

I discovered this in https://github.com/chaos/genders with the older generation hostlist library and wondered if it existed here as well and it appears it does.

#include <stdio.h>
#include <stdlib.h>
#include </usr/include/flux/hostlist.h>

int main (int argc, char **argv)
{
    struct hostlist *hl = hostlist_create();

    hostlist_append(hl, "elcap[10-20,1-104,201-896,1001-12136,2,40-42,201-896,37-42,45-46,60-61,201-896,1001-12136,1,1017-12120,201-896,1001-12136,2-36,201-896,1,47-59,62-74,77-89,92-104,201-896,1001-12136,45-46,61,1-44,3-36,1001-1016,12121-12136,1,10-20,10-20,10-15,201-896,1017-12120,45-46,60-61,201-896,1001-12136,60,60,201-896,1001-12136,43-44,37-45,1-104,201-896,1001-1016,12121-12136,2-104,201-896,1001-12136,1,201-896,1001-12136,47-59,62-74,77-89,92-104,47-59,62-74,77-89,92-104,75-76,90-91]");

    printf("ranged string = %s\n", hostlist_encode(hl));

    /* printf("do uniq now\n"); */
    /* hostlist_uniq(hl); */

    printf("trying to sort now\n");
    hostlist_sort(hl);
}

> gcc infiniteloop.c -lflux-hostlist
> ./a.out
ranged string = elcap[10-20,1-104,201-896,1001-12136,2,40-42,201-896,37-42,45-46,60-61,201-896,1001-12136,1,1017-12120,201-896,1001-12136,2-36,201-896,1,47-59,62-74,77-89,92-104,201-896,1001-12136,45-46,61,1-44,3-36,1001-1016,12121-12136,1,10-20,10-20,10-15,201-896,1017-12120,45-46,60-61,201-896,1001-12136,60,60,201-896,1001-12136,43-44,37-45,1-104,201-896,1001-1016,12121-12136,2-104,201-896,1001-12136,1,201-896,1001-12136,47-59,62-74,77-89,92-104,47-59,62-74,77-89,92-104,75-76,90-91]
trying to sort now

This appears to hang and won't finish. If you call hostlist_uniq() before hostlist_sort(), then this finishes.

For those who are curious, the ridiculous hostrange is me pushing hosts onto a hostlist as they come up in the genders file (i.e. the order they are listed in the genders file).

chu11 avatar Feb 26 '25 18:02 chu11

I started looking into this and part of the problem seems to be that this is a bit of a nightmare input for quicksort(), especially because hostrange_cmp() returns 0 when the first id in a hostrange matches, which occurs when there are duplicates as in this input (which has a lot of duplicates). This seems to result in some kind of unstable sorting.

When I worked around that issue, the next problem is just the sheer number of intersecting host ranges in this input. (there are over 120K entries in this list)

After sorting host ranges, libhostlist then has to check if there are neighboring intersecting ranges and splits those so that the duplicate hosts are all listed in order. It does this pretty inefficiently, so this stage takes over 4 minutes using this input set. I haven't yet looked to see if there's an improvement to be made there.

I'm still looking for a smaller input list that demonstrates the quicksort issue so I can add a test for at least that part of the fix.

grondo avatar Mar 06 '25 15:03 grondo