pcluster-manager

Create Pcluster Manager Fails

Open · sean-smith opened this issue 2 years ago • 7 comments

Hi,

Using the following cluster template breaks the Create interface after the Storage tab with this error:

[screenshot not preserved]

From the Chrome console I see:

framework-bb5c596eafb42b22.js:1 TypeError: Cannot read properties of undefined (reading 'length')
    at index-e0a78d528f2085ff.js:1:170318
    at Array.filter (<anonymous>)
    at index-e0a78d528f2085ff.js:1:170289
    at _a (index-e0a78d528f2085ff.js:1:170339)
    at oo (framework-bb5c596eafb42b22.js:1:59416)
    at Ku (framework-bb5c596eafb42b22.js:1:111716)
    at Li (framework-bb5c596eafb42b22.js:1:98957)
    at Ni (framework-bb5c596eafb42b22.js:1:98885)
    at Pi (framework-bb5c596eafb42b22.js:1:98748)
    at bi (framework-bb5c596eafb42b22.js:1:95714)
cu @ framework-bb5c596eafb42b22.js:1
main-1ae0bdeb4d020668.js:1 TypeError: Cannot read properties of undefined (reading 'length')
    at index-e0a78d528f2085ff.js:1:170318
    at Array.filter (<anonymous>)
    at index-e0a78d528f2085ff.js:1:170289
    at _a (index-e0a78d528f2085ff.js:1:170339)
    at oo (framework-bb5c596eafb42b22.js:1:59416)
    at Ku (framework-bb5c596eafb42b22.js:1:111716)
    at Li (framework-bb5c596eafb42b22.js:1:98957)
    at Ni (framework-bb5c596eafb42b22.js:1:98885)
    at Pi (framework-bb5c596eafb42b22.js:1:98748)
    at bi (framework-bb5c596eafb42b22.js:1:95714)
ee @ main-1ae0bdeb4d020668.js:1
main-1ae0bdeb4d020668.js:1 A client-side exception has occurred, see here for more info: https://nextjs.org/docs/messages/client-side-exception-occurred

Here's the template that broke it, with some params changed:

Region: us-east-2
Image:
  Os: alinux2
HeadNode:
  InstanceType: c6i.2xlarge
  Networking:
    SubnetId: subnet-123456789
  Ssh:
    KeyName: keypair
  DisableSimultaneousMultithreading: true
  LocalStorage:
    RootVolume:
      Size: 100
      VolumeType: gp3
  CustomActions:
    OnNodeConfigured:
      Script: >-
        https://bucket.s3.us-east-2.amazonaws.com/headnode_install.sh
      Args:
        - HEAD
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  Dcv:
    Enabled: true
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      CapacityType: ONDEMAND
      ComputeResources:
        - Name: compute-hpc6a
          Efa:
            Enabled: true
            GdrSupport: true
          InstanceType: hpc6a.48xlarge
          MinCount: 0
          MaxCount: 101
          DisableSimultaneousMultithreading: true
      Networking:
        SubnetIds:
          - subnet-123456789
        PlacementGroup:
          Enabled: true
      CustomActions:
        OnNodeConfigured:
          Script: >-
            https://bucket.s3.us-east-2.amazonaws.com/compute_install.sh
      ComputeSettings:
        LocalStorage:
          RootVolume:
            VolumeType: gp3
SharedStorage:
  - MountDir: /opt/ncar
    Name: ncar
    StorageType: Ebs
    EbsSettings:
      Size: '35'
      VolumeType: gp3
      DeletionPolicy: Delete
  - MountDir: /scratch
    Name: scratch
    StorageType: FsxLustre
    FsxLustreSettings:
      FileSystemId: fs-0e709f43fbde2c3a2

sean-smith, Nov 14 '22 17:11

Thank you for reaching out. This is a known issue, caused by using a template created with ParallelCluster 3.2.0 in PCM 3.3.0.

There is no fix available at the moment.

As a workaround, you can edit the template: under Scheduling > SlurmQueues > ComputeResources, replace InstanceType with the new Instances property introduced in ParallelCluster 3.3.0.
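The same edit can be scripted. Below is a minimal sketch in Python. It assumes the template has already been loaded into a dict (for example with a YAML library). It also assumes that ParallelCluster 3.3.0 expects each compute resource to list its types as `Instances: [{InstanceType: ...}]` rather than a single `InstanceType` key. The function name `migrate_instance_types` is hypothetical and is not part of ParallelCluster or PCM.

```python
from typing import Any


def migrate_instance_types(template: dict[str, Any]) -> dict[str, Any]:
    """Rewrite pre-3.3.0 compute resources to use the 3.3.0 Instances list.

    Hypothetical sketch: assumes `template` is an already-parsed cluster
    config (e.g. loaded from YAML) and mutates it in place.
    """
    queues = template.get("Scheduling", {}).get("SlurmQueues", [])
    for queue in queues:
        for resource in queue.get("ComputeResources", []):
            # Only convert resources that still use the single-type key
            # and do not already have an Instances list.
            if "InstanceType" in resource and "Instances" not in resource:
                instance_type = resource.pop("InstanceType")
                resource["Instances"] = [{"InstanceType": instance_type}]
    return template


config = {
    "Scheduling": {
        "Scheduler": "slurm",
        "SlurmQueues": [
            {
                "Name": "compute",
                "ComputeResources": [
                    {"Name": "compute-hpc6a", "InstanceType": "hpc6a.48xlarge",
                     "MinCount": 0, "MaxCount": 101},
                ],
            }
        ],
    }
}

migrate_instance_types(config)
print(config["Scheduling"]["SlurmQueues"][0]["ComputeResources"][0])
# {'Name': 'compute-hpc6a', 'MinCount': 0, 'MaxCount': 101, 'Instances': [{'InstanceType': 'hpc6a.48xlarge'}]}
```

Resources that already use `Instances` are left alone, so running the function twice on the same template is harmless.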

mendaomn, Nov 15 '22 08:11

Is there a web page that lists issues like this? These issues are not “known issues” to the customer, so a page of known bugs would help. As a customer, this caused me some churn. Especially when the issue comes from a lack of backwards compatibility in the config file across a relatively minor update (3.2 -> 3.3), it would be helpful to let your customers know.

mcb-silverlining, Nov 15 '22 20:11

@Silver-Linda

Thank you for voicing your concerns. We are working hard on comprehensive documentation for the product, but unfortunately we are not there yet.

We could have pointed this out as a breaking change in our changelog, and we will adopt this practice in the future so that customers know what is changing and how to work around it.

Regarding this specific issue, we are working on a fix, since we realize the app is not supposed to crash due to lack of backwards compatibility (you can follow this issue to be notified when the fix lands).
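The `Cannot read properties of undefined (reading 'length')` error in the console output is the kind of failure that follows from assuming a field such as `Instances` is always present. Below is a minimal sketch in Python of a reader that tolerates both template shapes. The helper `instance_types` is hypothetical: PCM itself is a JavaScript (Next.js) app, and this is not its actual code. The sketch assumes the compute resource has already been parsed into a dict.

```python
from typing import Any


def instance_types(resource: dict[str, Any]) -> list[str]:
    """Return the instance types of one compute resource, whatever its schema.

    Hypothetical helper, not PCM code: prefers the 3.3.0 `Instances` list
    and falls back to the pre-3.3.0 single `InstanceType` key instead of
    assuming `Instances` is always present.
    """
    instances = resource.get("Instances")
    if instances:
        return [entry["InstanceType"] for entry in instances if "InstanceType" in entry]
    single = resource.get("InstanceType")
    return [single] if single else []


old_style = {"Name": "compute-hpc6a", "InstanceType": "hpc6a.48xlarge"}
new_style = {"Name": "compute-hpc6a", "Instances": [{"InstanceType": "hpc6a.48xlarge"}]}
print(instance_types(old_style))  # ['hpc6a.48xlarge']
print(instance_types(new_style))  # ['hpc6a.48xlarge']
print(instance_types({"Name": "empty"}))  # []
```

Returning an empty list for a resource with neither key keeps downstream code such as a `.filter(...)` or a length check from failing on a missing value.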

However, PCM currently does not support creating a cluster from a template created with a previous version. We are actively working out the best approach going forward.

mendaomn, Nov 16 '22 09:11

@mendaomn

Thanks for the transparent reply. Pcluster Manager is an extremely important part of ParallelCluster usability, but the lack of backwards compatibility (a feature Pcluster Manager used to have) makes it very difficult to maintain clusters while updating to new ParallelCluster versions. Not only does this undermine my plans to keep the YAML config file under version control, but it is also hard to create a YAML config file from scratch, even with Pcluster Manager. ParallelCluster's move to a complicated YAML config file made some sort of tool for creating that file necessary. Backwards compatibility seems critical to me.

mcb-silverlining, Nov 16 '22 16:11

Re-opening as current release still has this issue.

sean-smith, Jan 10 '23 19:01

We definitely need to update the template used for this workshop to include the workaround, now that the Pcluster Manager template defaults to 3.3.0.

We also need to update the CLI instructions to install pcluster 3.3.0. Otherwise the CLI creates the cluster with 3.4.0 (currently the latest version, and therefore the default), and you can't go back and view that CLI-created cluster in the 3.3.0 version of Pcluster Manager. You'll get the error "Cluster hpc belongs to an incompatible ParallelCluster major version".

natalie-white-aws, Jan 10 '23 19:01

@natalie-white-aws my PR updates the template to resolve this issue and we'll release 3.4.0 with ParallelCluster Manager shortly.

sean-smith, Jan 12 '23 20:01