Creating virtual clusters

In Cloudera Data Engineering (CDE), a virtual cluster is an individual auto-scaling cluster with defined CPU and memory ranges. Jobs are associated with virtual clusters, and virtual clusters are associated with an environment. You can create as many virtual clusters as you need. For sizing guidance, see Recommendations for scaling CDE deployments, linked below.

To create a virtual cluster, you must have an environment with Cloudera Data Engineering (CDE) enabled.

  1. In the Cloudera Data Platform (CDP) console, click the Data Engineering tile. The CDE Home page displays.
  2. Click Administration in the left navigation menu, then select the environment in which you want to create a virtual cluster.
  3. In the Virtual Clusters column, click the icon at the top right to create a new virtual cluster.
    If the environment has no virtual clusters associated with it, the page displays a Create DE Cluster button that launches the same wizard.
  4. Enter a Cluster Name.
    Cluster names must meet the following requirements:
    • Begin with a letter
    • Be between 3 and 30 characters (inclusive)
    • Contain only alphanumeric characters and hyphens
  5. Select the Service to create the virtual cluster in.
    The environment you selected before launching the wizard is selected by default, but you can use the wizard to create a virtual cluster in a different CDE service.
  6. Select one of the following CDE cluster types:
    • Core (Tier 1): Batch-based transformation and engineering. This option includes the following:
      • Autoscaling Cluster
      • Spot Instances
      • SDX/Lakehouse
      • Job Lifecycle
      • Monitoring
      • Workflow Orchestration
    • All-Purpose (Tier 2): Develop using interactive sessions and deploy both batch and streaming workloads. This option includes everything in Tier 1, plus the following:
      • Shell Sessions - CLI and Web
      • JDBC/SparkSQL (Coming soon)
      • IDE (Coming soon)
  7. Select the Spark Version to use in the virtual cluster. You cannot use Spark 2 and Spark 3 in the same virtual cluster, but you can have separate Spark 2 and Spark 3 virtual clusters within the same CDE service. You can use multiple Spark 3.x versions in your virtual cluster.
  8. Use the Auto-Scale Max Capacity sliders to set the maximum number of CPU cores and the maximum memory in gigabytes. The cluster will scale up and down as needed to run the submitted Spark applications.
  9. Optional, if spot instances are enabled at the CDE service level: From the Driver and Executors will run on drop-down menu, select whether you want to run drivers and executors on spot instances or on-demand instances. By default, the driver runs on on-demand instances, and the executors run on spot instances. For SLA-bound workloads, select On-demand. For non-SLA workloads, Cloudera recommends leaving the default configuration to take advantage of the cost savings afforded by spot instances. For more information, see Cloudera Data Engineering Spot Instances.
  10. Optional: Click Enable Remote Shuffle Service (Technical Preview) if you want to store Spark shuffle data on remote servers. Using Remote Shuffle Service (RSS) improves resilience when executors are lost and allows jobs to run with regular Dynamic Allocation (without shuffle tracking). Depending on the number of RSS instances, it may also decrease job execution time.
    1. Override Instances: Click to override the recommended RSS instances for this virtual cluster.
      The recommended number of RSS instances for a virtual cluster is based on the CPU quota defined for the virtual cluster and is derived from RSS performance testing results. Increasing the number of RSS instances can decrease job execution time up to a point, but it also increases cost.
    2. Instances: Drag the slider button to specify the number of RSS instances.

      Each RSS instance runs on an i3.xlarge AWS EC2 instance, so each instance adds the cost of one i3.xlarge to the total CDE cost. The total shuffle data volume that a virtual cluster can store at one time is the number of RSS instances multiplied by 880 GiB (gibibytes).

  11. Optional: Select Enable Iceberg analytic tables to enable Spark jobs running within the virtual cluster to create and access Apache Iceberg tables.
  12. Optional: Select Restrict Access to add access control for the virtual cluster. You can search for users to add by name or email address. You can manage users using the Cloudera Data Platform Management Console. For more information, see Managing user access and authorization.
  13. Optional: Click Configure Email Alerting (Technical Preview) if you want to receive notification emails. The email configuration options appear.
    1. You must provide at least Sender Email Address and SMTP Host information.
    2. Test SMTP Configs: Click Test SMTP Configs to verify the SMTP configuration before you create the cluster.
  14. Click Create.
On the CDE Home page, select the environment to view the virtual cluster initialization status. You can also click the three-dot menu for the virtual cluster to view the logs.
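
The naming rules in step 4 can be checked before you open the wizard, for example when cluster names are generated by a script. The following is a minimal sketch, not part of CDE: the function name is_valid_cluster_name is hypothetical, and the pattern assumes that "alphanumeric" means ASCII letters and digits in either case.

```python
import re

# Rules from step 4 of the procedure:
#   - begins with a letter
#   - between 3 and 30 characters, inclusive
#   - contains only alphanumeric characters and hyphens
# Assumption: only ASCII letters and digits count, in upper or lower case.
_CLUSTER_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9-]{2,29}")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if name satisfies the three naming rules above."""
    # fullmatch requires the whole string to match, so trailing
    # characters such as a newline are rejected.
    return _CLUSTER_NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["etl-prod-01", "1etl", "ab", "my_cluster"]:
        print(candidate, is_valid_cluster_name(candidate))
```

The wizard remains the authority on which names are accepted. This check covers only the three rules listed in step 4.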
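
The shuffle capacity figure in step 10 (880 GiB per RSS instance) lends itself to a quick calculation when you choose a value for the Instances slider. The sketch below only restates that arithmetic. The function names are hypothetical, and it does not reproduce the CPU-quota-based recommendation that the wizard computes.

```python
import math

# From step 10: each RSS instance can hold 880 GiB of shuffle data.
RSS_GIB_PER_INSTANCE = 880


def rss_shuffle_capacity_gib(instances: int) -> int:
    """Total shuffle data (GiB) a virtual cluster can store at one time."""
    if instances < 0:
        raise ValueError("instances must be non-negative")
    return instances * RSS_GIB_PER_INSTANCE


def rss_instances_for_volume(shuffle_gib: float) -> int:
    """Smallest number of RSS instances whose capacity covers shuffle_gib."""
    if shuffle_gib < 0:
        raise ValueError("shuffle_gib must be non-negative")
    return math.ceil(shuffle_gib / RSS_GIB_PER_INSTANCE)


if __name__ == "__main__":
    print(rss_shuffle_capacity_gib(3))     # 3 * 880 = 2640
    print(rss_instances_for_volume(2000))  # 2000 / 880 rounds up to 3
```

Because each instance also adds the cost of one i3.xlarge, the smallest instance count that covers the expected peak shuffle volume is a reasonable lower bound to compare against the recommended value.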