Cluster#

A Cluster is a Runhouse primitive that abstracts a particular hardware configuration. It can be an on-demand cluster (requires valid cloud credentials), a BYO (bring-your-own) cluster (requires an IP address and SSH credentials), or a SageMaker cluster (requires an IAM role ARN).

A cluster is assigned a name, through which it can be accessed and reused later on.

Cluster Factory Methods#

runhouse.cluster(name: str, host: str | List[str] | None = None, ssh_creds: Dict | str | None = None, server_port: int | None = None, server_host: str | None = None, server_connection_type: ServerConnectionType | str | None = None, ssl_keyfile: str | None = None, ssl_certfile: str | None = None, domain: str | None = None, den_auth: bool = False, dryrun: bool = False, **kwargs) Cluster | OnDemandCluster | SageMakerCluster[source]#

Builds an instance of Cluster.

Parameters:
  • name (str) – Name for the cluster, to re-use later on.

  • host (str or List[str], optional) – Hostname (e.g. domain or name in .ssh/config), IP address, or list of IP addresses for the cluster (the first of which is the head node).

  • ssh_creds (dict or str, optional) – SSH credentials, passed as dictionary or the name of an SSHSecret object. Example: ssh_creds={'ssh_user': '...', 'ssh_private_key':'<path_to_key>'}

  • server_port (int, optional) – Port to use for the server. If not provided, will use 80 for a server_connection_type of none, 443 for tls, and 32300 for all other connection types.

  • server_host (str, optional) – Host from which the server listens for traffic (i.e. the --host argument passed to runhouse start on the cluster). Defaults to "0.0.0.0" unless connecting to the server with an SSH connection, in which case localhost is used.

  • server_connection_type (ServerConnectionType or str, optional) – Type of connection to use for the Runhouse API server. ssh will start the server via an SSH tunnel. tls will start the server with HTTPS on port 443 using TLS certs, without an SSH tunnel. none will start the server with HTTP, without an SSH tunnel. aws_ssm will start the server with HTTP, using AWS SSM port forwarding.

  • ssl_keyfile (str, optional) – Path to SSL key file to use for launching the API server with HTTPS.

  • ssl_certfile (str, optional) – Path to SSL certificate file to use for launching the API server with HTTPS.

  • domain (str, optional) – Domain name for the cluster. Relevant if enabling HTTPS on the cluster.

  • den_auth (bool, optional) – Whether to use Den authorization on the server. If True, will validate incoming requests with a Runhouse token provided in the auth headers of the request with the format: {"Authorization": "Bearer <token>"}. (Default: False).

  • dryrun (bool) – Whether to load the Cluster object as a dryrun, without creating it if it doesn't exist. (Default: False)

Returns:

The resulting cluster.

Return type:

Union[Cluster, OnDemandCluster, SageMakerCluster]

Example

>>> # using private key
>>> gpu = rh.cluster(host='<hostname>',
>>>                  ssh_creds={'ssh_user': '...', 'ssh_private_key':'<path_to_key>'},
>>>                  name='rh-a10x').save()
>>> # using password
>>> gpu = rh.cluster(host='<hostname>',
>>>                  ssh_creds={'ssh_user': '...', 'password':'*****'},
>>>                  name='rh-a10x').save()
>>> # using the name of an SSHSecret object
>>> gpu = rh.cluster(host='<hostname>',
>>>                  ssh_creds="my_ssh_secret",
>>>                  name='rh-a10x').save()
>>> # Load cluster from above
>>> reloaded_cluster = rh.cluster(name="rh-a10x")
runhouse.ondemand_cluster(name: str, instance_type: str | None = None, num_instances: int | None = None, provider: str | None = None, autostop_mins: int | None = None, use_spot: bool = False, image_id: str | None = None, region: str | None = None, memory: int | str | None = None, disk_size: int | str | None = None, open_ports: int | str | List[int] | None = None, server_port: int | None = None, server_host: str | None = None, server_connection_type: ServerConnectionType | str | None = None, ssl_keyfile: str | None = None, ssl_certfile: str | None = None, domain: str | None = None, den_auth: bool | None = None, dryrun: bool = False, **kwargs) OnDemandCluster[source]#

Builds an instance of OnDemandCluster. Note that image_id, region, memory, disk_size, and open_ports are all passed through to SkyPilot’s Resource constructor.

Parameters:
  • name (str) – Name for the cluster, to re-use later on.

  • instance_type (str, optional) – Type of cloud instance to use for the cluster. This could be a Runhouse built-in type, or your choice of instance type.

  • num_instances (int, optional) – Number of instances to use for the cluster.

  • provider (str, optional) – Cloud provider to use for the cluster.

  • autostop_mins (int, optional) – Number of minutes to keep the cluster up after inactivity, or -1 to keep cluster up indefinitely.

  • use_spot (bool, optional) – Whether or not to use a spot instance.

  • image_id (str, optional) – Custom image ID for the cluster.

  • region (str, optional) – The region to use for the cluster.

  • memory (int or str, optional) – Amount of memory to use for the cluster, e.g. “16” or “16+”.

  • disk_size (int or str, optional) – Amount of disk space to use for the cluster, e.g. “100” or “100+”.

  • open_ports (int or str or List[int], optional) – Ports to open in the cluster’s security group. Note that you are responsible for ensuring that the applications listening on these ports are secure.

  • server_port (int, optional) – Port to use for the server. If not provided, will use 80 for a server_connection_type of none, 443 for tls, and 32300 for all other connection types.

  • server_host (str, optional) – Host from which the server listens for traffic (i.e. the --host argument passed to runhouse start on the cluster). Defaults to "0.0.0.0" unless connecting to the server with an SSH connection, in which case localhost is used.

  • server_connection_type (ServerConnectionType or str, optional) – Type of connection to use for the Runhouse API server. ssh will start the server via an SSH tunnel. tls will start the server with HTTPS on port 443 using TLS certs, without an SSH tunnel. none will start the server with HTTP, without an SSH tunnel. aws_ssm will start the server with HTTP, using AWS SSM port forwarding.

  • ssl_keyfile (str, optional) – Path to SSL key file to use for launching the API server with HTTPS.

  • ssl_certfile (str, optional) – Path to SSL certificate file to use for launching the API server with HTTPS.

  • domain (str, optional) – Domain name for the cluster. Relevant if enabling HTTPS on the cluster.

  • den_auth (bool, optional) – Whether to use Den authorization on the server. If True, will validate incoming requests with a Runhouse token provided in the auth headers of the request with the format: {"Authorization": "Bearer <token>"}. (Default: False).

  • dryrun (bool) – Whether to load the Cluster object as a dryrun, without creating it if it doesn't exist. (Default: False)

Returns:

The resulting cluster.

Return type:

OnDemandCluster

Example

>>> import runhouse as rh
>>> # On-Demand SkyPilot Cluster (OnDemandCluster)
>>> gpu = rh.ondemand_cluster(name='rh-4-a100s',
>>>                  instance_type='A100:4',
>>>                  provider='gcp',
>>>                  autostop_mins=-1,
>>>                  use_spot=True,
>>>                  image_id='my_ami_string',
>>>                  region='us-east-1',
>>>                  ).save()
>>> # Load cluster from above
>>> reloaded_cluster = rh.ondemand_cluster(name="rh-4-a100s")
runhouse.sagemaker_cluster(name: str, role: str = None, profile: str = None, ssh_key_path: str = None, instance_id: str = None, instance_type: str = None, num_instances: int = None, image_uri: str = None, autostop_mins: int = None, connection_wait_time: int = None, estimator: sagemaker.estimator.EstimatorBase | Dict = None, job_name: str = None, server_port: int = None, server_host: str = None, server_connection_type: ServerConnectionType | str = None, ssl_keyfile: str = None, ssl_certfile: str = None, domain: str = None, den_auth: bool = False, dryrun: bool = False, **kwargs) SageMakerCluster[source]#

Builds an instance of SageMakerCluster. See SageMaker Hardware Setup section for more specific instructions and requirements for providing the role and setting up the cluster.

Parameters:
  • name (str) – Name for the cluster, to re-use later on.

  • role (str, optional) – An AWS IAM role (either name or full ARN). Can be passed in explicitly as an argument or provided via an estimator. If not specified will try using the profile attribute or environment variable AWS_PROFILE to extract the relevant role ARN. More info on configuring an IAM role for SageMaker here.

  • profile (str, optional) – AWS profile to use for the cluster. If provided instead of a role, will lookup the role ARN associated with the profile in the local AWS credentials. If not provided, will use the default profile.

  • ssh_key_path (str, optional) – Path (relative or absolute) to private SSH key to use for connecting to the cluster. If not provided, will look for the key in path ~/.ssh/sagemaker-ssh-gw. If not found will generate new keys and upload the public key to the default s3 bucket for the Role ARN.

  • instance_id (str, optional) – ID of the AWS instance to use for the cluster. SageMaker does not expose IP addresses of its instances, so we use an instance ID as a unique identifier for the cluster.

  • instance_type (str, optional) – Type of AWS instance to use for the cluster. More info on supported instance options here. (Default: ml.m5.large.)

  • num_instances (int, optional) – Number of instances to use for the cluster. (Default: 1.)

  • image_uri (str, optional) – Image to use for the cluster instead of using the default SageMaker image which will be based on the framework_version and py_version. Can be an ECR url or dockerhub image and tag.

  • estimator (Union[str, sagemaker.estimator.EstimatorBase], optional) – Estimator to use for a dedicated training job. Leave as None if launching the compute without running a dedicated job. More info on creating an estimator here.

  • autostop_mins (int, optional) – Number of minutes to keep the cluster up after inactivity, or -1 to keep cluster up indefinitely. Note: this will keep the cluster up even if a dedicated training job has finished running or failed.

  • connection_wait_time (int, optional) – Amount of time to wait inside the SageMaker cluster before continuing with normal execution. Useful if you want to connect before a dedicated job starts (e.g. training). If you don’t want to wait, set it to 0. If no estimator is provided, will default to 0.

  • job_name (str, optional) – Name to provide for a training job. If not provided will generate a default name based on the image name and current timestamp (e.g. pytorch-training-2023-08-28-20-57-55-113).

  • server_port (int, optional) – Port to use for the server. (Default: 32300)

  • server_host (str, optional) – Host from which the server listens for traffic (i.e. the --host argument passed to runhouse start on the cluster). Note: For SageMaker, since we connect to the Runhouse API server via an SSH tunnel, the only valid host is localhost.

  • server_connection_type (ServerConnectionType or str, optional) – Type of connection to use for the Runhouse API server. Note: For SageMaker, aws_ssm is currently the only valid server connection type.

  • ssl_keyfile (str, optional) – Path to SSL key file to use for launching the API server with HTTPS.

  • ssl_certfile (str, optional) – Path to SSL certificate file to use for launching the API server with HTTPS.

  • domain (str, optional) – Domain name for the cluster. Relevant if enabling HTTPS on the cluster.

  • den_auth (bool, optional) – Whether to use Den authorization on the server. If True, will validate incoming requests with a Runhouse token provided in the auth headers of the request with the format: {"Authorization": "Bearer <token>"}. (Default: False).

  • dryrun (bool) – Whether to load the SageMakerCluster object as a dryrun, without creating it if it doesn't exist. (Default: False)

Returns:

The resulting cluster.

Return type:

SageMakerCluster

Example

>>> import runhouse as rh
>>> # Launch a new SageMaker instance and keep it up indefinitely.
>>> # Note: This will use Role ARN associated with the "sagemaker" profile defined in the local aws credentials
>>> c = rh.sagemaker_cluster(name='sm-cluster', profile="sagemaker").save()
>>> # Running a training job with a provided Estimator
>>> c = rh.sagemaker_cluster(name='sagemaker-cluster',
>>>                          estimator=PyTorch(entry_point='train.py',
>>>                                            role='arn:aws:iam::123456789012:role/MySageMakerRole',
>>>                                            source_dir='/Users/myuser/dev/sagemaker',
>>>                                            framework_version='1.8.1',
>>>                                            py_version='py36',
>>>                                            instance_type='ml.p3.2xlarge'),
>>>                          ).save()
>>> # Load cluster from above
>>> reloaded_cluster = rh.sagemaker_cluster(name="sagemaker-cluster")

Cluster Class#

class runhouse.Cluster(name: str | None = None, ips: List[str] = None, creds: Secret = None, server_host: str = None, server_port: int = None, ssh_port: int = None, client_port: int = None, server_connection_type: str = None, ssl_keyfile: str = None, ssl_certfile: str = None, domain: str = None, den_auth: bool = False, use_local_telemetry: bool = False, dryrun=False, **kwargs)[source]#
__init__(name: str | None = None, ips: List[str] = None, creds: Secret = None, server_host: str = None, server_port: int = None, ssh_port: int = None, client_port: int = None, server_connection_type: str = None, ssl_keyfile: str = None, ssl_certfile: str = None, domain: str = None, den_auth: bool = False, use_local_telemetry: bool = False, dryrun=False, **kwargs)[source]#

The Runhouse cluster, or system. This is where you can run Functions and store or access data. You can BYO (bring-your-own) a cluster by providing the cluster IP and ssh_creds, or use an on-demand cluster that is spun up/down through SkyPilot, using your cloud credentials.

Note

To build a cluster, please use the factory method cluster().

add_secrets(provider_secrets: List[str], env: str | Env = None)[source]#

Copy secrets from the current environment onto the cluster.
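
Example (illustrative):

>>> cluster.add_secrets(["aws", "lambda"])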

call(module_name, method_name, *args, stream_logs=True, run_name=None, remote=False, run_async=False, save=False, **kwargs)[source]#

Call a method on a module that is in the cluster’s object store.

Parameters:
  • module_name (str) – Name of the module saved on system.

  • method_name (str) – Name of the method.

  • stream_logs (bool) – Whether to stream logs from the method call.

  • run_name (str) – Name for the run.

  • remote (bool) – Return a remote object from the function, rather than the result proper.

  • run_async (bool) – Run the method asynchronously and return an awaitable.

  • *args – Positional arguments to pass to the method.

  • **kwargs – Keyword arguments to pass to the method.

Example

>>> cluster.call("my_module", "my_method", arg1, arg2, kwarg1=kwarg1)
clear()[source]#

Clear the cluster’s object store.

delete(keys: None | str | List[str])[source]#

Delete the given items from the cluster’s object store. To delete all items, use cluster.clear()
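
Example (illustrative; assumes these keys exist in the object store):

>>> cluster.delete(["my_key1", "my_key2"])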

disconnect()[source]#

Disconnect the RPC tunnel.

Example

>>> cluster.disconnect()
download_cert()[source]#

Download the certificate from the cluster. (Note: the user must have access to the cluster.)

enable_den_auth(flush=True)[source]#

Enable Den auth on the cluster.

endpoint(external=False)[source]#

Endpoint for the cluster's Daemon server. If external is True, will only return the external URL, and will return None otherwise (e.g. if a tunnel is required). If external is False, will either return the external URL if it exists, or will set up the connection (based on connection_type) and return the internal URL (including the locally connected port rather than the server port). If the cluster is not up, returns None.
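
Example (illustrative):

>>> url = cluster.endpoint()  # sets up the connection if needed
>>> external_url = cluster.endpoint(external=True)  # None if a tunnel is required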

get(key: str, default: Any | None = None, remote=False)[source]#

Get the result for a given key from the cluster’s object store. To raise an error if the key is not found, use cluster.get(key, default=KeyError).
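
Example (illustrative; assumes my_key was previously put on the cluster):

>>> res = cluster.get("my_key")
>>> res = cluster.get("my_key", default=KeyError)  # raise if the key is not found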

install_packages(reqs: List[Package | str], env: Env | str = None)[source]#

Install the given packages on the cluster.

Parameters:
  • reqs (List[Package or str]) – List of packages to install on the cluster and env.

  • env (Env or str) – Environment to install package on. If left empty, defaults to base environment. (Default: None)

Example

>>> cluster.install_packages(reqs=["accelerate", "diffusers"])
>>> cluster.install_packages(reqs=["accelerate", "diffusers"], env="my_conda_env")
is_connected()[source]#

Whether the RPC tunnel is up.

Example

>>> connected = cluster.is_connected()
is_up() bool[source]#

Check if the cluster is up.

Example

>>> rh.cluster("rh-cpu").is_up()
keys(env=None)[source]#

List all keys in the cluster’s object store.
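
Example (illustrative):

>>> all_keys = cluster.keys()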

notebook(persist: bool = False, sync_package_on_close: str | None = None, port_forward: int = 8888)[source]#

Tunnel into and launch notebook from the cluster.

Example

>>> rh.cluster("test-cluster").notebook()
on_this_cluster()[source]#

Whether this function is being called on the same cluster.

pause_autostop()[source]#

Context manager to temporarily pause autostop. Mainly relevant for on-demand clusters; BYO clusters have no autostop.

put(key: str, obj: Any, env=None)[source]#

Put the given object on the cluster’s object store at the given key.
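
Example (illustrative; my_obj is any serializable Python object):

>>> cluster.put("my_key", my_obj)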

put_resource(resource: Resource, state: Dict | None = None, dryrun: bool = False, env=None)[source]#

Put the given resource on the cluster’s object store. Returns the key (important if name is not set).
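
Example (illustrative; my_resource is any Runhouse resource, e.g. an Env):

>>> key = cluster.put_resource(my_resource)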

remove_conda_env(env: str | CondaEnv)[source]#

Remove conda env from the cluster.

Example

>>> rh.ondemand_cluster("rh-cpu").remove_conda_env("my_conda_env")
rename(old_key: str, new_key: str)[source]#

Rename a key in the cluster’s object store.
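
Example (illustrative):

>>> cluster.rename("old_key", "new_key")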

restart_server(_rh_install_url: str = None, resync_rh: bool = True, restart_ray: bool = True, env: str | Env = None, restart_proxy: bool = False)[source]#

Restart the RPC server.

Parameters:
  • resync_rh (bool) – Whether to resync runhouse. (Default: True)

  • restart_ray (bool) – Whether to restart Ray. (Default: True)

  • env (str or Env, optional) – Specified environment to restart the server on. (Default: None)

  • restart_proxy (bool) – Whether to restart Caddy on the cluster, if configured. (Default: False)

Example

>>> rh.cluster("rh-cpu").restart_server()
run(commands: List[str], env: Env | str = None, stream_logs: bool = True, port_forward: None | int | Tuple[int, int] = None, require_outputs: bool = True, node: str | None = None, run_name: str | None = None, _ssh_mode: str = 'interactive') list[source]#

Run a list of shell commands on the cluster. If run_name is provided, the commands will be sent over to the cluster before being executed and a Run object will be created.

Example

>>> cpu.run(["pip install numpy"])
>>> cpu.run(["pip install numpy"], env="my_conda_env")
>>> cpu.run(["python script.py"], run_name="my_exp")
>>> cpu.run(["python script.py"], node="3.89.174.234")
run_python(commands: List[str], env: Env | str = None, stream_logs: bool = True, node: str = None, port_forward: int | None = None, run_name: str | None = None)[source]#

Run a list of python commands on the cluster, or a specific cluster node if its IP is provided.

Example

>>> cpu.run_python(['import numpy', 'print(numpy.__version__)'])
>>> cpu.run_python(["print('hello')"])
>>> cpu.run_python(["print('hello')"], node="3.89.174.234")

Note

Running Python commands with nested quotes can be finicky. If using nested quotes, try to wrap the outer quotes with double quotes (") and the inner quotes with single quotes (').

save(name: str | None = None, overwrite: bool = True, folder: str | None = None)[source]#

Overrides the default resource save() method in order to also update the cluster config on the cluster itself.

property server_address#

Address to use in the requests made to the cluster. If creating an SSH tunnel with the cluster, this will be set to localhost; otherwise will use the cluster's domain (if provided), or its public IP address.

share(users: str | List[str] | None = None, access_level: ResourceAccess | str = ResourceAccess.READ, visibility: ResourceVisibility | str | None = None, notify_users: bool = True, headers: Dict | None = None) Tuple[Dict[str, ResourceAccess], Dict[str, ResourceAccess]][source]#

Grant access to the resource for a list of users (or a single user). If a user has a Runhouse account they will receive an email notifying them of their new access. If the user does not have a Runhouse account they will also receive instructions on creating one, after which they will be able to have access to the Resource. If visibility is set to public, users will not be notified.

Note

You can only grant access to other users if you have write access to the resource.

Parameters:
  • users (Union[str, list], optional) – Single user or list of user emails and / or runhouse account usernames. If none are provided and visibility is set to public, resource will be made publicly available to all users.

  • access_level (ResourceAccess, optional) – Access level to provide for the resource. Defaults to read.

  • visibility (ResourceVisibility, optional) – Type of visibility to provide for the shared resource. Defaults to private.

  • notify_users (bool, optional) – Whether to send an email notification to users who have been given access. Note: This is relevant for resources which are not shareable. Defaults to True.

  • headers (dict, optional) – Request headers to provide for the request to RNS. Contains the user’s auth token. Example: {"Authorization": f"Bearer {token}"}

Returns:

  • added_users – Users who already have a Runhouse account and have been granted access to the resource.

  • new_users – Users who do not have Runhouse accounts and received notifications via their emails.

  • valid_users – Set of valid usernames and emails from the users parameter.

Return type:

Tuple[Dict, Dict, Set]

Example

>>> # Write access to the resource for these specific users.
>>> # Visibility will be set to private (users can search for and view resource in Den dashboard)
>>> my_resource.share(users=["username1", "[email protected]"], access_level='write')
>>> # Make resource public, with read access to the resource for all users
>>> my_resource.share(visibility='public')
ssh()[source]#

SSH into the cluster.

Example

>>> rh.cluster("rh-cpu").ssh()
status(resource_address: str | None = None)[source]#

Loads the status of the Runhouse daemon running on the cluster.
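
Example (illustrative):

>>> daemon_status = rh.cluster("rh-cpu").status()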

stop_server(stop_ray: bool = True, env: str | Env = None)[source]#

Stop the RPC server.

Parameters:
  • stop_ray (bool) – Whether to stop Ray. (Default: True)

  • env (str or Env, optional) – Specified environment to stop the server on. (Default: None)

sync_secrets(providers: List[str] | None = None, env: str | Env = None)[source]#

Send secrets for the given providers.

Parameters:

providers (List[str] or None) – List of providers to send secrets for. If None, all providers configured in the environment will be sent.

Example

>>> cpu.sync_secrets(providers=["aws", "lambda"])
up_if_not()[source]#

Bring up the cluster if it is not up. No-op if cluster is already up. This only applies to on-demand clusters, and has no effect on self-managed clusters.

Example

>>> rh.cluster("rh-cpu").up_if_not()

Cluster Hardware Setup#

No additional setup is required. You will just need the cluster's IP address and the path to your SSH credentials on hand for the cluster initialization.

OnDemandCluster Class#

An OnDemandCluster is a cluster that uses SkyPilot functionality under the hood to handle various cluster properties.

class runhouse.OnDemandCluster(name, instance_type: str | None = None, num_instances: int | None = None, provider: str | None = None, dryrun=False, autostop_mins=None, use_spot=False, image_id=None, memory=None, disk_size=None, open_ports=None, server_host: str | None = None, server_port: int | None = None, server_connection_type: str | None = None, ssl_keyfile: str | None = None, ssl_certfile: str | None = None, domain: str | None = None, den_auth: bool = False, region=None, **kwargs)[source]#
__init__(name, instance_type: str | None = None, num_instances: int | None = None, provider: str | None = None, dryrun=False, autostop_mins=None, use_spot=False, image_id=None, memory=None, disk_size=None, open_ports=None, server_host: str | None = None, server_port: int | None = None, server_connection_type: str | None = None, ssl_keyfile: str | None = None, ssl_certfile: str | None = None, domain: str | None = None, den_auth: bool = False, region=None, **kwargs)[source]#

On-demand SkyPilot Cluster.

Note

To build a cluster, please use the factory method cluster().

static cluster_ssh_key(path_to_file)[source]#

Retrieve SSH key for the cluster.

Example

>>> ssh_priv_key = rh.ondemand_cluster("rh-cpu").cluster_ssh_key("~/.ssh/id_rsa")
endpoint(external=False)[source]#

Endpoint for the cluster's Daemon server. If external is True, will only return the external URL, and will return None otherwise (e.g. if a tunnel is required). If external is False, will either return the external URL if it exists, or will set up the connection (based on connection_type) and return the internal URL (including the locally connected port rather than the server port). If the cluster is not up, returns None.

is_up() bool[source]#

Whether the cluster is up.

Example

>>> rh.ondemand_cluster("rh-cpu").is_up()
keep_warm(autostop_mins: int = -1)[source]#

Keep the cluster warm for the given number of minutes after inactivity.

Parameters:

autostop_mins (int) – Amount of time (in minutes) to keep the cluster warm after inactivity. If set to -1, keep the cluster warm indefinitely. (Default: -1)
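
Example (illustrative):

>>> cluster.keep_warm(autostop_mins=60)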

pause_autostop()[source]#

Context manager to temporarily pause autostop.

Example

>>> cluster = rh.ondemand_cluster("rh-cpu")
>>> with cluster.pause_autostop():
>>>     cluster.run(["python train.py"])
ssh(node: str | None = None)[source]#

SSH into the cluster. If no node is specified, will SSH onto the head node.

Example

>>> rh.ondemand_cluster("rh-cpu").ssh()
>>> rh.ondemand_cluster("rh-cpu", node="3.89.174.234").ssh()
teardown()[source]#

Teardown cluster.

Example

>>> rh.ondemand_cluster("rh-cpu").teardown()
teardown_and_delete()[source]#

Teardown cluster and delete it from configs.

Example

>>> rh.ondemand_cluster("rh-cpu").teardown_and_delete()
up()[source]#

Up the cluster.

Example

>>> rh.ondemand_cluster("rh-cpu").up()

OnDemandCluster Hardware Setup#

On-Demand clusters use SkyPilot to automatically spin up and down clusters on the cloud. You will need to first set up cloud access on your local machine:

Run sky check to see which cloud providers are enabled, and how to set up cloud credentials for each of the providers.

sky check

For a more in-depth tutorial on setting up individual cloud credentials, you can refer to the SkyPilot setup docs.

SageMakerCluster Class#

Note

SageMaker support is in alpha and under active development. Please report any bugs or let us know of any feature requests.

A SageMakerCluster is a cluster that uses a SageMaker instance under the hood.

Runhouse currently supports two core usage paths for SageMaker clusters:

  • Compute backend: You can use SageMaker as a compute backend, just as you would a BYO (bring-your-own) or an on-demand cluster. Runhouse will handle launching the SageMaker compute and creating the SSH connection to the cluster.

  • Dedicated training jobs: You can use a SageMakerCluster class to run a training job on SageMaker compute. To do so, you will need to provide an estimator.

Note

Runhouse requires an AWS IAM role (either name or full ARN) whose credentials have adequate permissions to create SageMaker endpoints and access AWS resources.

Please see SageMaker Hardware Setup for more specific instructions and requirements for providing the role and setting up the cluster.

class runhouse.SageMakerCluster(name: str, role: str = None, profile: str = None, region: str = None, ssh_key_path: str = None, instance_id: str = None, instance_type: str = None, num_instances: int = None, image_uri: str = None, autostop_mins: int = None, connection_wait_time: int = None, estimator: EstimatorBase | Dict = None, job_name: str = None, server_host: str = None, server_port: int = None, domain: str = None, server_connection_type: str = None, ssl_keyfile: str = None, ssl_certfile: str = None, den_auth: bool = False, dryrun=False, **kwargs)[source]#
__init__(name: str, role: str = None, profile: str = None, region: str = None, ssh_key_path: str = None, instance_id: str = None, instance_type: str = None, num_instances: int = None, image_uri: str = None, autostop_mins: int = None, connection_wait_time: int = None, estimator: EstimatorBase | Dict = None, job_name: str = None, server_host: str = None, server_port: int = None, domain: str = None, server_connection_type: str = None, ssl_keyfile: str = None, ssl_certfile: str = None, den_auth: bool = False, dryrun=False, **kwargs)[source]#

The Runhouse SageMaker cluster abstraction. This is where you can use SageMaker as a compute backend, just as you would an on-demand cluster (i.e. cloud VMs) or a BYO (i.e. on-prem) cluster. Additionally supports running dedicated training jobs using SageMaker Estimators.

Note

To build a cluster, please use the factory method sagemaker_cluster().

property connection_wait_time#

Amount of time the SSH helper will wait inside SageMaker before it continues normal execution

property default_bucket#

Default bucket to use for storing the cluster’s authorized public keys.

is_up() bool[source]#

Check if the cluster is up.

Example

>>> rh.sagemaker_cluster("sagemaker-cluster").is_up()
keep_warm(autostop_mins: int = -1)[source]#

Keep the cluster warm for the given number of minutes after inactivity.

Parameters:

autostop_mins (int) – Amount of time (in minutes) to keep the cluster warm after inactivity. If set to -1, keep cluster warm indefinitely. (Default: -1)

pause_autostop()[source]#

Context manager to temporarily pause autostop.

restart_server(_rh_install_url: str = None, resync_rh: bool = True, restart_ray: bool = True, env: str | Env = None, restart_proxy: bool = False)[source]#

Restart the RPC server on the SageMaker instance.

Parameters:
  • resync_rh (bool) – Whether to resync runhouse. (Default: True)

  • restart_ray (bool) – Whether to restart Ray. (Default: True)

  • env (str or Env) – Env to restart the server from. If not provided will use default env on the cluster.

  • restart_proxy (bool) – Whether to restart Caddy on the cluster, if configured. (Default: False)

Example

>>> rh.sagemaker_cluster("sagemaker-cluster").restart_server()
ssh(interactive: bool = True)[source]#

SSH into the cluster.

Parameters:

interactive (bool) – Whether to start an interactive shell or not (Default: True).

Example

>>> rh.sagemaker_cluster(name="sagemaker-cluster").ssh()
property ssh_key_path#

Relative path to the private SSH key used to connect to the cluster.

status() dict[source]#

Get status of SageMaker cluster.

Example

>>> status = rh.sagemaker_cluster("sagemaker-cluster").status()
teardown()[source]#

Teardown the SageMaker instance.

Example

>>> rh.sagemaker_cluster(name="sagemaker-cluster").teardown()
teardown_and_delete()[source]#

Teardown the SageMaker instance and delete from RNS configs.

Example

>>> rh.sagemaker_cluster(name="sagemaker-cluster").teardown_and_delete()
up()[source]#

Up the cluster.

Example

>>> rh.sagemaker_cluster("sagemaker-cluster").up()
up_if_not()[source]#

Bring up the cluster if it is not up. No-op if cluster is already up.

Example

>>> rh.sagemaker_cluster("sagemaker-cluster").up_if_not()

SageMaker Hardware Setup#

IAM Role#

SageMaker clusters require AWS CLI V2 and configuring the SageMaker IAM role with the AWS Systems Manager.

In order to launch a cluster, you must grant SageMaker the necessary permissions with an IAM role, which can be provided either by name or by full ARN. You can also specify a profile explicitly or with the AWS_PROFILE environment variable.

For example, let’s say your local ~/.aws/config file contains:

[profile sagemaker]
role_arn = arn:aws:iam::123456789:role/service-role/AmazonSageMaker-ExecutionRole-20230717T192142
region = us-east-1
source_profile = default

There are several ways to provide the necessary credentials when initializing the cluster:

  • Providing the AWS profile name: profile="sagemaker"

  • Providing the AWS Role ARN directly: role="arn:aws:iam::123456789:role/service-role/AmazonSageMaker-ExecutionRole-20230717T192142"

  • Environment Variable: setting AWS_PROFILE to "sagemaker"
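
A minimal sketch of each approach (the profile and role ARN refer to the example config above; cluster names are illustrative):

import os
import runhouse as rh

# Option 1: AWS profile name
c = rh.sagemaker_cluster(name="sm-cluster", profile="sagemaker")

# Option 2: role ARN directly
c = rh.sagemaker_cluster(
    name="sm-cluster",
    role="arn:aws:iam::123456789:role/service-role/AmazonSageMaker-ExecutionRole-20230717T192142",
)

# Option 3: environment variable, set before initializing the cluster
os.environ["AWS_PROFILE"] = "sagemaker"
c = rh.sagemaker_cluster(name="sm-cluster")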

Note

If no role or profile is provided, Runhouse will try using the default profile. Note that if this default AWS identity is not a role, you will need to provide the role or profile explicitly.

Tip

If you are providing an estimator, you must provide the role ARN explicitly as part of the estimator object. More info on estimators here.

Please see the AWS docs for further instructions on creating and configuring an ARN Role.

AWS CLI V2#

The SageMaker SDK uses AWS CLI V2, which must be installed on your local machine; see the AWS docs for installation instructions.

To confirm the installation succeeded, run aws --version in the command line. You should see something like:

aws-cli/2.13.8 Python/3.11.4 Darwin/21.3.0 source/arm64 prompt/off

If you are still seeing the V1 version, first try uninstalling V1 in case it is still present (e.g. pip uninstall awscli).

You may also need to add the V2 executable to the PATH of your python environment. For example, if you are using conda, it’s possible the conda env will try using its own version of the AWS CLI located at a different path (e.g. /opt/homebrew/anaconda3/bin/aws), while the system wide installation of AWS CLI is located somewhere else (e.g. /opt/homebrew/bin/aws).

To find the global AWS CLI path:

which aws

To ensure that the global AWS CLI version is used within your python environment, you’ll need to adjust the PATH environment variable so that it prioritizes the global AWS CLI path.

export PATH=/opt/homebrew/bin:$PATH

SSM Setup#

The AWS Systems Manager service is used to create SSH tunnels with the SageMaker cluster.

To install the AWS Session Manager Plugin, please see the AWS docs or SageMaker SSH Helper. The SSH Helper package simplifies the process of creating SSH tunnels with SageMaker clusters. It is installed by default if you are installing Runhouse with the SageMaker dependency: pip install runhouse[sagemaker].

You can also install the Session Manager by running the CLI command:

sm-local-configure

To configure your SageMaker IAM role with the AWS Systems Manager, please refer to these instructions.

Cluster Authentication & Verification#

Runhouse provides a couple of options to manage the connection to the Runhouse API server running on a cluster.

Server Connection#

The below options can be specified with the server_connection_type parameter when initializing a cluster. By default, the Runhouse API server will be started on the cluster on port 32300.

  • ssh: Connects to the cluster via an SSH tunnel, by default on port 32300.

  • tls: Connects to the cluster via HTTPS (by default on port 443) using either a provided certificate, or creating a new self-signed certificate just for this cluster. You must open the needed ports in the firewall, such as via the open_ports argument in the OnDemandCluster, or manually in the compute itself or cloud console.

  • none: Does not use any port forwarding or enforce any authentication. Connects to the cluster with HTTP by default on port 80. This is useful when connecting to a cluster within a VPC, or creating a tunnel manually on the side with custom settings.

  • aws_ssm: Uses the AWS Systems Manager to create an SSH tunnel to the cluster, by default on port 32300. Note: this is currently only relevant for SageMaker Clusters.
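
As a rough sketch, the connection type is passed directly to the cluster factory (names, hosts, and credentials here are placeholders):

import runhouse as rh

# BYO cluster reached over an SSH tunnel (default port 32300)
byo = rh.cluster(name="my-byo-cluster",
                 host="<hostname>",
                 ssh_creds={"ssh_user": "...", "ssh_private_key": "<path_to_key>"},
                 server_connection_type="ssh")

# On-demand cluster serving HTTPS directly on port 443
tls_cluster = rh.ondemand_cluster(name="my-tls-cluster",
                                  instance_type="m5.xlarge",
                                  provider="aws",
                                  server_connection_type="tls",
                                  open_ports=[443])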

Note

The tls connection type is the most secure and is recommended for production use if you are not running inside of a VPC. However, be mindful that you must secure the cluster with authentication (see below) if you open it to the public internet.

Server Authentication#

If desired, Runhouse provides out-of-the-box authentication via users' Runhouse tokens (generated when logging in and set locally at ~/.rh/config.yaml). This is crucial if the cluster has ports open to the public internet, as is usually the case when using the tls connection type. You may also set up your own authentication manually inside your own code, but you should likely still enable Runhouse authentication to ensure that even your non-user-facing endpoints into the server are secured.

When initializing a cluster, you can set the den_auth parameter to True to enable token authentication. Calls to the cluster server can then be made using an auth header with the format: {"Authorization": "Bearer <cluster-token>"}. The Runhouse Python library adds this header to its calls automatically, so your users do not need to worry about it after logging into Runhouse.
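
For example, a minimal sketch of calling a function endpoint manually with the token (the domain, function name, and token are placeholders):

import requests

resp = requests.get(
    "https://<domain>/<function-name>/call?a=run&b=house",
    headers={"Authorization": "Bearer <cluster-token>"},
)
print(resp.json())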

Note

Runhouse never uses your default Runhouse token for anything other than requests made to Runhouse Den. Your token will never be exposed or shared with anyone else.

TLS Certificates#

Enabling TLS and Runhouse Den Dashboard Auth for the API server makes it incredibly fast and easy to stand up a microservice with standard token authentication, allowing you to easily share Runhouse resources with collaborators, teams, customers, etc.

Let’s illustrate this with a simple example:

import runhouse as rh

def concat(a: str, b: str):
    return a + b

# Launch a cluster with TLS and Den Auth enabled
cpu = rh.ondemand_cluster(instance_type="m5.xlarge",
                          provider="aws",
                          name="rh-cluster",
                          den_auth=True,
                          open_ports=[443],
                          server_connection_type="tls").up_if_not()

# Remote function stub which lives on the cluster
remote_func = rh.function(concat).to(cpu)

# Save to Runhouse Den
remote_func.save()

# Give read access to the function to another user - this will allow them to call this service remotely
# and view the function metadata in Runhouse Den
remote_func.share("[email protected]", access_level="read")

# This other user (user1) can then call the function remotely from any python environment
res = remote_func("run", "house")
print(res)  # "runhouse"

We can also call the function via an HTTP request, making it easy for other users to call the function with a Runhouse cluster token (Note: this assumes the user has been granted access to the function or write access to the cluster):

curl -X GET "https://<DOMAIN>/concat/call?a=run&b=house" \
  -H "Content-Type: application/json" -H "Authorization: Bearer <CLUSTER-TOKEN>"

Caddy#

Runhouse gives you the option of using Caddy as a reverse proxy for the Runhouse API server, which is a FastAPI app launched with Uvicorn. Using Caddy provides a safer and more conventional approach: the FastAPI app runs on a higher, non-privileged port (such as 32300, the default Runhouse port), and Caddy acts as a reverse proxy, forwarding requests from the HTTP port (default: 80) or the HTTPS port (default: 443).

Caddy also enables generating and auto-renewing self-signed certificates, making it easy to secure your cluster with HTTPS right out of the box.

Note

Caddy is enabled by default when you launch a cluster with the server_port set to either 80 or 443.

Generating Certs

Runhouse offers two options for enabling TLS/SSL on a cluster with Caddy:

  1. Using existing certs: provide the path to the cert and key files with the ssl_certfile and ssl_keyfile arguments. These certs will be used by Caddy as specified in the Caddyfile on the cluster. If no cert paths are provided and no domain is specified, Runhouse will issue self-signed certificates to use for the cluster. These certs will not be verified by a CA.

  2. Using Caddy to generate CA verified certs: Provide the domain argument. Caddy will then obtain certificates from Let’s Encrypt on-demand when a client connects for the first time.
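
A minimal sketch of both options (cert paths and the domain are placeholders):

import runhouse as rh

# Option 1: bring your own certs
cluster = rh.ondemand_cluster(name="tls-cluster",
                              instance_type="m5.xlarge",
                              provider="aws",
                              server_connection_type="tls",
                              open_ports=[443],
                              ssl_certfile="<path_to_cert>",
                              ssl_keyfile="<path_to_key>")

# Option 2: let Caddy obtain CA-verified certs for a domain from Let's Encrypt
cluster = rh.ondemand_cluster(name="tls-cluster",
                              instance_type="m5.xlarge",
                              provider="aws",
                              server_connection_type="tls",
                              open_ports=[443],
                              domain="<your domain>")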

Using a Custom Domain#

Runhouse also supports custom domains for deploying your apps and services. To configure a domain, add an A record to your domain's DNS settings mapping it to the cluster's public IP address once the cluster is up.

Note

You'll also need to make sure the relevant ports (e.g. 443) are open in the security group settings of the cluster. Runhouse will also automatically set up a TLS certificate for the domain via Caddy.

Once the server is up, you can include its IP and domain when initializing the Runhouse cluster object:

cluster = rh.cluster(name="rh-serving-cpu",
                     ips=["<public IP>"],
                     domain="<your domain>",
                     server_connection_type="tls",
                     open_ports=[443]).up_if_not()

Now we can send modules or functions to our cluster and seamlessly create endpoints which we can then share and call from anywhere.

Let’s take a look at an example of how to deploy a simple LangChain RAG app.

Once the app has been created and sent to the cluster, we can call it via HTTP directly:

import requests

resp = requests.get("https://<domain>/basic_rag_app/invoke?user_prompt=<prompt>")
print(resp.json())

Or via cURL:

curl "https://<domain>/basic_rag_app/invoke?user_prompt=<prompt>"