Table#

A Table is a Runhouse primitive that abstracts a particular tabular data storage configuration.

Table Factory Method#

runhouse.table(data=None, name: str | None = None, path: str | None = None, system: str | None = None, data_config: dict | None = None, partition_cols: list | None = None, mkdir: bool = False, dryrun: bool = False, stream_format: str | None = None, metadata: dict | None = None) → Table[source]#

Constructs a Table object, which can be used to interact with the table at the given path.

Parameters:
  • data – Data to be stored in the table.

  • name (Optional[str]) – Name for the table, to reuse it later on.

  • path (Optional[str]) – Full path to the data file.

  • system (Optional[str]) – File system. Currently this must be one of: [file, github, sftp, ssh, s3, gs, azure].

  • data_config (Optional[dict]) – The data config to pass to the underlying fsspec handler.

  • partition_cols (Optional[list]) – List of columns to partition the table by.

  • mkdir (bool) – Whether to create a remote folder for the table. (Default: False)

  • dryrun (bool) – Whether to create the Table if it doesn’t exist, or load a Table object as a dryrun. (Default: False)

  • stream_format (Optional[str]) – Format to stream the Table as. Currently this must be one of: [pyarrow, torch, tf, pandas]

  • metadata (Optional[dict]) – Metadata to store for the table.

Returns:

The resulting Table object.

Return type:

Table

Example

>>> import runhouse as rh
>>> # Create and save (pandas) table
>>> rh.table(
>>>    data=data,
>>>    name="~/my_test_pandas_table",
>>>    path="table_tests/test_pandas_table.parquet",
>>>    system="file",
>>>    mkdir=True,
>>> ).save()
>>>
>>> # Load table from above
>>> reloaded_table = rh.table(name="~/my_test_pandas_table")
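>>>
>>> # The factory can also write partitioned data to a remote file system. The sketch
>>> # below is not from the library's docs: it assumes a pandas DataFrame `df` with a
>>> # "split" column, write access to a hypothetical S3 bucket named "my-bucket", and
>>> # AWS credentials that the underlying fsspec/s3fs handler can pick up.
>>> import pandas as pd
>>> df = pd.DataFrame({"id": range(6), "split": ["train", "train", "val", "val", "test", "test"]})
>>> rh.table(
>>>    data=df,
>>>    name="my_partitioned_table",
>>>    path="my-bucket/table_tests/partitioned_table.parquet",  # hypothetical bucket/path
>>>    system="s3",
>>>    partition_cols=["split"],  # parquet files are partitioned by the "split" column
>>> ).save()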

Table Class#

class runhouse.Table(path: str, name: str | None = None, file_name: str | None = None, system: str | None = None, data_config: dict | None = None, dryrun: bool = False, partition_cols: List | None = None, stream_format: str | None = None, metadata: Dict | None = None, **kwargs)[source]#
__init__(path: str, name: str | None = None, file_name: str | None = None, system: str | None = None, data_config: dict | None = None, dryrun: bool = False, partition_cols: List | None = None, stream_format: str | None = None, metadata: Dict | None = None, **kwargs)[source]#

The Runhouse Table object.

Note

To build a Table, please use the factory method table().

property data: Dataset#

Get the table data. If data is not already cached, return a Ray dataset.

With the dataset object we can stream or convert to other types, for example:

data.iter_batches()
data.to_pandas()
data.to_dask()
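
For instance (a minimal sketch reusing the table name from the factory example above; to_pandas() and iter_batches() are Ray Dataset methods):

>>> reloaded_table = rh.table(name="~/my_test_pandas_table")
>>> ds = reloaded_table.data  # Ray Dataset backing the table
>>> df = ds.to_pandas()  # materialize locally as a pandas DataFrame
>>> for batch in ds.iter_batches(batch_size=10):
>>>     print(batch)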
exists_in_system()[source]#

Whether the table exists in the file system.

Example

>>> table.exists_in_system()
fetch(columns: list | None = None) → pa.Table[source]#

Returns the complete table contents.

Example

>>> table = rh.table(data)
>>> formatted_data = table.fetch()
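>>>
>>> # A sketch of fetching only a subset of columns; assumes the table
>>> # actually has a column named "id".
>>> id_only = table.fetch(columns=["id"])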
read_table_from_file(columns: list | None = None)[source]#

Read a table from its path.

Example

>>> table = rh.table(path="path/to/table")
>>> table_data = table.read_table_from_file()
rm(recursive: bool = True)[source]#

Delete table, including its partitioned files where relevant.

Example

>>> table = rh.table(path="path/to/table")
>>> table.rm()
stream(batch_size: int, drop_last: bool = False, shuffle_seed: int | None = None, shuffle_buffer_size: int | None = None, prefetch_batches: int | None = None)[source]#

Return a local batched iterator over the Ray dataset.

Example

>>> table = rh.table(data)
>>> batches = table.stream(batch_size=4)
>>> for batch in batches:
>>>     print(batch)
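>>>
>>> # A shuffled variant (a sketch, not from the original docs); the shuffle and
>>> # prefetch arguments are forwarded to Ray's batched iteration, so exact
>>> # semantics depend on the installed Ray version.
>>> shuffled = table.stream(batch_size=4, shuffle_seed=42, shuffle_buffer_size=100)
>>> for batch in shuffled:
>>>     print(batch)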
to(system, path=None, data_config=None)[source]#

Copy and return the table on the given filesystem and path.

Example

>>> local_table = rh.table(data, path="local/path")
>>> s3_table = local_table.to("s3")
>>> cluster_table = local_table.to(my_cluster)
write()[source]#

Write the underlying table data to its fsspec URL.

Example

>>> rh.table(data, path="path/to/write").write()