Table
A Table is a Runhouse primitive used for abstracting a particular tabular data storage configuration.
Table Factory Method
- runhouse.table(data=None, name: str | None = None, path: str | None = None, system: str | None = None, data_config: dict | None = None, partition_cols: list | None = None, mkdir: bool = False, dryrun: bool = False, stream_format: str | None = None, metadata: dict | None = None) → Table
Constructs a Table object, which can be used to interact with the table at the given path.
- Parameters:
data – Data to be stored in the table.
name (Optional[str]) – Name for the table, to reuse it later on.
path (Optional[str]) – Full path to the data file.
system (Optional[str]) – File system. Currently this must be one of: [file, github, sftp, ssh, s3, gs, azure].
data_config (Optional[dict]) – The data config to pass to the underlying fsspec handler.
partition_cols (Optional[list]) – List of columns to partition the table by.
mkdir (bool) – Whether to create a remote folder for the table. (Default: False)
dryrun (bool) – Whether to create the Table if it doesn't exist, or load a Table object as a dryrun. (Default: False)
stream_format (Optional[str]) – Format to stream the Table as. Currently this must be one of: [pyarrow, torch, tf, pandas].
metadata (Optional[dict]) – Metadata to store for the table.
- Returns:
The resulting Table object.
- Return type:
Table
Example
>>> import runhouse as rh
>>> # Create and save (pandas) table
>>> rh.table(
>>>     data=data,
>>>     name="~/my_test_pandas_table",
>>>     path="table_tests/test_pandas_table.parquet",
>>>     system="file",
>>>     mkdir=True,
>>> ).save()
>>>
>>> # Load table from above
>>> reloaded_table = rh.table(name="~/my_test_pandas_table")
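Because `system` and `stream_format` each accept only a fixed set of strings, it can be useful to check them before calling the factory. The following is a minimal sketch of such a check. `check_table_args` and the two constants are hypothetical names introduced here for illustration and are not part of Runhouse; the allowed values are copied from the parameter descriptions in this reference.

```python
from typing import Optional

# Allowed values copied from the parameter descriptions in this reference.
ALLOWED_SYSTEMS = ("file", "github", "sftp", "ssh", "s3", "gs", "azure")
ALLOWED_STREAM_FORMATS = ("pyarrow", "torch", "tf", "pandas")


def check_table_args(system: Optional[str] = None,
                     stream_format: Optional[str] = None) -> None:
    """Hypothetical pre-flight check: raise ValueError on an unsupported value."""
    if system is not None and system not in ALLOWED_SYSTEMS:
        raise ValueError(f"system must be one of {ALLOWED_SYSTEMS}, got {system!r}")
    if stream_format is not None and stream_format not in ALLOWED_STREAM_FORMATS:
        raise ValueError(
            f"stream_format must be one of {ALLOWED_STREAM_FORMATS}, "
            f"got {stream_format!r}"
        )


# Both values are in the documented lists, so this call returns without error.
check_table_args(system="s3", stream_format="pandas")
```

Passing `None` for either argument skips that check, mirroring the optional defaults in the factory signature.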
Table Class
- class runhouse.Table(path: str, name: str | None = None, file_name: str | None = None, system: str | None = None, data_config: dict | None = None, dryrun: bool = False, partition_cols: List | None = None, stream_format: str | None = None, metadata: Dict | None = None, **kwargs)
- __init__(path: str, name: str | None = None, file_name: str | None = None, system: str | None = None, data_config: dict | None = None, dryrun: bool = False, partition_cols: List | None = None, stream_format: str | None = None, metadata: Dict | None = None, **kwargs)
The Runhouse Table object.
Note
To build a Table, please use the factory method table().
- property data: Dataset
Get the table data. If data is not already cached, return a Ray dataset.
With the dataset object we can stream or convert to other types, for example:
data.iter_batches()
data.to_pandas()
data.to_dask()
- exists_in_system()
Whether the table exists in the file system.
Example
>>> table.exists_in_system()
- fetch(columns: list | None = None) → pa.Table
Returns the complete table contents.
Example
>>> table = rh.table(data)
>>> formatted_data = table.fetch()
- read_table_from_file(columns: list | None = None)
Read a table from its path.
Example
>>> table = rh.table(path="path/to/table")
>>> table_data = table.read_table_from_file()
- rm(recursive: bool = True)
Delete table, including its partitioned files where relevant.
Example
>>> table = rh.table(path="path/to/table")
>>> table.rm()
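To make the effect of `recursive` concrete, here is a small local-filesystem analogue. It is only a sketch under stated assumptions: `remove_table_path` is a hypothetical helper, the `year=...` directory names are an assumed partition layout, and raising an error when `recursive=False` meets a directory is a choice made for this example. None of this is taken from Runhouse's implementation.

```python
import shutil
import tempfile
from pathlib import Path


def remove_table_path(path: Path, recursive: bool = True) -> None:
    """Hypothetical local analogue of deleting a table's files."""
    if path.is_dir():
        if not recursive:
            raise IsADirectoryError(f"{path} is a directory; pass recursive=True")
        shutil.rmtree(path)  # removes every partition subdirectory and file
    elif path.exists():
        path.unlink()  # table stored as a single file


# Build a small partitioned layout in a temporary directory, then delete it.
root = Path(tempfile.mkdtemp())
table_dir = root / "my_table"
for year in ("2022", "2023"):
    part = table_dir / f"year={year}"  # assumed partition directory naming
    part.mkdir(parents=True)
    (part / "part-0.parquet").write_bytes(b"")  # empty placeholder file

remove_table_path(table_dir)
print(table_dir.exists())  # False: the directory and its partitions are gone
shutil.rmtree(root)  # clean up the temporary directory
```

A table stored as one file takes the `unlink` branch, so the `recursive` flag only matters when the path is a directory of partition files.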
- stream(batch_size: int, drop_last: bool = False, shuffle_seed: int | None = None, shuffle_buffer_size: int | None = None, prefetch_batches: int | None = None)
Return a local batched iterator over the Ray dataset.
Example
>>> table = rh.table(data)
>>> batches = table.stream(batch_size=4)
>>> for _, batch in batches:
>>>     print(batch)
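The interaction between `batch_size`, `drop_last`, and `shuffle_seed` can be illustrated without Runhouse or Ray. The sketch below assumes these parameters carry their conventional meanings; `iter_batches` here is a hypothetical pure-Python stand-in, and it shuffles the whole input in memory rather than using a bounded shuffle buffer as `shuffle_buffer_size` suggests.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def iter_batches(rows: Iterable[T], batch_size: int, drop_last: bool = False,
                 shuffle_seed: Optional[int] = None) -> Iterator[List[T]]:
    """Illustrative stand-in for batched streaming; not Runhouse or Ray code."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    items = list(rows)
    if shuffle_seed is not None:
        # A fixed seed produces the same row order on every run.
        random.Random(shuffle_seed).shuffle(items)
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        if drop_last and len(batch) < batch_size:
            break  # discard the final, incomplete batch
        yield batch


# Ten rows in batches of four: two full batches and one partial batch.
for batch in iter_batches(range(10), batch_size=4):
    print(batch)
```

With ten rows and `batch_size=4`, the loop yields batches of 4, 4, and 2 rows; setting `drop_last=True` discards the final 2-row batch, which is useful when downstream code expects every batch to have the same size.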