Distributed Compaction¶
As Lance datasets evolve over time (e.g., frequent appends / overwrites), they can accumulate many small fragments. Compaction rewrites fragments into fewer, larger fragments to improve scan and query performance.
Lance-Ray provides a distributed compaction workflow backed by Ray workers.
compact_files¶
compact_files(
uri=None,
*,
table_id=None,
compaction_options=None,
num_workers=4,
storage_options=None,
namespace_impl=None,
namespace_properties=None,
ray_remote_args=None,
)
Compact files in a single Lance table (dataset) using distributed Ray workers.
Parameters:
uri: Dataset URI to compact (eitheruriORnamespace_impl+table_idrequired)table_id: Table identifier as a list of strings (requiresnamespace_impl)compaction_options: Optionallance.optimize.CompactionOptionsinstancenum_workers: Number of Ray workers to use (default: 4)storage_options: Optional storage configuration dictionarynamespace_impl: Namespace implementation type (e.g.,"rest","dir")namespace_properties: Properties for connecting to the namespaceray_remote_args: Optional kwargs for Ray remote tasks
Returns: CompactionMetrics or None if no compaction tasks are needed.
compact_database¶
compact_database(
*,
database,
namespace_impl,
namespace_properties=None,
compaction_options=None,
num_workers=4,
storage_options=None,
ray_remote_args=None,
)
Compact all tables under a given database (namespace).
This function lists tables under database via the namespace API and runs compact_files on each table.
Parameters:
database: Database (namespace) identifier as a list of path segments, e.g.['my_database']namespace_impl: Namespace implementation type (e.g.,"rest","dir")namespace_properties: Properties for connecting to the namespacecompaction_options: Optionallance.optimize.CompactionOptionsinstance (applied to every table)num_workers: Number of Ray workers per table (default: 4)storage_options: Optional storage configuration dictionaryray_remote_args: Optional kwargs for Ray remote tasks
Returns: A list of dictionaries, one per table, with keys table_id (the full table identifier) and metrics (the compaction result, or None if no compaction was needed).
Examples¶
Compact a single table by URI¶
import lance_ray as lr
metrics = lr.compact_files(
uri="/path/to/table.lance",
num_workers=4,
)
print(metrics)
Compact a table via namespace¶
import lance_ray as lr
metrics = lr.compact_files(
uri=None,
namespace_impl="dir",
namespace_properties={"root": "/path/to/tables"},
table_id=["my_table"],
num_workers=2,
)
print(metrics)
Compact an entire database¶
from lance.optimize import CompactionOptions
import lance_ray as lr
results = lr.compact_database(
database=["my_db"],
namespace_impl="dir",
namespace_properties={"root": "/path/to/tables"},
compaction_options=CompactionOptions(target_rows_per_fragment=10000),
num_workers=2,
)
for item in results:
print(item["table_id"], item["metrics"])