Performance Tuning¶
This guide covers performance tuning for Lance Spark operations in large-scale ETL and batch processing scenarios.
Understanding Lance's Default Optimization¶
Lance is optimized by default for random access patterns - fast point lookups, vector searches, and selective column reads. These defaults work well for ML/AI workloads where you frequently access individual records or small batches.
For large-scale batch ETL and scan-heavy OLAP operations (writing millions of rows, full table scans, bulk exports), you can tune Lance's environment variables and Spark options to better utilize available resources.
General Recommendations¶
For optimal performance with Lance, we recommend:
- Use fixed-size lists for vector columns - Store embedding vectors as ARRAY<FLOAT> with a fixed dimension. This enables efficient SIMD operations and better compression. See Vector Columns, and the sketch after this list.
- Use blob encoding for multimodal data - Store large binary data (images, audio, video) using Lance's blob encoding to enable efficient access. See Blob Columns.
- Use large varchar for large text - When storing very large string values for use cases like full text search, use the large varchar type to avoid out-of-memory issues. See Large String Columns.
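As a minimal sketch of the first recommendation, the example below writes an embedding column as ARRAY<FLOAT> where every vector has the same length. The table name, column names, and 4-dimensional vectors are illustrative, and the exact mechanism for declaring the fixed dimension is described in Vector Columns.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, FloatType

spark = SparkSession.builder.getOrCreate()

# Every embedding has the same length (4 here), so the column can be
# stored as a fixed-size list of floats rather than a variable-length list.
schema = StructType([
    StructField("id", StringType(), nullable=False),
    StructField("embedding", ArrayType(FloatType(), containsNull=False), nullable=False),
])

df = spark.createDataFrame(
    [("a", [0.1, 0.2, 0.3, 0.4]), ("b", [0.5, 0.6, 0.7, 0.8])],
    schema,
)

df.write \
    .format("lance") \
    .mode("append") \
    .saveAsTable("embeddings_table")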
Write Performance¶
Upload Concurrency¶
Set via environment variable LANCE_UPLOAD_CONCURRENCY (default: 10).
Controls the number of concurrent multipart upload streams to S3. Increasing this to match your CPU core count can improve throughput.
Upload Part Size¶
Set via environment variable LANCE_INITIAL_UPLOAD_SIZE (default: 5MB).
Controls the initial part size for S3 multipart uploads. Larger part sizes reduce the number of API calls and can improve throughput for large writes. However, larger part sizes use more memory and may increase latency for small writes. Use the default for interactive workloads.
Note
Lance automatically increments the multipart upload size by 5MB every 100 uploads, so large file writes progressively use increasingly large upload parts. There is no configuration for a fixed upload size.
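Both of the upload settings above are process environment variables, so they must be visible to the JVMs that perform the writes. One way to pass them to executors is Spark's spark.executorEnv.* configuration, sketched below with illustrative values (20 concurrent streams and a 16MB initial part size, assuming the variable is interpreted as a byte count); on the driver you can export the same variables in the shell before launching.
from pyspark.sql import SparkSession

# Pass Lance write-tuning env vars to executor JVMs; values are illustrative.
spark = SparkSession.builder \
    .config("spark.executorEnv.LANCE_UPLOAD_CONCURRENCY", "20") \
    .config("spark.executorEnv.LANCE_INITIAL_UPLOAD_SIZE", "16777216") \
    .getOrCreate()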
Queued Write Buffer (Experimental)¶
Set via Spark write option use_queued_write_buffer (default: false).
Enables a buffered write mode that improves throughput for large batch writes. When enabled, Lance replaces the default semaphore-based buffer with a queue-based buffer that batches data more efficiently before writing to storage, avoiding fine-grained lock waits between the Spark producer and the Lance writer.
Note
This feature is currently experimental. It will be enabled by default once the community considers it mature.
df.write \
.format("lance") \
.option("use_queued_write_buffer", "true") \
.mode("append") \
.saveAsTable("my_table")
Max Batch Bytes¶
Set via Spark write option max_batch_bytes (default: 268435456, i.e. 256MB).
Controls the maximum in-memory size of each Arrow batch before it is flushed.
Batches are flushed when either the row count reaches batch_size or the allocated memory
reaches max_batch_bytes, whichever comes first.
This prevents OOM when writing tables with very large rows — wide schemas, large binary/string columns, or high-dimensional vector embeddings — where even a modest number of rows can exhaust memory before the row-count threshold is reached.
Small-row workloads are unaffected because the row-count limit is hit first.
Note
When using the queued write buffer, total in-flight Arrow memory can be roughly
queue_depth * max_batch_bytes. You may need to tune queue_depth and max_batch_bytes
together to stay within memory limits.
df.write \
.format("lance") \
.option("max_batch_bytes", "134217728") \
.mode("append") \
.saveAsTable("my_table") # 128MB per batch
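If you also enable the queued write buffer, the note above gives a rough in-flight memory budget of queue_depth * max_batch_bytes. The sketch below simply combines the two documented options; with, for example, a queue depth of 4 and 128MB batches, in-flight Arrow memory is roughly 4 * 128MB = 512MB.
# Queued writes with 128MB batches; budget roughly queue_depth * 128MB of Arrow memory
df.write \
    .format("lance") \
    .option("use_queued_write_buffer", "true") \
    .option("max_batch_bytes", "134217728") \
    .mode("append") \
    .saveAsTable("my_table")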
Max Rows Per File¶
Set via Spark write option max_row_per_file (default: 1,000,000).
Controls the maximum number of rows per Lance fragment file. There is no single recommended value, but be aware that the default is 1 million rows. If you store many multimodal data columns (images, audio, embeddings) without using Lance blob encoding, or store many long text columns, files can become very large. From Lance's perspective, very large files do not hurt read performance, but you may want to reduce this value depending on the object size limits of your object storage.
df.write \
.format("lance") \
.option("max_row_per_file", "500000") \
.mode("append") \
.saveAsTable("my_table")
Large Var Types¶
Set via Spark write option use_large_var_types (default: false).
Switches all string and binary columns to use 64-bit offset Arrow vectors
(LargeVarCharVector / LargeVarBinaryVector) instead of the default 32-bit offset vectors.
This removes the 2GB-per-batch data buffer limit that can cause OversizedAllocationException
when writing rows with very large string or binary values.
Use this when your rows contain large values (hundreds of KB or more per row) and you hit the 2GB overflow error. There is no meaningful performance overhead -- the only difference is 8 bytes per row for the offset buffer instead of 4.
df.write \
.format("lance") \
.option("use_large_var_types", "true") \
.mode("append") \
.saveAsTable("my_table")
Note
For per-column control at table creation time, see the
arrow.large_var_char table property.
Read Performance¶
I/O Threads¶
Set via environment variable LANCE_IO_THREADS (default: 64).
Controls the number of I/O threads used for parallel reads from storage. For large scans, increasing this value enables more concurrent S3 requests.
Caching¶
Lance Spark uses a multi-level caching strategy to minimize redundant I/O and improve query performance.
Note
Caches are isolated per Spark catalog. For multi-tenant deployments, configure one Lance catalog per tenant to provide complete cache and credential isolation. If you configure multiple catalogs, total memory usage is multiplied by the number of catalogs. For example, with default settings (6GB index + 1GB metadata) and 3 catalogs, the maximum cache memory is 21GB (3 × 7GB).
How Caching Works¶
Lance Spark implements two levels of caching:
- Session Cache - Contains index and metadata caches:
  - Index Cache: Caches opened vector indices, fragment reuse indices, and index metadata
  - Metadata Cache: Caches manifests, transactions, deletion files, row ID indices, and file metadata
- Dataset Cache - Caches opened datasets by (catalog, URI, version) key. Since a dataset at a specific version is immutable, this ensures:
  - Each dataset is opened only once per worker
  - All workers read the same version for snapshot isolation
  - Fragments are pre-loaded and cached per dataset
Index Cache Size¶
Set via environment variable LANCE_INDEX_CACHE_SIZE (default: 6GB from Lance native).
Controls the size of the index cache in bytes. The index cache stores vector indices which can be large but provide significant speedup for vector search queries. Increase this if you frequently query tables with vector indices.
Metadata Cache Size¶
Set via environment variable LANCE_METADATA_CACHE_SIZE (default: 1GB from Lance native).
Controls the size of the metadata cache in bytes. The metadata cache stores manifests, file metadata, and other dataset metadata. Each column's metadata can be around 40MB, so increase this if your tables have many columns.
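Like the write-side settings, these are environment variables and must reach the JVMs that perform reads. The sketch below passes them to executors via spark.executorEnv.* with illustrative values (128 I/O threads, an 8GB index cache, and a 2GB metadata cache, assuming the cache sizes are byte counts); export the same variables on the driver if reads are planned or executed there.
from pyspark.sql import SparkSession

# Pass read-side Lance env vars to executor JVMs; values are illustrative.
spark = SparkSession.builder \
    .config("spark.executorEnv.LANCE_IO_THREADS", "128") \
    .config("spark.executorEnv.LANCE_INDEX_CACHE_SIZE", "8589934592") \
    .config("spark.executorEnv.LANCE_METADATA_CACHE_SIZE", "2147483648") \
    .getOrCreate()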