INSERT INTO¶
Add data to existing Lance tables using SQL or DataFrames.
Basic Insert¶
# Create new data
new_data = [
(8, "Henry", "henry@example.com"),
(9, "Ivy", "ivy@example.com")
]
new_df = spark.createDataFrame(new_data, ["id", "name", "email"])
# Append to existing table
new_df.writeTo("users").append()
# Alternative: use traditional write API with mode
new_df.write.mode("append").saveAsTable("users")
// Create new data
val newData = Seq(
(8, "Henry", "henry@example.com"),
(9, "Ivy", "ivy@example.com")
)
val newDF = newData.toDF("id", "name", "email")
// Append to existing table
newDF.writeTo("users").append()
// Alternative: use traditional write API with mode
newDF.write.mode("append").saveAsTable("users")
// Create new data
List<Row> newData = Arrays.asList(
RowFactory.create(8L, "Henry", "henry@example.com"),
RowFactory.create(9L, "Ivy", "ivy@example.com")
);
Dataset<Row> newDF = spark.createDataFrame(newData, schema);
// Append to existing table
newDF.writeTo("users").append();
// Alternative: use traditional write API with mode
newDF.write().mode("append").saveAsTable("users");
Insert with Column Specification¶
Insert from SELECT¶
Insert with Complex Data Types¶
Writing Vector Data¶
If you created a table with the arrow.fixed-size-list.size property (see CREATE TABLE),
subsequent writes will automatically use FixedSizeList. No additional configuration is needed:
# Table was created with: 'embeddings.arrow.fixed-size-list.size' = '128'
# Subsequent writes automatically use FixedSizeList encoding
# Plain schema WITHOUT metadata - it just works!
data = [(i, [float(j) for j in range(128)]) for i in range(10, 20)]
df = spark.createDataFrame(data, ["id", "embeddings"])
df.writeTo("embeddings_table").append()
// Table was created with: 'embeddings.arrow.fixed-size-list.size' = '128'
// Subsequent writes automatically use FixedSizeList encoding
List<Row> rows = new ArrayList<>();
for (int i = 10; i < 20; i++) {
float[] vector = new float[128];
for (int j = 0; j < 128; j++) {
vector[j] = random.nextFloat();
}
rows.add(RowFactory.create(i, vector));
}
Dataset<Row> df = spark.createDataFrame(rows, schema);
df.writeTo("embeddings_table").append();
Writing Blob Data¶
If you created a table with the lance.encoding property (see CREATE TABLE),
subsequent writes will automatically use blob encoding. No additional configuration is needed:
# Table was created with: 'content.lance.encoding' = 'blob'
# Subsequent writes automatically use blob encoding
# Plain schema WITHOUT metadata - it just works!
data = [(i, f"Document {i}", bytearray(b"Large content..." * 10000)) for i in range(10, 20)]
df = spark.createDataFrame(data, ["id", "title", "content"])
df.writeTo("documents").append()
// Table was created with: 'content.lance.encoding' = 'blob'
// Subsequent writes automatically use blob encoding
List<Row> data = new ArrayList<>();
for (int i = 10; i < 20; i++) {
byte[] content = new byte[100000];
Arrays.fill(content, (byte) 0x42);
data.add(RowFactory.create(i, "Document " + i, content));
}
Dataset<Row> df = spark.createDataFrame(data, schema);
df.writeTo("documents").append();
Copying blob v2 columns¶
Blob v2 columns read as descriptor structs. Copy them with a direct select:
Only direct column references (or simple aliases) from a single Lance source table are
copied this way; ORDER BY and LIMIT on the SELECT are supported. writeTo().append()
and .overwrite() work the same way. Transformed values (e.g. content.size),
DISTINCT, GROUP BY, UNION, ordering by the blob column itself, and joins of multiple
blob sources keep their descriptor semantics and cannot be inserted into a blob column. See
CREATE TABLE for creating a new
table from a blob v2 query.
On Spark 3.5+, MERGE INTO and UPDATE deep-copy blob v2 columns the same way: a blob
assigned from a source column (SET t.content = s.content) is copied from the source table,
and blobs the command does not touch are carried forward from the target. Additional
limitations apply:
- Assigning a binary literal to a blob v2 column through
UPDATEorMERGEis rejected at analysis; useINSERTfor direct binary writes. - Dynamic partition overwrite is not supported.
- On Spark 3.4,
MERGEandUPDATEare not supported for blob v2 tables.
Writing Large String Data¶
If you created a table with the arrow.large_var_char property (see CREATE TABLE),
subsequent writes will automatically use LargeVarCharVector. No additional configuration is needed:
// Table was created with: 'content.arrow.large_var_char' = 'true'
// Subsequent writes automatically use large string encoding
String longContent = String.join("", Collections.nCopies(100000, "Long content..."));
List<Row> data = new ArrayList<>();
for (int i = 10; i < 20; i++) {
data.add(RowFactory.create(i, "Article " + i, longContent));
}
Dataset<Row> df = spark.createDataFrame(data, schema);
df.writeTo("articles").append();
Write Options¶
These options control how data is written to Lance datasets. They can be set using the .option() method when writing data.
| Option | Type | Default | Description |
|---|---|---|---|
write_mode |
String | append |
Write mode: append to add to existing data, overwrite to replace existing data. |
max_row_per_file |
Integer | - | Maximum number of rows per Lance file. |
max_rows_per_group |
Integer | - | Maximum number of rows per row group within a file. |
max_bytes_per_file |
Long | - | Maximum size in bytes per Lance file. |
file_format_version |
String | - | Lance file format version: LEGACY or STABLE. |
batch_size |
Integer | 512 |
Number of rows per batch during writing. |
use_queued_write_buffer |
Boolean | false |
Use pipelined write buffer for improved throughput. |
queue_depth |
Integer | 8 |
Queue depth for pipelined writes (only used when use_queued_write_buffer=true). |
use_large_var_types |
Boolean | false |
Use 64-bit offset vectors for all string/binary columns to avoid 2GB batch limit. See Large Var Types. |
blob_pack_file_size_threshold |
Long | 1073741824 (1 GiB) |
Maximum size in bytes for blob v2 pack (.blob) sidecar files. When a pack file reaches this size, a new one is started. |
Example: Controlling File Size¶
Example: Using Stable Storage Format¶
Example: Using Pipelined Writes¶
Large Var Types¶
By default, Arrow uses 32-bit offset vectors (VarCharVector / VarBinaryVector) for string and binary columns, which limits the total data buffer to 2GB per batch. When writing rows with very large values (e.g., documents, images, serialized objects), a single batch can exceed this limit and fail with OversizedAllocationException.
Setting use_large_var_types to true switches all string and binary columns to 64-bit offset vectors (LargeVarCharVector / LargeVarBinaryVector), removing the 2GB-per-batch ceiling. This applies to all string and binary columns in the schema, including those nested inside structs, arrays, and maps.
Note
This differs from the per-column arrow.large_var_char table property, which is set at table creation time and applies only to specific columns. The use_large_var_types write option applies to all string/binary columns for a single write operation.