# INSERT INTO
Add data to existing Lance tables using SQL or DataFrames.
## Basic Insert
```python
# Create new data
new_data = [
    (8, "Henry", "henry@example.com"),
    (9, "Ivy", "ivy@example.com")
]
new_df = spark.createDataFrame(new_data, ["id", "name", "email"])

# Append to existing table
new_df.writeTo("users").append()

# Alternative: use traditional write API with mode
new_df.write.mode("append").saveAsTable("users")
```
```scala
// Create new data
val newData = Seq(
  (8, "Henry", "henry@example.com"),
  (9, "Ivy", "ivy@example.com")
)
val newDF = newData.toDF("id", "name", "email")

// Append to existing table
newDF.writeTo("users").append()

// Alternative: use traditional write API with mode
newDF.write.mode("append").saveAsTable("users")
```
```java
// Create new data
List<Row> newData = Arrays.asList(
    RowFactory.create(8L, "Henry", "henry@example.com"),
    RowFactory.create(9L, "Ivy", "ivy@example.com")
);
Dataset<Row> newDF = spark.createDataFrame(newData, schema);

// Append to existing table
newDF.writeTo("users").append();

// Alternative: use traditional write API with mode
newDF.write().mode("append").saveAsTable("users");
```
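The same append can also be issued as SQL via `spark.sql`; this sketch mirrors the DataFrame examples above and assumes the same `users` table:

```python
# Same append expressed in SQL
spark.sql("""
    INSERT INTO users VALUES
      (8, 'Henry', 'henry@example.com'),
      (9, 'Ivy', 'ivy@example.com')
""")
```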
## Insert with Column Specification
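A minimal sketch of inserting into a subset of columns, assuming the `users` table from the examples above and a Spark version that supports column lists on `INSERT` (columns not listed are filled with their default, typically `NULL`):

```python
# Insert values only for id and name; email is left unset
spark.sql("""
    INSERT INTO users (id, name)
    VALUES (10, 'Jack')
""")
```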
## Insert from SELECT
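A sketch of populating the table from a query, assuming the `users` table from above and a hypothetical staging table `staged_users` with matching columns:

```python
# Copy matching rows from a staging table into users
spark.sql("""
    INSERT INTO users
    SELECT id, name, email
    FROM staged_users
    WHERE email IS NOT NULL
""")
```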
## Insert with Complex Data Types
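A sketch of appending rows with array and struct columns, assuming a hypothetical `events` table that was created with this schema:

```python
from pyspark.sql.types import (StructType, StructField, LongType,
                               ArrayType, StringType)

# Hypothetical schema with an array column and a nested struct column
schema = StructType([
    StructField("id", LongType()),
    StructField("tags", ArrayType(StringType())),
    StructField("location", StructType([
        StructField("city", StringType()),
        StructField("country", StringType()),
    ])),
])
data = [(1, ["ml", "search"], ("Oslo", "NO"))]
df = spark.createDataFrame(data, schema)
df.writeTo("events").append()
```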
### Writing Vector Data
If you created a table with the arrow.fixed-size-list.size property (see CREATE TABLE),
subsequent writes will automatically use FixedSizeList. No additional configuration is needed:
```python
# Table was created with: 'embeddings.arrow.fixed-size-list.size' = '128'
# Subsequent writes automatically use FixedSizeList encoding
# Plain schema WITHOUT metadata - it just works!
data = [(i, [float(j) for j in range(128)]) for i in range(10, 20)]
df = spark.createDataFrame(data, ["id", "embeddings"])
df.writeTo("embeddings_table").append()
```
```java
// Table was created with: 'embeddings.arrow.fixed-size-list.size' = '128'
// Subsequent writes automatically use FixedSizeList encoding
Random random = new Random();
List<Row> rows = new ArrayList<>();
for (int i = 10; i < 20; i++) {
    float[] vector = new float[128];
    for (int j = 0; j < 128; j++) {
        vector[j] = random.nextFloat();
    }
    rows.add(RowFactory.create(i, vector));
}
Dataset<Row> df = spark.createDataFrame(rows, schema);
df.writeTo("embeddings_table").append();
```
### Writing Blob Data
If you created a table with the lance.encoding property (see CREATE TABLE),
subsequent writes will automatically use blob encoding. No additional configuration is needed:
```python
# Table was created with: 'content.lance.encoding' = 'blob'
# Subsequent writes automatically use blob encoding
# Plain schema WITHOUT metadata - it just works!
data = [(i, f"Document {i}", bytearray(b"Large content..." * 10000)) for i in range(10, 20)]
df = spark.createDataFrame(data, ["id", "title", "content"])
df.writeTo("documents").append()
```
```java
// Table was created with: 'content.lance.encoding' = 'blob'
// Subsequent writes automatically use blob encoding
List<Row> data = new ArrayList<>();
for (int i = 10; i < 20; i++) {
    byte[] content = new byte[100000];
    Arrays.fill(content, (byte) 0x42);
    data.add(RowFactory.create(i, "Document " + i, content));
}
Dataset<Row> df = spark.createDataFrame(data, schema);
df.writeTo("documents").append();
```
### Writing Large String Data
If you created a table with the arrow.large_var_char property (see CREATE TABLE),
subsequent writes will automatically use LargeVarCharVector. No additional configuration is needed:
```java
// Table was created with: 'content.arrow.large_var_char' = 'true'
// Subsequent writes automatically use large string encoding
String longContent = String.join("", Collections.nCopies(100000, "Long content..."));
List<Row> data = new ArrayList<>();
for (int i = 10; i < 20; i++) {
    data.add(RowFactory.create(i, "Article " + i, longContent));
}
Dataset<Row> df = spark.createDataFrame(data, schema);
df.writeTo("articles").append();
```
## Write Options
These options control how data is written to Lance datasets. They can be set using the .option() method when writing data.
| Option | Type | Default | Description |
|---|---|---|---|
| `write_mode` | String | `append` | Write mode: `append` to add to existing data, `overwrite` to replace existing data. |
| `max_row_per_file` | Integer | - | Maximum number of rows per Lance file. |
| `max_rows_per_group` | Integer | - | Maximum number of rows per row group within a file. |
| `max_bytes_per_file` | Long | - | Maximum size in bytes per Lance file. |
| `file_format_version` | String | - | Lance file format version: `LEGACY` or `STABLE`. |
| `batch_size` | Integer | 512 | Number of rows per batch during writing. |
| `use_queued_write_buffer` | Boolean | false | Use pipelined write buffer for improved throughput. |
| `queue_depth` | Integer | 8 | Queue depth for pipelined writes (only used when `use_queued_write_buffer=true`). |
| `use_large_var_types` | Boolean | false | Use 64-bit offset vectors for all string/binary columns to avoid the 2GB batch limit. See Large Var Types. |
| `blob_pack_file_size_threshold` | Long | 1073741824 (1 GiB) | Maximum size in bytes for blob v2 pack (`.blob`) sidecar files. When a pack file reaches this size, a new one is started. |
### Example: Controlling File Size
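A sketch of capping output file size on write, assuming a DataFrame `df` and the `users` table from the examples above; the row and byte limits are arbitrary illustrative values:

```python
# Cap each Lance file at ~1M rows or 128 MiB, whichever is reached first
(df.write
   .mode("append")
   .option("max_row_per_file", "1000000")
   .option("max_bytes_per_file", str(128 * 1024 * 1024))
   .saveAsTable("users"))
```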
### Example: Using Stable Storage Format
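A sketch of pinning the file format version, assuming a DataFrame `df` and the `users` table from the examples above; the value casing follows the options table:

```python
# Write using the stable Lance file format version
(df.write
   .mode("append")
   .option("file_format_version", "STABLE")
   .saveAsTable("users"))
```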
### Example: Using Pipelined Writes
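A sketch of enabling the pipelined write buffer, assuming a DataFrame `df` and the `users` table from the examples above; the queue depth of 16 is an illustrative value (the default is 8, and the option is ignored unless the buffer is enabled):

```python
# Enable the pipelined write buffer for higher throughput
(df.write
   .mode("append")
   .option("use_queued_write_buffer", "true")
   .option("queue_depth", "16")
   .saveAsTable("users"))
```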
## Large Var Types
By default, Arrow uses 32-bit offset vectors (VarCharVector / VarBinaryVector) for string and binary columns, which limits the total data buffer to 2GB per batch. When writing rows with very large values (e.g., documents, images, serialized objects), a single batch can exceed this limit and fail with OversizedAllocationException.
Setting use_large_var_types to true switches all string and binary columns to 64-bit offset vectors (LargeVarCharVector / LargeVarBinaryVector), removing the 2GB-per-batch ceiling. This applies to all string and binary columns in the schema, including those nested inside structs, arrays, and maps.
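A back-of-envelope check for when the 32-bit limit bites, using the default `batch_size` of 512 from the options table; the per-document size is an illustrative assumption:

```python
# With 32-bit offsets, a batch's concatenated string/binary data
# must fit within 2^31 - 1 bytes
OFFSET_LIMIT = 2**31 - 1

batch_size = 512                 # default rows per batch (see Write Options)
doc_size = 10 * 1024 * 1024      # ~10 MiB per document (assumption)

batch_bytes = batch_size * doc_size
needs_large_offsets = batch_bytes > OFFSET_LIMIT
print(needs_large_offsets)       # True: set use_large_var_types
```

At ~10 MiB per row, a single 512-row batch carries about 5 GiB of string data, well past the 2GB ceiling, so `use_large_var_types` (or a smaller `batch_size`) is required.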
> **Note:** This differs from the per-column `arrow.large_var_char` table property, which is set at table creation time and applies only to specific columns. The `use_large_var_types` write option applies to all string/binary columns for a single write operation.