Skip to content

INSERT INTO

Add data to existing Lance tables using SQL or DataFrames.

Basic Insert

INSERT INTO users VALUES
    (4, 'David', 'david@example.com', '2024-01-15 10:30:00'),
    (5, 'Eva', 'eva@example.com', '2024-01-15 11:45:00');
# Create new data
new_data = [
    (8, "Henry", "henry@example.com"),
    (9, "Ivy", "ivy@example.com")
]
new_df = spark.createDataFrame(new_data, ["id", "name", "email"])

# Append to existing table
new_df.writeTo("users").append()

# Alternative: use traditional write API with mode
new_df.write.mode("append").saveAsTable("users")
// Create new data
val newData = Seq(
    (8, "Henry", "henry@example.com"),
    (9, "Ivy", "ivy@example.com")
)
val newDF = newData.toDF("id", "name", "email")

// Append to existing table
newDF.writeTo("users").append()

// Alternative: use traditional write API with mode
newDF.write.mode("append").saveAsTable("users")
// Create new data
List<Row> newData = Arrays.asList(
    RowFactory.create(8L, "Henry", "henry@example.com"),
    RowFactory.create(9L, "Ivy", "ivy@example.com")
);
Dataset<Row> newDF = spark.createDataFrame(newData, schema);

// Append to existing table
newDF.writeTo("users").append();

// Alternative: use traditional write API with mode
newDF.write().mode("append").saveAsTable("users");

Insert with Column Specification

INSERT INTO users (id, name, email) VALUES
    (6, 'Frank', 'frank@example.com'),
    (7, 'Grace', 'grace@example.com');

Insert from SELECT

INSERT INTO users
SELECT user_id as id, username as name, email_address as email, signup_date as created_at
FROM staging.user_signups
WHERE signup_date >= '2024-01-01';

Insert with Complex Data Types

INSERT INTO events VALUES (
    1001,
    123,
    'page_view',
    array('web', 'desktop'),
    struct('web_app', 1, '2024-01-15 12:00:00'),
    '2024-01-15 12:00:00'
);

Writing Vector Data

If you created a table with the arrow.fixed-size-list.size property (see CREATE TABLE), subsequent writes will automatically use FixedSizeList. No additional configuration is needed:

-- Insert vector data (example with small vectors for clarity)
INSERT INTO embeddings_table VALUES
    (10, 'new text', array(0.1, 0.2, 0.3, ...)); -- 128 float values
# Table was created with: 'embeddings.arrow.fixed-size-list.size' = '128'
# Subsequent writes automatically use FixedSizeList encoding

# Plain schema WITHOUT metadata - it just works!
data = [(i, [float(j) for j in range(128)]) for i in range(10, 20)]
df = spark.createDataFrame(data, ["id", "embeddings"])

df.writeTo("embeddings_table").append()
// Table was created with: 'embeddings.arrow.fixed-size-list.size' = '128'
// Subsequent writes automatically use FixedSizeList encoding

val data = (10 until 20).map { i =>
  (i, Array.fill(128)(Random.nextFloat()))
}
val df = data.toDF("id", "embeddings")

df.writeTo("embeddings_table").append()
// Table was created with: 'embeddings.arrow.fixed-size-list.size' = '128'
// Subsequent writes automatically use FixedSizeList encoding

List<Row> rows = new ArrayList<>();
for (int i = 10; i < 20; i++) {
    float[] vector = new float[128];
    for (int j = 0; j < 128; j++) {
        vector[j] = random.nextFloat();
    }
    rows.add(RowFactory.create(i, vector));
}
Dataset<Row> df = spark.createDataFrame(rows, schema);

df.writeTo("embeddings_table").append();

Writing Blob Data

If you created a table with the lance.encoding property (see CREATE TABLE), subsequent writes will automatically use blob encoding. No additional configuration is needed:

-- Insert blob data (example with binary literals)
INSERT INTO documents VALUES
    (3, 'Document 3', X'48656C6C6F576F726C64');
# Table was created with: 'content.lance.encoding' = 'blob'
# Subsequent writes automatically use blob encoding

# Plain schema WITHOUT metadata - it just works!
data = [(i, f"Document {i}", bytearray(b"Large content..." * 10000)) for i in range(10, 20)]
df = spark.createDataFrame(data, ["id", "title", "content"])

df.writeTo("documents").append()
// Table was created with: 'content.lance.encoding' = 'blob'
// Subsequent writes automatically use blob encoding

val data = (10 until 20).map { i =>
  (i, s"Document $i", Array.fill[Byte](100000)(0x42))
}
val df = data.toDF("id", "title", "content")

df.writeTo("documents").append()
// Table was created with: 'content.lance.encoding' = 'blob'
// Subsequent writes automatically use blob encoding

List<Row> data = new ArrayList<>();
for (int i = 10; i < 20; i++) {
    byte[] content = new byte[100000];
    Arrays.fill(content, (byte) 0x42);
    data.add(RowFactory.create(i, "Document " + i, content));
}
Dataset<Row> df = spark.createDataFrame(data, schema);

df.writeTo("documents").append();

Copying blob v2 columns

Blob v2 columns read as descriptor structs. Copy them with a direct select:

INSERT INTO documents_copy SELECT id, content FROM documents;

Only direct column references (or simple aliases) from a single Lance source table are copied this way; ORDER BY and LIMIT on the SELECT are supported. writeTo().append() and .overwrite() work the same way. Transformed values (e.g. content.size), DISTINCT, GROUP BY, UNION, ordering by the blob column itself, and joins of multiple blob sources keep their descriptor semantics and cannot be inserted into a blob column. See CREATE TABLE for creating a new table from a blob v2 query.

On Spark 3.5+, MERGE INTO and UPDATE deep-copy blob v2 columns the same way: a blob assigned from a source column (SET t.content = s.content) is copied from the source table, and blobs the command does not touch are carried forward from the target. Additional limitations apply:

  • Assigning a binary literal to a blob v2 column through UPDATE or MERGE is rejected at analysis; use INSERT for direct binary writes.
  • Dynamic partition overwrite is not supported.
  • On Spark 3.4, MERGE and UPDATE are not supported for blob v2 tables.

Writing Large String Data

If you created a table with the arrow.large_var_char property (see CREATE TABLE), subsequent writes will automatically use LargeVarCharVector. No additional configuration is needed:

-- Insert large string data
INSERT INTO articles VALUES
    (3, 'Article 3', 'Very long article content...');
# Table was created with: 'content.arrow.large_var_char' = 'true'
# Subsequent writes automatically use large string encoding

data = [(i, f"Article {i}", "Long content..." * 100000) for i in range(10, 20)]
df = spark.createDataFrame(data, ["id", "title", "content"])

df.writeTo("articles").append()
// Table was created with: 'content.arrow.large_var_char' = 'true'
// Subsequent writes automatically use large string encoding

val data = (10 until 20).map { i =>
  (i, s"Article $i", "Long content..." * 100000)
}
val df = data.toDF("id", "title", "content")

df.writeTo("articles").append()
// Table was created with: 'content.arrow.large_var_char' = 'true'
// Subsequent writes automatically use large string encoding

String longContent = String.join("", Collections.nCopies(100000, "Long content..."));
List<Row> data = new ArrayList<>();
for (int i = 10; i < 20; i++) {
    data.add(RowFactory.create(i, "Article " + i, longContent));
}
Dataset<Row> df = spark.createDataFrame(data, schema);

df.writeTo("articles").append();

Write Options

These options control how data is written to Lance datasets. They can be set using the .option() method when writing data.

Option Type Default Description
write_mode String append Write mode: append to add to existing data, overwrite to replace existing data.
max_row_per_file Integer - Maximum number of rows per Lance file.
max_rows_per_group Integer - Maximum number of rows per row group within a file.
max_bytes_per_file Long - Maximum size in bytes per Lance file.
file_format_version String - Lance file format version: LEGACY or STABLE.
batch_size Integer 512 Number of rows per batch during writing.
use_queued_write_buffer Boolean false Use pipelined write buffer for improved throughput.
queue_depth Integer 8 Queue depth for pipelined writes (only used when use_queued_write_buffer=true).
use_large_var_types Boolean false Use 64-bit offset vectors for all string/binary columns to avoid 2GB batch limit. See Large Var Types.
blob_pack_file_size_threshold Long 1073741824 (1 GiB) Maximum size in bytes for blob v2 pack (.blob) sidecar files. When a pack file reaches this size, a new one is started.

Example: Controlling File Size

df.write \
    .format("lance") \
    .option("max_row_per_file", "100000") \
    .option("max_rows_per_group", "10000") \
    .save("/path/to/output.lance")
df.write
    .format("lance")
    .option("max_row_per_file", "100000")
    .option("max_rows_per_group", "10000")
    .save("/path/to/output.lance")
df.write()
    .format("lance")
    .option("max_row_per_file", "100000")
    .option("max_rows_per_group", "10000")
    .save("/path/to/output.lance");

Example: Using Stable Storage Format

df.write \
    .format("lance") \
    .option("file_format_version", "STABLE") \
    .save("/path/to/output.lance")
df.write
    .format("lance")
    .option("file_format_version", "STABLE")
    .save("/path/to/output.lance")

Example: Using Pipelined Writes

# Enable pipelined writes for improved throughput
df.write \
    .format("lance") \
    .option("use_queued_write_buffer", "true") \
    .option("queue_depth", "4") \
    .save("/path/to/output.lance")
// Enable pipelined writes for improved throughput
df.write
    .format("lance")
    .option("use_queued_write_buffer", "true")
    .option("queue_depth", "4")
    .save("/path/to/output.lance")

Large Var Types

By default, Arrow uses 32-bit offset vectors (VarCharVector / VarBinaryVector) for string and binary columns, which limits the total data buffer to 2GB per batch. When writing rows with very large values (e.g., documents, images, serialized objects), a single batch can exceed this limit and fail with OversizedAllocationException.

Setting use_large_var_types to true switches all string and binary columns to 64-bit offset vectors (LargeVarCharVector / LargeVarBinaryVector), removing the 2GB-per-batch ceiling. This applies to all string and binary columns in the schema, including those nested inside structs, arrays, and maps.

Note

This differs from the per-column arrow.large_var_char table property, which is set at table creation time and applies only to specific columns. The use_large_var_types write option applies to all string/binary columns for a single write operation.

df.write \
    .format("lance") \
    .option("use_large_var_types", "true") \
    .save("/path/to/output.lance")
df.write
    .format("lance")
    .option("use_large_var_types", "true")
    .save("/path/to/output.lance")