Data Types¶
Lance uses Apache Arrow as its in-memory data format. This guide covers the supported data types with a focus on array types, which are essential for vector embeddings and machine learning applications.
Arrow Type System¶
Lance supports the full Apache Arrow type system. When writing data through Python (PyArrow) or Rust (arrow-rs), the Arrow types are automatically mapped to Lance's internal representation.
Primitive Types¶
| Arrow Type | Description | Example Use Case |
|---|---|---|
Boolean |
True/false values | Flags, filters |
Int8, Int16, Int32, Int64 |
Signed integers | IDs, counts |
UInt8, UInt16, UInt32, UInt64 |
Unsigned integers | IDs, indices |
Float16, Float32, Float64 |
Floating point numbers | Measurements, scores |
Decimal128, Decimal256 |
Fixed-precision decimals | Financial data |
Date32, Date64 |
Date values | Birth dates, event dates |
Time32, Time64 |
Time values | Time of day |
Timestamp |
Date and time with timezone | Event timestamps |
Duration |
Time duration | Elapsed time |
String and Binary Types¶
| Arrow Type | Description | Example Use Case |
|---|---|---|
Utf8 |
Variable-length UTF-8 string | Text, names |
LargeUtf8 |
Large UTF-8 string (64-bit offsets) | Large documents |
Binary |
Variable-length binary data | Raw bytes |
LargeBinary |
Large binary data (64-bit offsets) | Large blobs |
FixedSizeBinary(n) |
Fixed-length binary data | UUIDs, hashes |
Blob Type for Large Binary Objects¶
Lance provides a specialized Blob type for efficiently storing and retrieving very large binary objects such as videos, images, audio files, or other multimedia content. Unlike regular binary columns, blobs are stored out-of-line and support lazy loading, which means you can read portions of the data without loading everything into memory.
To create a blob column, add the lance-encoding:blob metadata to a LargeBinary field:
import pyarrow as pa
import lance
# Define schema with a blob column for videos
schema = pa.schema([
pa.field("id", pa.int64()),
pa.field("filename", pa.utf8()),
pa.field("video", pa.large_binary(), metadata={"lance-encoding:blob": "true"}),
])
# Read video file
with open("sample_video.mp4", "rb") as f:
video_data = f.read()
# Create and write dataset
table = pa.table({
"id": [1],
"filename": ["sample_video.mp4"],
"video": [video_data],
}, schema=schema)
ds = lance.write_dataset(table, "./videos.lance", schema=schema)
To read blob data, use take_blobs() which returns file-like objects for lazy reading:
# Retrieve blob as a file-like object (lazy loading)
blobs = ds.take_blobs("video", ids=[0])
# Use with libraries that accept file-like objects
import av # pip install av
with av.open(blobs[0]) as container:
for frame in container.decode(video=0):
# Process video frames without loading entire video into memory
pass
For more details, see the Blob API Guide.
Array Types for Vector Embeddings¶
Lance provides excellent support for array types, which are critical for storing vector embeddings in AI/ML applications.
FixedSizeList - The Preferred Type for Vector Embeddings¶
FixedSizeList is the recommended type for storing fixed-dimensional vector embeddings. Each vector has the same number of dimensions, making it highly efficient for storage and computation.
import lance
import pyarrow as pa
import numpy as np
# Create a schema with a vector embedding column
# This defines a 128-dimensional float32 vector
schema = pa.schema([
pa.field("id", pa.int64()),
pa.field("text", pa.utf8()),
pa.field("vector", pa.list_(pa.float32(), 128)), # FixedSizeList of 128 floats
])
# Create sample data with embeddings
num_rows = 1000
vectors = np.random.rand(num_rows, 128).astype(np.float32)
table = pa.Table.from_pydict({
"id": list(range(num_rows)),
"text": [f"document_{i}" for i in range(num_rows)],
"vector": [v.tolist() for v in vectors],
}, schema=schema)
# Write to Lance format
ds = lance.write_dataset(table, "./embeddings.lance")
print(f"Created dataset with {ds.count_rows()} rows")
use arrow_array::{
ArrayRef, FixedSizeListArray, Float32Array, Int64Array, RecordBatch, StringArray,
};
use arrow_schema::{DataType, Field, Schema};
use lance::dataset::WriteParams;
use lance::Dataset;
use std::sync::Arc;
#[tokio::main]
async fn main() -> lance::Result<()> {
// Define schema with a 128-dimensional vector column
let schema = Arc::new(Schema::new(vec![
Field::new("id", DataType::Int64, false),
Field::new("text", DataType::Utf8, false),
Field::new(
"vector",
DataType::FixedSizeList(
Arc::new(Field::new("item", DataType::Float32, true)),
128,
),
false,
),
]));
// Create sample data
let ids = Int64Array::from(vec![0, 1, 2]);
let texts = StringArray::from(vec!["doc_0", "doc_1", "doc_2"]);
// Create vector embeddings (128-dimensional)
let values: Vec<f32> = (0..384).map(|i| i as f32 / 100.0).collect();
let values_array = Float32Array::from(values);
let vectors = FixedSizeListArray::try_new_from_values(values_array, 128)?;
let batch = RecordBatch::try_new(
schema.clone(),
vec![
Arc::new(ids) as ArrayRef,
Arc::new(texts) as ArrayRef,
Arc::new(vectors) as ArrayRef,
],
)?;
// Write to Lance
let dataset = Dataset::write(
vec![batch].into_iter().map(Ok),
"embeddings.lance",
WriteParams::default(),
)
.await?;
println!("Created dataset with {} rows", dataset.count_rows().await?);
Ok(())
}
Vector Search with Embeddings¶
Once you have vector embeddings stored in Lance, you can perform efficient vector similarity search:
import lance
import numpy as np
# Open the dataset
ds = lance.dataset("./embeddings.lance")
# Create a query vector (same dimension as stored vectors)
query_vector = np.random.rand(128).astype(np.float32).tolist()
# Perform vector search - find 10 nearest neighbors
results = ds.to_table(
nearest={
"column": "vector",
"q": query_vector,
"k": 10,
}
)
print(results.to_pandas())
For production workloads with large datasets, create a vector index for much faster search:
# Create an IVF-PQ index for fast approximate nearest neighbor search
ds.create_index(
"vector",
index_type="IVF_PQ",
num_partitions=256, # Number of IVF partitions
num_sub_vectors=16, # Number of PQ sub-vectors
)
# Search with the index (automatically used)
results = ds.to_table(
nearest={
"column": "vector",
"q": query_vector,
"k": 10,
"nprobes": 20, # Number of partitions to search
}
)
List and LargeList - Variable-Length Arrays¶
For variable-length arrays where each row may have a different number of elements, use List or LargeList:
import lance
import pyarrow as pa
# Schema with variable-length arrays
schema = pa.schema([
pa.field("id", pa.int64()),
pa.field("tags", pa.list_(pa.utf8())), # Variable number of string tags
pa.field("scores", pa.list_(pa.float32())), # Variable number of float scores
])
table = pa.Table.from_pydict({
"id": [1, 2, 3],
"tags": [["python", "ml"], ["rust"], ["data", "analytics", "ai"]],
"scores": [[0.9, 0.8], [0.95], [0.7, 0.85, 0.9]],
}, schema=schema)
ds = lance.write_dataset(table, "./variable_arrays.lance")
Nested and Complex Types¶
Struct Types¶
Store structured data with multiple named fields:
import lance
import pyarrow as pa
# Schema with nested struct
schema = pa.schema([
pa.field("id", pa.int64()),
pa.field("metadata", pa.struct([
pa.field("source", pa.utf8()),
pa.field("timestamp", pa.timestamp("us")),
pa.field("embedding_model", pa.utf8()),
])),
pa.field("vector", pa.list_(pa.float32(), 384)), # 384-dim embedding
])
table = pa.Table.from_pydict({
"id": [1, 2],
"metadata": [
{"source": "web", "timestamp": "2024-01-15T10:30:00", "embedding_model": "text-embedding-3-small"},
{"source": "api", "timestamp": "2024-01-15T11:45:00", "embedding_model": "text-embedding-3-small"},
],
"vector": [
[0.1] * 384,
[0.2] * 384,
],
}, schema=schema)
ds = lance.write_dataset(table, "./with_metadata.lance")
Map Types¶
Store key-value pairs with dynamic keys:
import lance
import pyarrow as pa
schema = pa.schema([
pa.field("id", pa.int64()),
pa.field("attributes", pa.map_(pa.utf8(), pa.utf8())),
])
table = pa.Table.from_pydict({
"id": [1, 2],
"attributes": [
[("color", "red"), ("size", "large")],
[("color", "blue"), ("material", "cotton")],
],
}, schema=schema)
ds = lance.write_dataset(table, "./with_maps.lance")
Data Type Mapping for Integrations¶
When integrating Lance with other systems (like Apache Flink, Spark, or Presto), the following type mappings apply:
| External Type | Lance/Arrow Type | Notes |
|---|---|---|
BOOLEAN |
Boolean |
|
TINYINT |
Int8 |
|
SMALLINT |
Int16 |
|
INT / INTEGER |
Int32 |
|
BIGINT |
Int64 |
|
FLOAT |
Float32 |
|
DOUBLE |
Float64 |
|
DECIMAL(p,s) |
Decimal128(p,s) |
|
STRING / VARCHAR |
Utf8 |
|
CHAR(n) |
Utf8 |
Fixed-width in source system; stored as variable-length Utf8 |
DATE |
Date32 |
|
TIME |
Time64 |
Microsecond precision |
TIMESTAMP |
Timestamp |
|
TIMESTAMP WITH LOCAL TIMEZONE |
Timestamp |
With timezone info |
BINARY / VARBINARY |
Binary |
|
BYTES |
Binary |
|
BLOB |
LargeBinary with lance-encoding:blob |
Large binary objects with lazy loading |
ARRAY<T> |
List(T) |
Variable-length array |
ARRAY<T>(n) |
FixedSizeList(T, n) |
Fixed-length array (vectors) |
ROW / STRUCT |
Struct |
Nested structure |
MAP<K,V> |
Map(K, V) |
Key-value pairs |
Vector Embeddings in Integrations¶
For vector embedding columns, use ARRAY<FLOAT>(n) or ARRAY<DOUBLE>(n) where n is the embedding dimension:
-- Example: Creating a table with vector embeddings in SQL-compatible systems
CREATE TABLE embeddings (
id BIGINT,
text STRING,
vector ARRAY<FLOAT>(384) -- 384-dimensional vector
);
This maps to Lance's FixedSizeList(Float32, 384) type, which is optimized for:
- Efficient columnar storage
- SIMD-accelerated distance computations
- Vector index creation and search
Best Practices for Vector Data¶
-
Use FixedSizeList for embeddings: Always use
FixedSizeList(not variable-lengthList) for vector embeddings to enable efficient storage and indexing. -
Choose appropriate precision:
Float32is the standard choice, balancing precision and storageFloat16orBFloat16can reduce storage by 50% with minimal accuracy loss-
Int8for quantized embeddings -
Align dimensions for SIMD: Vector dimensions divisible by 8 enable optimal SIMD acceleration. Common dimensions: 128, 256, 384, 512, 768, 1024, 1536.
-
Create indexes for large datasets: For datasets with more than ~10,000 vectors, create an ANN index for fast search:
-
Store metadata alongside vectors: Lance efficiently handles mixed workloads with both vector and scalar data:
See Also¶
- Vector Search Tutorial - Complete guide to vector search with Lance
- Blob API Guide - Storing and retrieving large binary objects (videos, images)
- Extension Arrays - Special array types for ML (BFloat16, images)
- Performance Guide - Optimization tips for large-scale deployments