Data Types

Lance uses Apache Arrow as its in-memory data format. This guide covers the supported data types with a focus on array types, which are essential for vector embeddings and machine learning applications.

Arrow Type System

Lance supports the full Apache Arrow type system. When writing data through Python (PyArrow) or Rust (arrow-rs), the Arrow types are automatically mapped to Lance's internal representation.

Primitive Types

| Arrow Type | Description | Example Use Case |
|---|---|---|
| Boolean | True/false values | Flags, filters |
| Int8, Int16, Int32, Int64 | Signed integers | IDs, counts |
| UInt8, UInt16, UInt32, UInt64 | Unsigned integers | IDs, indices |
| Float16, Float32, Float64 | Floating-point numbers | Measurements, scores |
| Decimal128, Decimal256 | Fixed-precision decimals | Financial data |
| Date32, Date64 | Date values | Birth dates, event dates |
| Time32, Time64 | Time values | Time of day |
| Timestamp | Date and time, with optional timezone | Event timestamps |
| Duration | Time duration | Elapsed time |

String and Binary Types

| Arrow Type | Description | Example Use Case |
|---|---|---|
| Utf8 | Variable-length UTF-8 string | Text, names |
| LargeUtf8 | Large UTF-8 string (64-bit offsets) | Large documents |
| Binary | Variable-length binary data | Raw bytes |
| LargeBinary | Large binary data (64-bit offsets) | Large blobs |
| FixedSizeBinary(n) | Fixed-length binary data | UUIDs, hashes |
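
For example, a schema can mix any of these primitive, string, and binary types and be written to Lance directly. A minimal sketch (the path and column names are illustrative):

import datetime
import decimal

import lance
import pyarrow as pa

# Schema combining several of the types listed above
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("active", pa.bool_()),
    pa.field("price", pa.decimal128(10, 2)),
    pa.field("created", pa.timestamp("us")),
    pa.field("name", pa.utf8()),
    pa.field("checksum", pa.fixed_size_binary(16)),
])

table = pa.table({
    "id": [1],
    "active": [True],
    "price": [decimal.Decimal("19.99")],
    "created": [datetime.datetime(2024, 1, 15, 10, 30)],
    "name": ["example"],
    "checksum": [b"\x00" * 16],
}, schema=schema)

ds = lance.write_dataset(table, "./typed_example.lance")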

Blob Type for Large Binary Objects

Lance provides a specialized Blob type for efficiently storing and retrieving very large binary objects such as videos, images, audio files, or other multimedia content. Unlike regular binary columns, blobs are stored out-of-line and support lazy loading, which means you can read portions of the data without loading everything into memory.

To create a blob column, add the lance-encoding:blob metadata to a LargeBinary field:

import pyarrow as pa
import lance

# Define schema with a blob column for videos
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("filename", pa.utf8()),
    pa.field("video", pa.large_binary(), metadata={"lance-encoding:blob": "true"}),
])

# Read video file
with open("sample_video.mp4", "rb") as f:
    video_data = f.read()

# Create and write dataset
table = pa.table({
    "id": [1],
    "filename": ["sample_video.mp4"],
    "video": [video_data],
}, schema=schema)

ds = lance.write_dataset(table, "./videos.lance", schema=schema)

To read blob data, use take_blobs(), which returns file-like objects for lazy reading:

# Retrieve blob as a file-like object (lazy loading)
blobs = ds.take_blobs("video", ids=[0])

# Use with libraries that accept file-like objects
import av  # pip install av
with av.open(blobs[0]) as container:
    for frame in container.decode(video=0):
        # Process video frames without loading entire video into memory
        pass
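
Because the returned blobs are seekable, file-like objects, you can also read byte ranges directly. A small sketch, assuming standard file-object read/seek semantics:

# Read only the first 64 KiB, e.g. to inspect the container header,
# without materializing the whole video in memory
blob = blobs[0]
header = blob.read(64 * 1024)
blob.seek(0)  # rewind before handing the object to another library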

For more details, see the Blob API Guide.

Array Types for Vector Embeddings

Lance provides excellent support for array types, which are critical for storing vector embeddings in AI/ML applications.

FixedSizeList - The Preferred Type for Vector Embeddings

FixedSizeList is the recommended type for storing fixed-dimensional vector embeddings. Each vector has the same number of dimensions, making it highly efficient for storage and computation.

import lance
import pyarrow as pa
import numpy as np

# Create a schema with a vector embedding column
# This defines a 128-dimensional float32 vector
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("text", pa.utf8()),
    pa.field("vector", pa.list_(pa.float32(), 128)),  # FixedSizeList of 128 floats
])

# Create sample data with embeddings
num_rows = 1000
vectors = np.random.rand(num_rows, 128).astype(np.float32)

table = pa.Table.from_pydict({
    "id": list(range(num_rows)),
    "text": [f"document_{i}" for i in range(num_rows)],
    "vector": [v.tolist() for v in vectors],
}, schema=schema)

# Write to Lance format
ds = lance.write_dataset(table, "./embeddings.lance")
print(f"Created dataset with {ds.count_rows()} rows")

The same example in Rust with arrow-rs. One caveat: try_new_from_values is not an arrow-rs method; this sketch assumes the FixedSizeListArrayExt extension trait from Lance's lance-arrow crate:

use arrow_array::{
    ArrayRef, FixedSizeListArray, Float32Array, Int64Array, RecordBatch,
    RecordBatchIterator, StringArray,
};
use arrow_schema::{DataType, Field, Schema};
use lance::dataset::WriteParams;
use lance::Dataset;
use lance_arrow::FixedSizeListArrayExt;
use std::sync::Arc;

#[tokio::main]
async fn main() -> lance::Result<()> {
    // Define schema with a 128-dimensional vector column
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("text", DataType::Utf8, false),
        Field::new(
            "vector",
            DataType::FixedSizeList(
                Arc::new(Field::new("item", DataType::Float32, true)),
                128,
            ),
            false,
        ),
    ]));

    // Create sample data
    let ids = Int64Array::from(vec![0, 1, 2]);
    let texts = StringArray::from(vec!["doc_0", "doc_1", "doc_2"]);

    // Create vector embeddings: 3 rows x 128 dims = 384 values
    let values: Vec<f32> = (0..384).map(|i| i as f32 / 100.0).collect();
    let values_array = Float32Array::from(values);
    let vectors = FixedSizeListArray::try_new_from_values(values_array, 128)?;

    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(ids) as ArrayRef,
            Arc::new(texts) as ArrayRef,
            Arc::new(vectors) as ArrayRef,
        ],
    )?;

    // Write the batch to Lance via a RecordBatchReader
    let reader = RecordBatchIterator::new(vec![Ok(batch)], schema.clone());
    let dataset = Dataset::write(reader, "embeddings.lance", Some(WriteParams::default())).await?;

    println!("Created dataset with {} rows", dataset.count_rows().await?);
    Ok(())
}

Vector Search with Embeddings

Once you have vector embeddings stored in Lance, you can perform efficient vector similarity search:

import lance
import numpy as np

# Open the dataset
ds = lance.dataset("./embeddings.lance")

# Create a query vector (same dimension as stored vectors)
query_vector = np.random.rand(128).astype(np.float32).tolist()

# Perform vector search - find 10 nearest neighbors
results = ds.to_table(
    nearest={
        "column": "vector",
        "q": query_vector,
        "k": 10,
    }
)
print(results.to_pandas())

For production workloads with large datasets, create a vector index for much faster search:

# Create an IVF-PQ index for fast approximate nearest neighbor search
ds.create_index(
    "vector",
    index_type="IVF_PQ",
    num_partitions=256,  # Number of IVF partitions
    num_sub_vectors=16,  # Number of PQ sub-vectors
)

# Search with the index (automatically used)
results = ds.to_table(
    nearest={
        "column": "vector",
        "q": query_vector,
        "k": 10,
        "nprobes": 20,  # Number of partitions to search
    }
)

List and LargeList - Variable-Length Arrays

For variable-length arrays where each row may have a different number of elements, use List or LargeList:

import lance
import pyarrow as pa

# Schema with variable-length arrays
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("tags", pa.list_(pa.utf8())),      # Variable number of string tags
    pa.field("scores", pa.list_(pa.float32())), # Variable number of float scores
])

table = pa.Table.from_pydict({
    "id": [1, 2, 3],
    "tags": [["python", "ml"], ["rust"], ["data", "analytics", "ai"]],
    "scores": [[0.9, 0.8], [0.95], [0.7, 0.85, 0.9]],
}, schema=schema)

ds = lance.write_dataset(table, "./variable_arrays.lance")

Nested and Complex Types

Struct Types

Store structured data with multiple named fields:

import lance
import pyarrow as pa

# Schema with nested struct
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("metadata", pa.struct([
        pa.field("source", pa.utf8()),
        pa.field("timestamp", pa.timestamp("us")),
        pa.field("embedding_model", pa.utf8()),
    ])),
    pa.field("vector", pa.list_(pa.float32(), 384)),  # 384-dim embedding
])

table = pa.Table.from_pydict({
    "id": [1, 2],
    "metadata": [
        {"source": "web", "timestamp": "2024-01-15T10:30:00", "embedding_model": "text-embedding-3-small"},
        {"source": "api", "timestamp": "2024-01-15T11:45:00", "embedding_model": "text-embedding-3-small"},
    ],
    "vector": [
        [0.1] * 384,
        [0.2] * 384,
    ],
}, schema=schema)

ds = lance.write_dataset(table, "./with_metadata.lance")

Map Types

Store key-value pairs with dynamic keys:

import lance
import pyarrow as pa

schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("attributes", pa.map_(pa.utf8(), pa.utf8())),
])

table = pa.Table.from_pydict({
    "id": [1, 2],
    "attributes": [
        [("color", "red"), ("size", "large")],
        [("color", "blue"), ("material", "cotton")],
    ],
}, schema=schema)

ds = lance.write_dataset(table, "./with_maps.lance")

Data Type Mapping for Integrations

When integrating Lance with other systems (like Apache Flink, Spark, or Presto), the following type mappings apply:

| External Type | Lance/Arrow Type | Notes |
|---|---|---|
| BOOLEAN | Boolean | |
| TINYINT | Int8 | |
| SMALLINT | Int16 | |
| INT / INTEGER | Int32 | |
| BIGINT | Int64 | |
| FLOAT | Float32 | |
| DOUBLE | Float64 | |
| DECIMAL(p,s) | Decimal128(p,s) | |
| STRING / VARCHAR | Utf8 | |
| CHAR(n) | Utf8 | Fixed-width in source system; stored as variable-length Utf8 |
| DATE | Date32 | |
| TIME | Time64 | Microsecond precision |
| TIMESTAMP | Timestamp | |
| TIMESTAMP WITH LOCAL TIMEZONE | Timestamp | With timezone info |
| BINARY / VARBINARY | Binary | |
| BYTES | Binary | |
| BLOB | LargeBinary with lance-encoding:blob | Large binary objects with lazy loading |
| ARRAY&lt;T&gt; | List(T) | Variable-length array |
| ARRAY&lt;T&gt;(n) | FixedSizeList(T, n) | Fixed-length array (vectors) |
| ROW / STRUCT | Struct | Nested structure |
| MAP&lt;K,V&gt; | Map(K, V) | Key-value pairs |
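
To make the right-hand column concrete, here is how a few of these Lance/Arrow types are spelled in PyArrow (an illustrative sketch, not an exhaustive mapping):

import pyarrow as pa

decimal_t = pa.decimal128(10, 2)         # DECIMAL(10,2)
time_t = pa.time64("us")                 # TIME (microsecond precision)
ts_t = pa.timestamp("us", tz="UTC")      # TIMESTAMP WITH LOCAL TIMEZONE
tags_t = pa.list_(pa.utf8())             # ARRAY<STRING>
vector_t = pa.list_(pa.float32(), 384)   # ARRAY<FLOAT>(384) -> FixedSizeList
attrs_t = pa.map_(pa.utf8(), pa.utf8())  # MAP<STRING, STRING>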

Vector Embeddings in Integrations

For vector embedding columns, use ARRAY<FLOAT>(n) or ARRAY<DOUBLE>(n) where n is the embedding dimension:

-- Example: Creating a table with vector embeddings in SQL-compatible systems
CREATE TABLE embeddings (
    id BIGINT,
    text STRING,
    vector ARRAY<FLOAT>(384)  -- 384-dimensional vector
);

This maps to Lance's FixedSizeList(Float32, 384) type, which is optimized for:

  • Efficient columnar storage
  • SIMD-accelerated distance computations
  • Vector index creation and search

Best Practices for Vector Data

  1. Use FixedSizeList for embeddings: Always use FixedSizeList (not variable-length List) for vector embeddings to enable efficient storage and indexing.

  2. Choose appropriate precision (a float16 sketch follows this list):

     • Float32 is the standard choice, balancing precision and storage
     • Float16 or BFloat16 can reduce storage by 50% with minimal accuracy loss
     • Int8 works for quantized embeddings

  3. Align dimensions for SIMD: Vector dimensions divisible by 8 enable optimal SIMD acceleration. Common dimensions: 128, 256, 384, 512, 768, 1024, 1536.

  4. Create indexes for large datasets: For datasets with more than ~10,000 vectors, create an ANN index for fast search:

    # IVF_PQ is recommended for most use cases
    ds.create_index("vector", index_type="IVF_PQ", num_partitions=256, num_sub_vectors=16)
    
    # IVF_HNSW_SQ offers better recall at the cost of more memory
    ds.create_index("vector", index_type="IVF_HNSW_SQ", num_partitions=256)
    
  5. Store metadata alongside vectors: Lance efficiently handles mixed workloads with both vector and scalar data:

    # Combine vector search with metadata filtering
    results = ds.to_table(
        filter="category = 'electronics'",
        nearest={"column": "vector", "q": query, "k": 10}
    )
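
For the half-precision option in item 2, here is a minimal sketch (the path and column names are illustrative) of building a Float16 vector column. Note that PyArrow expects NumPy float16 values rather than Python floats:

import lance
import numpy as np
import pyarrow as pa

dim = 128
vecs = np.random.rand(1000, dim).astype(np.float16)  # half precision

# Flatten the matrix and wrap it as FixedSizeList<halffloat>[dim]
values = pa.array(vecs.ravel(), type=pa.float16())
vector_col = pa.FixedSizeListArray.from_arrays(values, dim)

table = pa.table({
    "id": pa.array(range(1000), type=pa.int64()),
    "vector": vector_col,
})
ds = lance.write_dataset(table, "./fp16_embeddings.lance")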
    

See Also