Skip to content

Schema Format Specification

Overview

The schema describes the structure of a Lance table, including all fields, their data types, and metadata. Schemas use a logical type system where data types are represented as strings that map to Apache Arrow data types. Each field in the schema has a unique identifier (field ID) that enables robust schema evolution and version tracking.

Note

Logical types are currently being simplified through discussion #5864. Proposed changes include consolidating encoding-specific variants (e.g., large_string and string, large_binary and binary) into single logical types with runtime optimization. Additionally, #5817 proposes adding string_view and binary_view types. This document describes the current implementation.

Data Types

Lance supports a comprehensive set of data types that map to Apache Arrow types. Data types are represented as strings in the schema and can be grouped into several categories.

Primitive Types

Logical Type Arrow Type Description
null Null Null type (no values)
bool Boolean Boolean (true/false)
int8 Int8 Signed 8-bit integer
uint8 UInt8 Unsigned 8-bit integer
int16 Int16 Signed 16-bit integer
uint16 UInt16 Unsigned 16-bit integer
int32 Int32 Signed 32-bit integer
uint32 UInt32 Unsigned 32-bit integer
int64 Int64 Signed 64-bit integer
uint64 UInt64 Unsigned 64-bit integer

Floating Point Types

Logical Type Arrow Type Description
halffloat Float16 IEEE 754 half-precision floating point (16-bit)
float Float32 IEEE 754 single-precision floating point (32-bit)
double Float64 IEEE 754 double-precision floating point (64-bit)

String and Binary Types

Logical Type Arrow Type Description
string Utf8 Variable-length UTF-8 encoded string
binary Binary Variable-length binary data
large_string LargeUtf8 Variable-length UTF-8 string (supports large offsets)
large_binary LargeBinary Variable-length binary data (supports large offsets)

Decimal Types

Decimal types support arbitrary-precision numeric values. The format is: decimal:<bit_width>:<precision>:<scale>

Logical Type Arrow Type Precision Example
decimal:128:P:S Decimal128 Up to 38 digits decimal:128:10:2 (10 total digits, 2 after decimal)
decimal:256:P:S Decimal256 Up to 76 digits decimal:256:20:5
  • Precision (P): Total number of digits (1-38 for Decimal128, up to 76 for Decimal256)
  • Scale (S): Number of digits after the decimal point (0 ≤ S ≤ P)

Date and Time Types

Logical Type Arrow Type Description
date32:day Date32 Date (days since epoch)
date64:ms Date64 Date (milliseconds since epoch)
time32:s Time32 Time (seconds since midnight)
time32:ms Time32 Time (milliseconds since midnight)
time64:us Time64 Time (microseconds since midnight)
time64:ns Time64 Time (nanoseconds since midnight)
duration:s Duration Duration (seconds)
duration:ms Duration Duration (milliseconds)
duration:us Duration Duration (microseconds)
duration:ns Duration Duration (nanoseconds)

Timestamp Types

Timestamp types represent a point in time and may include timezone information. Format: timestamp:<unit>:<timezone>

  • Unit: s (seconds), ms (milliseconds), us (microseconds), ns (nanoseconds)
  • Timezone: IANA timezone string (e.g., UTC, America/New_York) or - for no timezone

Examples: - timestamp:us:UTC - Microsecond precision timestamp in UTC - timestamp:ms:America/New_York - Millisecond precision timestamp in America/New_York timezone - timestamp:ns:- - Nanosecond precision timestamp with no timezone

Complex Types

Struct Type

A struct is a container for named fields with heterogeneous types.

Logical Type Arrow Type Description
struct Struct Composite type containing multiple named fields

Struct fields are represented as child fields in the schema.

Example schema with a struct:

Field {
    name: "address"
    type: "struct"
    children: [
        Field { name: "street", type: "string" },
        Field { name: "city", type: "string" },
        Field { name: "zip", type: "int32" }
    ]
}

List Types

Lists represent variable-length arrays of a single type.

Logical Type Arrow Type Description
list List Variable-length list of values
list.struct List(Struct) Variable-length list of struct values
large_list LargeList Variable-length list (supports large offsets)
large_list.struct LargeList(Struct) Variable-length list of struct values (large offsets)

The element type is specified as a child field.

Fixed-Size List Types

Fixed-size lists have a predetermined size known at schema definition time. Format: fixed_size_list:<element_type>:<size>

Logical Type Description Example
fixed_size_list:float:128 Fixed-size list of 128 floats Vector embeddings (128-dimensional)
fixed_size_list:int32:10 Fixed-size list of 10 integers

Special extension types: - fixed_size_list:lance.bfloat16:256 - Fixed-size list of bfloat16 values

Fixed-Size Binary Type

Fixed-size binary data with a predetermined size in bytes. Format: fixed_size_binary:<size>

Logical Type Description Example
fixed_size_binary:16 Fixed-size binary of 16 bytes MD5 hash
fixed_size_binary:32 Fixed-size binary of 32 bytes SHA-256 hash

Dictionary Type

Dictionary-encoded data with separate keys and values. Format: dict:<value_type>:<key_type>:<ordered>

  • Value type: The type of dictionary values
  • Key type: The type used for dictionary indices (typically int8, int16, or int32)
  • Ordered: Boolean indicating if dictionary values are sorted (currently not fully supported)

Example: dict:string:int16:false - Dictionary-encoded strings with int16 keys

Map Type

Key-value pairs stored in a structured format.

Logical Type Arrow Type Description
map Map Key-value pairs (currently supports unordered keys only)

Maps have key and value types specified as child fields.

Extension Types

Lance supports custom extension types that provide semantic meaning on top of Arrow types.

Blob Type

Represents large binary data stored externally.

Logical Type Description
blob Large binary data with external storage reference
json JSON-encoded data stored as binary

Blob types are stored as large binary data with metadata describing storage location.

BFloat16 Type

Brain float (bfloat16) is a 16-bit floating point format optimized for ML. Used within fixed-size lists: fixed_size_list:lance.bfloat16:SIZE

Field IDs

Field IDs are unique integer identifiers assigned to each field in a schema. They are essential for robust schema evolution, as they allow fields to be renamed or reordered without breaking references.

Field ID Assignment

Initial assignment (depth-first order): When a table is created, field IDs are assigned to all fields in depth-first order, starting from 0.

Nested fields are linked via the parent_id field in the protobuf message. For example, if field "c" (id: 2) is a struct containing fields "x", "y", "z", those child fields will have parent_id: 2. Top-level fields have parent_id: -1.

Example with nested structure:

Field order: a, b, c.x, c.y, c.z, d

Assigned IDs with parent relationships:
- a: 0 (parent_id: -1)
- b: 1 (parent_id: -1)
- c: 2 (parent_id: -1, struct type)
- c.x: 3 (parent_id: 2)
- c.y: 4 (parent_id: 2)
- c.z: 5 (parent_id: 2)
- d: 6 (parent_id: -1)

Note: A parent_id of -1 indicates a top-level field. For nested fields, parent_id references the ID of the parent field. Child fields reference their parent via parent_id rather than being stored as separate "children" arrays in the protobuf message (though the Rust in-memory representation maintains a children vector for convenience).

New field assignment (incremental): When fields are added later (e.g., through schema evolution), they receive the next available ID incrementally. This preserves the history of field additions.

Field ID Properties

  • Immutable: Once assigned, a field's ID never changes
  • Unique: Each field within a table has a unique ID
  • Stable: IDs are preserved across schema evolution operations
  • Sparse: Field IDs may not form a contiguous sequence after schema evolution

Using Field IDs

When referencing fields internally within the format, use the field ids rather than field names or positions.

Field Metadata

Fields can carry additional metadata as key-value pairs to configure encoding, primary key behavior, and other properties.

Primary Key Metadata

Primary key configuration is handled by two protobuf fields in the Field message: - unenforced_primary_key (bool): Whether this field is part of the primary key - unenforced_primary_key_position (uint32): Position in primary key ordering (1-based for ordered, 0 for unordered)

For detailed discussion on primary key configuration, see Unenforced Primary Key in the table format overview.

Encoding Metadata

Column encoding configurations are specified with the lance-encoding: prefix. See File Format Encoding Specification for complete details on available encodings.

Arrow Extension Type Metadata

Custom Arrow extension types may have metadata under the ARROW:extension: namespace (e.g., ARROW:extension:name).

Schema Protobuf Definition

The schema is serialized using protobuf messages. Key messages include:

Field Message

message Field {
  enum Type {
    PARENT = 0;
    REPEATED = 1;
    LEAF = 2;
  }
  Type type = 1;

  // Fully qualified name.
  string name = 2;
  /// Field Id.
  ///
  /// See the comment in `DataFile.fields` for how field ids are assigned.
  int32 id = 3;
  /// Parent Field ID. If not set, this is a top-level column.
  int32 parent_id = 4;

  // Logical types, support parameterized Arrow Type.
  //
  // PARENT types will always have logical type "struct".
  //
  // REPEATED types may have logical types:
  // * "list"
  // * "large_list"
  // * "list.struct"
  // * "large_list.struct"
  // The final two are used if the list values are structs, and therefore the
  // field is both implicitly REPEATED and PARENT.
  //
  // LEAF types may have logical types:
  // * "null"
  // * "bool"
  // * "int8" / "uint8"
  // * "int16" / "uint16"
  // * "int32" / "uint32"
  // * "int64" / "uint64"
  // * "halffloat" / "float" / "double"
  // * "string" / "large_string"
  // * "binary" / "large_binary"
  // * "date32:day"
  // * "date64:ms"
  // * "decimal:128:{precision}:{scale}" / "decimal:256:{precision}:{scale}"
  // * "time:{unit}" / "timestamp:{unit}" / "duration:{unit}", where unit is
  // "s", "ms", "us", "ns"
  // * "dict:{value_type}:{index_type}:false"
  string logical_type = 5;
  // If this field is nullable.
  bool nullable = 6;

  // optional field metadata (e.g. extension type name/parameters)
  map<string, bytes> metadata = 10;  

  bool unenforced_primary_key = 12;

  // Position of this field in the primary key (1-based).
  // 0 means the field is part of the primary key but uses schema field id for ordering.
  // When set to a positive value, primary key fields are ordered by this position.
  uint32 unenforced_primary_key_position = 13;

  // DEPRECATED ----------------------------------------------------------------

  // Deprecated: Only used in V1 file format. V2 uses variable encodings defined
  // per page.
  //
  // The global encoding to use for this field.
  Encoding encoding = 7;

  // Deprecated: Only used in V1 file format. V2 dynamically chooses when to
  // do dictionary encoding and keeps the dictionary in the data files.
  //
  // The file offset for storing the dictionary value.
  // It is only valid if encoding is DICTIONARY.
  //
  // The logic type presents the value type of the column, i.e., string value.
  Dictionary dictionary = 8;

  // Deprecated: optional extension type name, use metadata field
  // ARROW:extension:name
  string extension_name = 9;

  // Field number 11 was previously `string storage_class`.
  // Keep it reserved so older manifests remain compatible while new writers
  // avoid reusing the slot.
  reserved 11;
  reserved "storage_class";

}

The Field message contains: - id: Unique field identifier (int32) - name: Field name (string) - type: Field type enum (PARENT, REPEATED, or LEAF) - logical_type: Logical type string representation (string) - e.g., "int64", "struct", "list" - nullable: Whether the field can be null (bool) - parent_id: Parent field ID for nested fields; -1 for top-level fields (int32) - metadata: Key-value pairs for additional configuration (map) - unenforced_primary_key: Whether this field is part of the primary key (bool) - unenforced_primary_key_position: Position in primary key ordering (uint32, 0 = unordered)

Schema Message

The complete schema is represented as a collection of top-level fields plus metadata.

Schema Evolution

Field IDs enable efficient schema evolution:

  • Add Column: Assign a new field ID and add to schema
  • Drop Column: Remove field from schema; its ID may be reused in some systems
  • Rename Column: Change field name; ID remains the same
  • Reorder Columns: Change field order in schema; IDs remain the same
  • Type Evolution: Data type can be changed. This might require rewriting the column in the data, depending on how the type was changed.

The use of field IDs ensures that data files can be correctly interpreted even as the schema changes over time.

Example Schemas

The examples below use a simplified representation of the field structure. In the actual protobuf format, type refers to the field type enum (PARENT/REPEATED/LEAF) and logical_type contains the data type string representation.

Simple Table

Field {
    id: 0
    name: "id"
    logical_type: "int64"
    nullable: false
    parent_id: -1
}
Field {
    id: 1
    name: "name"
    logical_type: "string"
    nullable: true
    parent_id: -1
}
Field {
    id: 2
    name: "created_at"
    logical_type: "timestamp:us:UTC"
    nullable: true
    parent_id: -1
}

Nested Structure

Field {
    id: 0
    name: "id"
    logical_type: "int64"
    nullable: false
    parent_id: -1  // Top-level field
}
Field {
    id: 1
    name: "user"
    logical_type: "struct"
    nullable: true
    parent_id: -1  // Top-level field
}
Field {
    id: 2
    name: "name"
    logical_type: "string"
    nullable: true
    parent_id: 1  // Nested under "user" struct (id: 1)
}
Field {
    id: 3
    name: "email"
    logical_type: "string"
    nullable: true
    parent_id: 1  // Nested under "user" struct (id: 1)
}
Field {
    id: 4
    name: "tags"
    logical_type: "list"
    nullable: true
    parent_id: -1  // Top-level field
}
Field {
    id: 5
    name: "item"
    logical_type: "string"
    nullable: true
    parent_id: 4  // Nested under "tags" list (id: 4)
}

With Vector Embeddings

Field {
    id: 0
    name: "id"
    logical_type: "int64"
    nullable: false
    parent_id: -1  // Top-level field
    unenforced_primary_key: true
    unenforced_primary_key_position: 1  // Ordered position in primary key
}
Field {
    id: 1
    name: "text"
    logical_type: "string"
    nullable: true
    parent_id: -1  // Top-level field
}
Field {
    id: 2
    name: "embedding"
    logical_type: "fixed_size_list:lance.bfloat16:384"
    nullable: true
    parent_id: -1  // Top-level field
}

Type Conversion Reference

When converting between logical types and Arrow types, Lance uses the following mappings:

Arrow Type Logical Type Format
Arrow::Null null
Arrow::Boolean bool
Arrow::Int8 to Int64 int8, int16, int32, int64
Arrow::UInt8 to UInt64 uint8, uint16, uint32, uint64
Arrow::Float16 halffloat
Arrow::Float32 float
Arrow::Float64 double
Arrow::Utf8 string
Arrow::LargeUtf8 large_string
Arrow::Binary binary
Arrow::LargeBinary large_binary
Arrow::Decimal128(p, s) decimal:128:p:s
Arrow::Decimal256(p, s) decimal:256:p:s
Arrow::Date32 date32:day
Arrow::Date64 date64:ms
Arrow::Time32(TimeUnit) time32:s, time32:ms
Arrow::Time64(TimeUnit) time64:us, time64:ns
Arrow::Timestamp(unit, tz) timestamp:unit:tz
Arrow::Duration(unit) duration:s, duration:ms, duration:us, duration:ns
Arrow::Struct struct
Arrow::List(Element) list or list.struct if element is Struct
Arrow::LargeList(Element) large_list or large_list.struct
Arrow::FixedSizeList(Element, Size) fixed_size_list:type:size
Arrow::FixedSizeBinary(Size) fixed_size_binary:size
Arrow::Dictionary(KeyType, ValueType) dict:value_type:key_type:false
Arrow::Map map