Skip to content

DuckDB

Lance datasets can be queried in SQL with DuckDB, an in-process OLAP relational database. Using DuckDB means you can write complex SQL queries (that may not yet be supported in Lance), without needing to move your data out of Lance.

Note

This integration is done via a DuckDB extension, whose source code and latest documentation (via README.md) is available here. To ensure you see the most up-to-date examples and syntax, check out the repo and the DuckDB extension documentation page.

Installation

Python dependencies

  • To use DuckDB's CLI, install it using the steps shown in their docs.
  • To run the code in Python, install Lance, DuckDB and PyArrow as shown below.
pip install pylance duckdb pyarrow

Install the Lance extension in DuckDB

We're now ready to begin querying Lance using DuckDB! First, install the extension.

INSTALL lance FROM community;
LOAD lance;
import duckdb

duckdb.sql(
    """
    INSTALL lance FROM community;
    LOAD lance;
    """
)
Update extensions

If you already have the extension installed locally, run the following command to update it to the latest version:

UPDATE EXTENSIONS;

Examples

All examples below reuse a small dataset with three rows (duck, horse, dragon) and a vector column with representative values. In the real world, you'd have a high-dimensional array generated by an embedding model, and a much larger Lance dataset.

Write a DuckDB table as a Lance dataset

Use DuckDB's COPY ... TO ... to materialize query results as a Lance dataset.

COPY (
  SELECT *
  FROM (
    VALUES
      ('duck', 'quack', [0.9, 0.7, 0.1]::FLOAT[]),
      ('horse', 'neigh', [0.3, 0.1, 0.5]::FLOAT[]),
      ('dragon', 'roar', [0.5, 0.2, 0.7]::FLOAT[])
  ) AS t(animal, noise, vector)
) TO './lance_duck.lance' (FORMAT lance, mode 'overwrite');
import duckdb

duckdb.sql(
    """
    COPY (
      SELECT *
      FROM (
        VALUES
          ('duck', 'quack', [0.9, 0.7, 0.1]::FLOAT[]),
          ('horse', 'neigh', [0.3, 0.1, 0.5]::FLOAT[]),
          ('dragon', 'roar', [0.5, 0.2, 0.7]::FLOAT[])
      ) AS t(animal, noise, vector)
    ) TO './lance_duck.lance' (FORMAT lance, mode 'overwrite');
    """
)

Query a Lance dataset from DuckDB

Now that the Lance dataset is written, let's query it using SQL in DuckDB.

SELECT *
  FROM './lance_duck.lance'
  LIMIT 5;
import duckdb

r1 = duckdb.sql(
    """
    SELECT *
      FROM './lance_duck.lance'
      LIMIT 5;
    """
)
print(r1)

This returns:

┌─────────┬─────────┬─────────────────┐
│ animal  │  noise  │     vector      │
│ varchar │ varchar │     float[]     │
├─────────┼─────────┼─────────────────┤
│ duck    │ quack   │ [0.9, 0.7, 0.1] │
│ horse   │ neigh   │ [0.3, 0.1, 0.5] │
│ dragon  │ roar    │ [0.5, 0.2, 0.7] │
└─────────┴─────────┴─────────────────┘
Query S3 paths directly

To access object store URIs (such as s3://...), configure a TYPE LANCE secret.

CREATE SECRET (
  TYPE LANCE,
  PROVIDER credential_chain,
  SCOPE 's3://bucket/'
);

SELECT *
  FROM 's3://bucket/path/to/dataset.lance'
  LIMIT 5;

Create a Lance dataset via CREATE TABLE (directory namespace)

When you ATTACH a directory as a Lance namespace, you can create new datasets using CREATE TABLE or CREATE TABLE AS SELECT. The dataset is written to <namespace_root>/<table_name>.lance.

ATTACH './lance_ns' AS lance_ns (TYPE LANCE);

CREATE TABLE lance_ns.main.duck_animals AS
  SELECT *
  FROM (
    VALUES
      ('duck', 'quack', [0.9, 0.7, 0.1]::FLOAT[]),
      ('horse', 'neigh', [0.3, 0.1, 0.5]::FLOAT[]),
      ('dragon', 'roar', [0.5, 0.2, 0.7]::FLOAT[])
  ) AS t(animal, noise, vector);
import duckdb

duckdb.sql(
    """
    ATTACH './lance_ns' AS lance_ns (TYPE LANCE);

    CREATE TABLE lance_ns.main.duck_animals AS
      SELECT *
      FROM (
        VALUES
          ('duck', 'quack', [0.9, 0.7, 0.1]::FLOAT[]),
          ('horse', 'neigh', [0.3, 0.1, 0.5]::FLOAT[]),
          ('dragon', 'roar', [0.5, 0.2, 0.7]::FLOAT[])
      ) AS t(animal, noise, vector);
    """
)

You can then query the namespace as follows:

SELECT count(*) FROM lance_ns.main.duck_animals;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│      3       │
└──────────────┘

You can perform vector search on a column. This returns the _distance (smaller is closer, so sort in ascending order for nearest neighbors). The example vector here is similar to the query "duck".

SELECT animal, noise, vector, _distance
  FROM lance_vector_search(
    './lance_duck.lance',
    'vector',
    [0.8, 0.7, 0.2]::FLOAT[],
    k = 1,
    prefilter = true
  )
  ORDER BY _distance ASC;
import duckdb

r2 = duckdb.sql(
    """
    SELECT animal, noise, vector, _distance
      FROM lance_vector_search(
        './lance_duck.lance',
        'vector',
        [0.8, 0.7, 0.2]::FLOAT[],
        k = 1,
        prefilter = true
      )
      ORDER BY _distance ASC;
    """
)
print(r2)

This returns:

┌─────────┬─────────┬─────────────────┐
│ animal  │  noise  │     vector      │
│ varchar │ varchar │     float[]     │
├─────────┼─────────┼─────────────────┤
│ duck    │ quack   │ [0.9, 0.7, 0.1] │
└─────────┴─────────┴─────────────────┘

Run keyword-based BM25 search as shown below. This returns a _score, which is sorted in descending order to get the most relevant results.

SELECT animal, noise, vector, _score
  FROM lance_fts(
    './lance_duck.lance',
    'animal',
    'the brave knight faced the dragon',
    k = 1,
    prefilter = true
  )
  ORDER BY _score DESC;
import duckdb

r3 = duckdb.sql(
    """
    SELECT animal, noise, vector, _score
      FROM lance_fts(
        './lance_duck.lance',
        'animal',
        'the brave knight faced the dragon',
        k = 1,
        prefilter = true
      )
      ORDER BY _score DESC;
    """
)
print(r3)

This returns:

┌─────────┬─────────┬─────────────────┐
│ animal  │  noise  │     vector      │
│ varchar │ varchar │     float[]     │
├─────────┼─────────┼─────────────────┤
│ dragon  │ roar    │ [0.5, 0.2, 0.7] │
└─────────┴─────────┴─────────────────┘

Hybrid search combines vector and FTS scores, returning a _hybrid_score in addition to _distance / _score. To get the most relevant results, sort in descending order.

SELECT animal, noise, vector, _hybrid_score, _distance, _score
  FROM lance_hybrid_search(
    './lance_duck.lance',
    'vector',
    [0.8, 0.7, 0.2]::FLOAT[],
    'animal',
    'the duck surprised the dragon',
    k = 2,
    prefilter = false,
    alpha = 0.5,
    oversample_factor = 4
  )
  ORDER BY _hybrid_score DESC;
import duckdb

r4 = duckdb.sql(
    """
    SELECT animal, noise, vector, _hybrid_score, _distance, _score
      FROM lance_hybrid_search(
        './lance_duck.lance',
        'vector',
        [0.8, 0.7, 0.2]::FLOAT[],
        'animal',
        'the duck surprised the dragon',
        k = 2,
        prefilter = false,
        alpha = 0.5,
        oversample_factor = 4
      )
      ORDER BY _hybrid_score DESC;
    """
)
print(r4)

This returns:

┌─────────┬─────────┬─────────────────┐
│ animal  │  noise  │     vector      │
│ varchar │ varchar │     float[]     │
├─────────┼─────────┼─────────────────┤
│ duck    │ quack   │ [0.9, 0.7, 0.1] │
│ dragon  │ roar    │ [0.5, 0.2, 0.7] │
└─────────┴─────────┴─────────────────┘

Warning

DuckDB treats column as a keyword in some contexts. It's recommended to use text_column / vector_column as column names for the Lance extension.

Source repo

Check out the lance-duckdb project for the latest source code, and go through README.md for the latest API docs. Additional pages are listed below.

Full SQL reference

sql.md lists the current SQL surface supported by this extension. It's recommended to refer to this page for the most up-to-date information.

Cloud storage reference

cloud.md lists the current supported backends that allow you to access data on various cloud providers.

  • S3 / S3-compatible: s3://... (also accepts s3a://... and s3n://..., normalized to s3://...)
  • Google Cloud Storage: gs://...
  • Azure Blob Storage: az://...
  • Alibaba Cloud OSS: oss://...
  • Hugging Face Hub (OpenDAL): hf://...