DuckDB¶
Lance datasets can be queried in SQL with DuckDB, an in-process OLAP relational database. Using DuckDB means you can write complex SQL queries (that may not yet be supported in Lance), without needing to move your data out of Lance.
Note
This integration is done via a DuckDB extension, whose source code and latest documentation (via README.md) is available
here.
To ensure you see the most up-to-date examples and syntax, check out the repo and the
DuckDB extension
documentation page.
Installation¶
Python dependencies¶
- To use DuckDB's CLI, install it using the steps shown in their docs.
- To run the code in Python, install Lance, DuckDB and PyArrow as shown below.
Install the Lance extension in DuckDB¶
We're now ready to begin querying Lance using DuckDB! First, install the extension.
Update extensions
If you already have the extension installed locally, run the following command to update it to the latest version:
Examples¶
All examples below reuse a small dataset with three rows (duck, horse, dragon)
and a vector column with representative values. In the real world, you'd have
a high-dimensional array generated by an embedding model, and a much larger Lance dataset.
Write a DuckDB table as a Lance dataset¶
Use DuckDB's COPY ... TO ... to materialize query results as a Lance dataset.
Query a Lance dataset from DuckDB¶
Now that the Lance dataset is written, let's query it using SQL in DuckDB.
This returns:
┌─────────┬─────────┬─────────────────┐
│ animal │ noise │ vector │
│ varchar │ varchar │ float[] │
├─────────┼─────────┼─────────────────┤
│ duck │ quack │ [0.9, 0.7, 0.1] │
│ horse │ neigh │ [0.3, 0.1, 0.5] │
│ dragon │ roar │ [0.5, 0.2, 0.7] │
└─────────┴─────────┴─────────────────┘
Query S3 paths directly
To access object store URIs (such as s3://...), configure a TYPE LANCE secret.
Create a Lance dataset via CREATE TABLE (directory namespace)¶
When you ATTACH a directory as a Lance namespace, you can create new datasets
using CREATE TABLE or CREATE TABLE AS SELECT. The dataset is written to
<namespace_root>/<table_name>.lance.
import duckdb
duckdb.sql(
"""
ATTACH './lance_ns' AS lance_ns (TYPE LANCE);
CREATE TABLE lance_ns.main.duck_animals AS
SELECT *
FROM (
VALUES
('duck', 'quack', [0.9, 0.7, 0.1]::FLOAT[]),
('horse', 'neigh', [0.3, 0.1, 0.5]::FLOAT[]),
('dragon', 'roar', [0.5, 0.2, 0.7]::FLOAT[])
) AS t(animal, noise, vector);
"""
)
You can then query the namespace as follows:
Vector search¶
You can perform vector search on a column. This returns the _distance
(smaller is closer, so sort in ascending order for nearest neighbors). The example vector here is similar to the query "duck".
This returns:
┌─────────┬─────────┬─────────────────┐
│ animal │ noise │ vector │
│ varchar │ varchar │ float[] │
├─────────┼─────────┼─────────────────┤
│ duck │ quack │ [0.9, 0.7, 0.1] │
└─────────┴─────────┴─────────────────┘
Full-text search¶
Run keyword-based BM25 search as shown below. This returns a _score, which
is sorted in descending order to get the most relevant results.
This returns:
┌─────────┬─────────┬─────────────────┐
│ animal │ noise │ vector │
│ varchar │ varchar │ float[] │
├─────────┼─────────┼─────────────────┤
│ dragon │ roar │ [0.5, 0.2, 0.7] │
└─────────┴─────────┴─────────────────┘
Hybrid search¶
Hybrid search combines vector and FTS scores, returning a _hybrid_score in addition
to _distance / _score. To get the most relevant results, sort in descending order.
import duckdb
r4 = duckdb.sql(
"""
SELECT animal, noise, vector, _hybrid_score, _distance, _score
FROM lance_hybrid_search(
'./lance_duck.lance',
'vector',
[0.8, 0.7, 0.2]::FLOAT[],
'animal',
'the duck surprised the dragon',
k = 2,
prefilter = false,
alpha = 0.5,
oversample_factor = 4
)
ORDER BY _hybrid_score DESC;
"""
)
print(r4)
This returns:
┌─────────┬─────────┬─────────────────┐
│ animal │ noise │ vector │
│ varchar │ varchar │ float[] │
├─────────┼─────────┼─────────────────┤
│ duck │ quack │ [0.9, 0.7, 0.1] │
│ dragon │ roar │ [0.5, 0.2, 0.7] │
└─────────┴─────────┴─────────────────┘
Warning
DuckDB treats column as a keyword in some contexts. It's recommended to
use text_column / vector_column as column names for the Lance extension.
Source repo¶
Check out the lance-duckdb project
for the latest source code, and go through README.md for the latest API docs.
Additional pages are listed below.
Full SQL reference¶
sql.md lists the current SQL surface supported by this extension. It's recommended to refer to this page for the most up-to-date information.
Cloud storage reference¶
cloud.md lists the current supported backends that allow you to access data on various cloud providers.
- S3 / S3-compatible:
s3://...(also acceptss3a://...ands3n://..., normalized tos3://...) - Google Cloud Storage:
gs://... - Azure Blob Storage:
az://... - Alibaba Cloud OSS:
oss://... - Hugging Face Hub (OpenDAL):
hf://...