HYBRID_SEARCH

Run vector search and full-text search together from Spark SQL, then rerank the combined results with reciprocal rank fusion.

Spark Extension Required

HYBRID_SEARCH requires the Lance Spark SQL extension to be enabled. See Spark SQL Extensions for configuration details.

Namespace Tables Required

HYBRID_SEARCH resolves the table argument through a Spark catalog and executes both side queries through the Lance namespace queryTable API. Use a Lance namespace catalog table such as lance.default.documents, not a raw Lance dataset path.

Named Arguments

Named arguments require Spark 3.5 or later. On Spark 3.4, use the positional form.

Basic Usage

HYBRID_SEARCH returns the selected table columns plus _distance, _score, and _relevance_score. Rows that only match one side have null for the other side's metric.

SELECT id, body, _distance, _score, _relevance_score
FROM HYBRID_SEARCH(
    table => 'lance.default.documents',
    query_vector => array(0.12, 0.34, 0.56, 0.78),
    query => 'vector database',
    vector_column => 'embedding',
    search_columns => array('body'),
    columns => array('id', 'body'),
    num_results => 10,
    candidates => 50,
    rrf_k => 60.0
)
ORDER BY _relevance_score DESC;

Positional Form

Use positional arguments for simple calls and Spark 3.4 compatibility.

SELECT *
FROM HYBRID_SEARCH('lance.default.documents', array(0.12, 0.34, 0.56), 'lance', 5);

Arguments

| Argument | Type | Required | Description |
| --- | --- | --- | --- |
| table | String | Yes | Catalog table name to search. |
| query_vector | Array numeric literal | Yes | Query vector. |
| query or search_query | String | Yes | Full-text query string. |
| vector_column | String | No | Vector column name. Lance defaults to vector when omitted. |
| search_columns | Array string literal | No | Text columns to search. When omitted, Lance uses the indexed columns configured for the FTS index. |
| num_results, limit, or k | Integer | No | Number of final reranked results. Defaults to 10. |
| candidates, num_candidates, or candidate_count | Integer | No | Number of rows to fetch from each side before reranking. Defaults to num_results + offset. Values below num_results + offset are raised to that minimum. |
| rrf_k | Float | No | Reciprocal rank fusion constant. Defaults to 60.0. |
| columns | Array string literal | No | Output table columns. _distance, _score, and _relevance_score are always included. Use array('*') or omit this argument for all table columns. |
| filter | String | No | SQL filter expression evaluated by Lance on both side queries. |
| offset | Integer | No | Number of reranked results to skip after fusion. Defaults to 0. |
| version | Long | No | Lance table version to search. |
| distance_type | String | No | Distance metric such as l2, cosine, or dot. |
| nprobes, ef, refine_factor | Integer | No | Vector index search tuning parameters. |
| lower_bound, upper_bound | Float | No | Distance bounds. |
| bypass_vector_index, fast_search, prefilter, with_row_id | Boolean | No | Lance query options. with_row_id adds _rowid to the output. |
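
The interaction between candidates, num_results, and offset can be easier to follow as code. The following is a minimal Python sketch of the rule described for the candidates argument, not the extension's actual implementation; the function name effective_candidates is hypothetical and exists only for illustration.

```python
from typing import Optional


def effective_candidates(num_results: int = 10,
                         offset: int = 0,
                         candidates: Optional[int] = None) -> int:
    """Illustrative only: rows fetched from each side before reranking.

    Mirrors the documented rule: the default is num_results + offset,
    and smaller explicit values are raised to that minimum.
    """
    minimum = num_results + offset
    if candidates is None:
        # Argument omitted: fall back to the documented default.
        return minimum
    # Explicit value: never fetch fewer rows than the final page needs.
    return max(candidates, minimum)
```

For example, under this sketch a call with num_results => 10, offset => 20, and candidates => 5 would fetch 30 rows per side, because 5 is below the minimum of 10 + 20.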

Reranking

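Hybrid search performs reciprocal rank fusion in Spark. Each row's relevance score is the sum of one term per side (vector and full-text) on which the row appears:

_relevance_score = sum(1.0 / (rank + rrf_k))

Here rank is the row's zero-based position in that side's result set, so the top hit on a side contributes 1.0 / rrf_k. candidates controls how many rows are fetched from each side before reranking.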
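
To make the formula concrete, here is a small Python sketch of reciprocal rank fusion with zero-based ranks. It is an illustration under stated assumptions, not the code that runs inside the Spark task: the function name rrf_fuse is hypothetical, each side is assumed to list a row id at most once, and ties are broken by first appearance, which the extension may handle differently.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def rrf_fuse(vector_ids: Sequence[Hashable],
             fts_ids: Sequence[Hashable],
             rrf_k: float = 60.0,
             num_results: int = 10,
             offset: int = 0) -> List[Tuple[Hashable, float]]:
    """Fuse two ranked id lists; returns (id, relevance_score) pairs."""
    scores = defaultdict(float)
    for side in (vector_ids, fts_ids):
        # enumerate() starts at 0, matching the zero-based ranks above.
        for rank, row_id in enumerate(side):
            scores[row_id] += 1.0 / (rank + rrf_k)
    # Highest fused score first; sorted() is stable, so ties keep
    # first-appearance order (an assumption of this sketch).
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Skip `offset` rows after fusion, then keep `num_results` rows.
    return ranked[offset:offset + num_results]


fused = rrf_fuse(["a", "b", "c"], ["b", "d"])
for row_id, score in fused:
    print(row_id, round(score, 6))
```

In this example, b appears on both sides (rank 1 in the vector list, rank 0 in the full-text list), so it receives 1/61 + 1/60 and ranks first, ahead of rows that appear on only one side.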

Output

The result includes the requested table columns plus nullable _distance and _score float columns and a non-null _relevance_score float column. If with_row_id => true, or if _rowid is listed in columns, the result also includes the Lance row id column _rowid.
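
Because _distance and _score are nullable, client code can use them to tell which side matched a row. The sketch below assumes each result row has already been converted to a plain Python dict (for example with PySpark's Row.asDict()); the helper name matched_sides is hypothetical.

```python
from typing import Mapping


def matched_sides(row: Mapping[str, object]) -> str:
    """Classify a result row by which side queries produced it."""
    has_vector = row.get("_distance") is not None  # vector side matched
    has_text = row.get("_score") is not None       # full-text side matched
    if has_vector and has_text:
        return "both"
    if has_vector:
        return "vector"
    if has_text:
        return "text"
    # Not expected for fused rows; kept so the helper is total.
    return "none"
```

A row such as {"id": 7, "_distance": 0.42, "_score": None, "_relevance_score": 0.0164} would be classified as "vector" by this helper; the values shown are made up for illustration.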

Execution

Spark plans HYBRID_SEARCH as a DataSource V2 batch read with one input partition. The partition reader issues one vector queryTable request and one full-text queryTable request through the Lance namespace API, merges the two result sets in Spark with reciprocal rank fusion, and returns the final rows. With a REST namespace, the REST server can handle the two side searches, while the final fusion currently happens in the Spark task.

Validation

The Docker integration suite covers HYBRID_SEARCH against the directory namespace and a REST namespace backed by a directory namespace. The Spark Search Docker GitHub Actions workflow runs both backends for pull requests.