SELECT¶
Query data from Lance tables using SQL or DataFrames.
Basic Queries¶
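Once a Lance table is registered in the Spark catalog, it can be queried like any other table. A minimal sketch, assuming a hypothetical table named `users`:

```sql
-- Read the whole table (table name is illustrative)
SELECT * FROM users;
```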
Select Specific Columns¶
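Because Lance stores data in a columnar layout, selecting only the columns you need avoids reading the rest from storage. A sketch, assuming a hypothetical `users` table with `id`, `name`, and `email` columns:

```sql
-- Only the referenced columns are scanned
SELECT id, name FROM users;
```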
Query with WHERE Clause¶
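Filter predicates can be applied in standard SQL. A sketch, assuming a hypothetical `users` table with an `age` column; with `pushDownFilters` enabled (the default, see Read Options below), the predicate is evaluated by the Lance reader during the scan rather than in Spark:

```sql
-- Predicate is pushed down to the Lance reader by default
SELECT id, name FROM users WHERE age > 30;
```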
Aggregate Queries¶
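Standard Spark SQL aggregates work over Lance tables. A sketch, assuming a hypothetical `users` table with `country` and `age` columns:

```sql
-- Group, count, and average as in any Spark SQL query
SELECT country, COUNT(*) AS user_count, AVG(age) AS avg_age
FROM users
GROUP BY country
ORDER BY user_count DESC;
```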
Join Queries¶
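Lance tables can be joined with each other or with any other Spark-readable source. A sketch, assuming hypothetical `users` and `orders` tables related by `users.id = orders.user_id`:

```sql
-- Join two tables and aggregate per user
SELECT u.name, SUM(o.amount) AS total_spent
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.name;
```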
Querying Blob Columns¶
When querying tables with blob columns, the blob data itself is not materialized by default. Instead, you can access blob metadata through virtual columns. For each blob column, Lance provides two virtual columns:
- `<column_name>__blob_pos` - The byte position of the blob in the blob file
- `<column_name>__blob_size` - The size of the blob in bytes
These virtual columns can be used for:
- Monitoring blob storage statistics
- Filtering rows by blob size
- Implementing custom blob retrieval logic
- Verifying successful blob writes
```python
# Read table with blob column
documents_df = spark.table("documents")

# Access blob metadata using virtual columns
blob_metadata = documents_df.select(
    "id",
    "title",
    "content__blob_pos",
    "content__blob_size"
)
blob_metadata.show()

# Filter by blob size
large_blobs = documents_df.filter("content__blob_size > 1000000")
large_blobs.select("id", "title", "content__blob_size").show()
```
```scala
// Read table with blob column
val documentsDF = spark.table("documents")

// Access blob metadata using virtual columns
val blobMetadata = documentsDF.select(
  "id",
  "title",
  "content__blob_pos",
  "content__blob_size"
)
blobMetadata.show()

// Filter by blob size
val largeBlobs = documentsDF.filter("content__blob_size > 1000000")
largeBlobs.select("id", "title", "content__blob_size").show()
```
```java
// Read table with blob column
Dataset<Row> documentsDF = spark.table("documents");

// Access blob metadata using virtual columns
Dataset<Row> blobMetadata = documentsDF.select(
    "id",
    "title",
    "content__blob_pos",
    "content__blob_size"
);
blobMetadata.show();

// Filter by blob size
Dataset<Row> largeBlobs = documentsDF.filter("content__blob_size > 1000000");
largeBlobs.select("id", "title", "content__blob_size").show();
```
Read Options¶
These options control how data is read from Lance datasets. They can be set using the .option() method when reading data.
| Option | Type | Default | Description |
|---|---|---|---|
| `batch_size` | Integer | 512 | Number of rows to read per batch during scanning. Larger values may improve throughput but increase memory usage. |
| `version` | Integer | Latest | Specific dataset version to read. If not specified, reads the latest version. |
| `block_size` | Integer | - | Block size in bytes for reading data. |
| `index_cache_size` | Integer | - | Size of the index cache in number of entries. |
| `metadata_cache_size` | Integer | - | Size of the metadata cache in number of entries. |
| `pushDownFilters` | Boolean | true | Whether to push down filter predicates to the Lance reader for optimized scanning. |
| `topN_push_down` | Boolean | true | Whether to push down TopN (ORDER BY ... LIMIT) operations to Lance for optimized sorting. |
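As a configuration sketch of how these options combine on a read: the option names come from the table above, while the `"lance"` format name and dataset path are illustrative assumptions, and the snippet requires the Lance Spark connector on the classpath to actually run.

```python
# Configuration sketch: chain read options before loading
df = (
    spark.read.format("lance")           # format name assumed for illustration
    .option("batch_size", "1024")        # larger scan batches
    .option("version", "5")              # read a pinned dataset version
    .option("pushDownFilters", "true")   # the default; shown for clarity
    .load("/data/users.lance")           # hypothetical dataset path
)
```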