SELECT

Query data from Lance tables using SQL or DataFrames.

Basic Queries

SQL:

SELECT * FROM users;

Python:

# Load table as DataFrame
users_df = spark.table("users")

# Use DataFrame operations
filtered_users = users_df.filter("age > 25").select("name", "email")
filtered_users.show()

Scala:

// Load table as DataFrame
val usersDF = spark.table("users")

// Use DataFrame operations
val filteredUsers = usersDF.filter("age > 25").select("name", "email")
filteredUsers.show()

Java:

// Load table as DataFrame
Dataset<Row> usersDF = spark.table("users");

// Use DataFrame operations
Dataset<Row> filteredUsers = usersDF.filter("age > 25").select("name", "email");
filteredUsers.show();

Select Specific Columns

SELECT id, name, email FROM users;
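
The same projection can be written with the DataFrame API. A minimal PySpark sketch, assuming the users table from the examples above:

# Select specific columns with the DataFrame API
users_df = spark.table("users")
users_df.select("id", "name", "email").show()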

Query with WHERE Clause

SELECT * FROM users WHERE age > 25;

Aggregate Queries

SELECT department, COUNT(*) as employee_count
FROM users
GROUP BY department;
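
The same aggregation can be expressed with the DataFrame API. A minimal PySpark sketch, assuming the users table has a department column as in the SQL above:

# Count rows per department with the DataFrame API
users_df = spark.table("users")
employee_counts = (
    users_df.groupBy("department")
    .count()
    .withColumnRenamed("count", "employee_count")
)
employee_counts.show()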

Join Queries

SELECT u.name, p.title
FROM users u
JOIN projects p ON u.id = p.user_id;
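
Joins can also be expressed with the DataFrame API. A minimal PySpark sketch, assuming a projects table with a user_id column as in the SQL above:

# Join users to projects with the DataFrame API
users_df = spark.table("users")
projects_df = spark.table("projects")

joined = users_df.join(projects_df, users_df["id"] == projects_df["user_id"])
joined.select(users_df["name"], projects_df["title"]).show()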

Querying Blob Columns

When querying tables with blob columns, the blob data itself is not materialized by default. Instead, you can access blob metadata through virtual columns. For each blob column, Lance provides two virtual columns:

  • <column_name>__blob_pos - The byte position of the blob in the blob file
  • <column_name>__blob_size - The size of the blob in bytes

These virtual columns can be used for:

  • Monitoring blob storage statistics
  • Filtering rows by blob size
  • Implementing custom blob retrieval logic
  • Verifying successful blob writes

SQL:

SELECT id, title, content__blob_pos, content__blob_size
FROM documents
WHERE id = 1;

Python:

# Read table with blob column
documents_df = spark.table("documents")

# Access blob metadata using virtual columns
blob_metadata = documents_df.select(
    "id",
    "title",
    "content__blob_pos",
    "content__blob_size"
)
blob_metadata.show()

# Filter by blob size
large_blobs = documents_df.filter("content__blob_size > 1000000")
large_blobs.select("id", "title", "content__blob_size").show()

Scala:

// Read table with blob column
val documentsDF = spark.table("documents")

// Access blob metadata using virtual columns
val blobMetadata = documentsDF.select(
  "id",
  "title",
  "content__blob_pos",
  "content__blob_size"
)
blobMetadata.show()

// Filter by blob size
val largeBlobs = documentsDF.filter("content__blob_size > 1000000")
largeBlobs.select("id", "title", "content__blob_size").show()

Java:

// Read table with blob column
Dataset<Row> documentsDF = spark.table("documents");

// Access blob metadata using virtual columns
Dataset<Row> blobMetadata = documentsDF.select(
    "id",
    "title",
    "content__blob_pos",
    "content__blob_size"
);
blobMetadata.show();

// Filter by blob size
Dataset<Row> largeBlobs = documentsDF.filter("content__blob_size > 1000000");
largeBlobs.select("id", "title", "content__blob_size").show();
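
The size virtual column can also be aggregated to monitor blob storage statistics, one of the uses listed above. A minimal PySpark sketch, assuming the documents table from the examples:

from pyspark.sql import functions as F

# Aggregate blob sizes to monitor storage usage
documents_df = spark.table("documents")
documents_df.agg(
    F.count("content__blob_size").alias("blob_count"),
    F.sum("content__blob_size").alias("total_blob_bytes"),
    F.max("content__blob_size").alias("largest_blob_bytes")
).show()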

Read Options

These options control how data is read from Lance datasets. They can be set using the .option() method when reading data.

  • batch_size (Integer, default 512): Number of rows to read per batch during scanning. Larger values may improve throughput but increase memory usage.
  • version (Integer, default latest): Specific dataset version to read. If not specified, reads the latest version.
  • block_size (Integer): Block size in bytes for reading data.
  • index_cache_size (Integer): Size of the index cache, in number of entries.
  • metadata_cache_size (Integer): Size of the metadata cache, in number of entries.
  • pushDownFilters (Boolean, default true): Whether to push down filter predicates to the Lance reader for optimized scanning.
  • topN_push_down (Boolean, default true): Whether to push down TopN (ORDER BY ... LIMIT) operations to Lance for optimized sorting.

Example: Reading with Options

SQL:

-- Read a specific dataset version using time travel syntax
SELECT * FROM lance.`/path/to/dataset.lance` VERSION AS OF 10;

Python:

# Reading with options
df = spark.read \
    .format("lance") \
    .option("batch_size", "1024") \
    .option("version", "5") \
    .load("/path/to/dataset.lance")

Scala:

// Reading with options
val df = spark.read
    .format("lance")
    .option("batch_size", "1024")
    .option("version", "5")
    .load("/path/to/dataset.lance")

Java:

// Reading with options
Dataset<Row> df = spark.read()
    .format("lance")
    .option("batch_size", "1024")
    .option("version", "5")
    .load("/path/to/dataset.lance");

Example: Tuning Batch Size for Performance

Python:

# Larger batch size for better throughput on large scans
df = spark.read \
    .format("lance") \
    .option("batch_size", "4096") \
    .load("/path/to/dataset.lance")

Scala:

// Larger batch size for better throughput on large scans
val df = spark.read
    .format("lance")
    .option("batch_size", "4096")
    .load("/path/to/dataset.lance")
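
Example: Disabling Pushdown Optimizations

Filter and TopN pushdown are enabled by default. A sketch of turning them off, for example to compare scan behavior, using the option names from the table above:

# Disable filter and TopN pushdown
df = spark.read \
    .format("lance") \
    .option("pushDownFilters", "false") \
    .option("topN_push_down", "false") \
    .load("/path/to/dataset.lance")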