Object Store Configuration¶

Lance supports object stores such as AWS S3 (and compatible stores), Azure Blob Store, and Google Cloud Storage. Which object store to use is determined by the URI scheme of the dataset path. For example, s3://bucket/path will use S3, az://bucket/path will use Azure, and gs://bucket/path will use GCS.

These object stores take additional configuration objects. There are two ways to specify these configurations: by setting environment variables or by passing them to the storage_options parameter of lance.dataset and lance.write_dataset. So for example, to globally set a higher timeout, you would run in your shell:

export TIMEOUT=60s

If you only want to set the timeout for a single dataset, you can pass it as a storage option:

import lance
ds = lance.dataset("s3://path", storage_options={"timeout": "60s"})

General Configuration¶

These options apply to all object stores.

Key	Description
`allow_http`	Allow non-TLS, i.e. non-HTTPS connections. Default, `False`.
`download_retry_count`	Number of times to retry a download. Default, `3`. This limit is applied when the HTTP request succeeds but the response is not fully downloaded, typically due to a violation of `request_timeout`.
`allow_invalid_certificates`	Skip certificate validation on https connections. Default, `False`. Warning: This is insecure and should only be used for testing.
`connect_timeout`	Timeout for only the connect phase of a Client. Default, `5s`.
`request_timeout`	Timeout for the entire request, from connection until the response body has finished. Default, `30s`.
`user_agent`	User agent string to use in requests.
`proxy_url`	URL of a proxy server to use for requests. Default, `None`.
`proxy_ca_certificate`	PEM-formatted CA certificate for proxy connections
`proxy_excludes`	List of hosts that bypass proxy. This is a comma separated list of domains and IP masks. Any subdomain of the provided domain will be bypassed. For example, `example.com, 192.168.1.0/24` would bypass `https://api.example.com`, `https://www.example.com`, and any IP in the range `192.168.1.0/24`.
`client_max_retries`	Number of times for the object store client to retry the request. Default, `3`.
`client_retry_timeout`	Timeout for the object store client to retry the request in seconds. Default, `180`.

S3 Configuration¶

S3 (and S3-compatible stores) have additional configuration options that configure authorization and S3-specific features (such as server-side encryption).

AWS credentials can be set in the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN. Alternatively, they can be passed as parameters to the storage_options parameter:

import lance
ds = lance.dataset(
    "s3://bucket/path",
    storage_options={
        "access_key_id": "my-access-key",
        "secret_access_key": "my-secret-key",
        "session_token": "my-session-token",
    }
)

If you are using AWS SSO, you can specify the AWS_PROFILE environment variable. It cannot be specified in the storage_options parameter.

The following keys can be used as both environment variables or keys in the storage_options parameter:

Key	Description
`aws_region` / `region`	The AWS region the bucket is in. This can be automatically detected when using AWS S3, but must be specified for S3-compatible stores.
`aws_access_key_id` / `access_key_id`	The AWS access key ID to use.
`aws_secret_access_key` / `secret_access_key`	The AWS secret access key to use.
`aws_session_token` / `session_token`	The AWS session token to use.
`aws_endpoint` / `endpoint`	The endpoint to use for S3-compatible stores.
`aws_virtual_hosted_style_request` / `virtual_hosted_style_request`	Whether to use virtual hosted-style requests, where bucket name is part of the endpoint. Meant to be used with `aws_endpoint`. Default, `False`.
`aws_s3_express` / `s3_express`	Whether to use S3 Express One Zone endpoints. Default, `False`. See more details below.
`aws_server_side_encryption`	The server-side encryption algorithm to use. Must be one of `"AES256"`, `"aws:kms"`, or `"aws:kms:dsse"`. Default, `None`.
`aws_sse_kms_key_id`	The KMS key ID to use for server-side encryption. If set, `aws_server_side_encryption` must be `"aws:kms"` or `"aws:kms:dsse"`.
`aws_sse_bucket_key_enabled`	Whether to use bucket keys for server-side encryption.

S3-compatible stores¶

Lance can also connect to S3-compatible stores, such as MinIO. To do so, you must specify both region and endpoint:

import lance
ds = lance.dataset(
    "s3://bucket/path",
    storage_options={
        "region": "us-east-1",
        "endpoint": "http://minio:9000",
    }
)

This can also be done with the AWS_ENDPOINT and AWS_DEFAULT_REGION environment variables.

S3 Express (Directory Bucket)¶

Lance supports S3 Express One Zone buckets, a.k.a. S3 directory buckets. S3 Express buckets only support connecting from an EC2 instance within the same region. By default, Lance automatically recognize the --x-s3 suffix of an express bucket, there is no special configuration needed.

In case of an access point or private link that hides the bucket name, you can configure express bucket access explicitly through storage option s3_express.

import lance
ds = lance.dataset(
    "s3://my-bucket--use1-az4--x-s3/path/imagenet.lance",
    storage_options={
        "region": "us-east-1",
        "s3_express": "true",
    }
)

Google Cloud Storage Configuration¶

GCS credentials are configured by setting the GOOGLE_SERVICE_ACCOUNT environment variable to the path of a JSON file containing the service account credentials. Alternatively, you can pass the path to the JSON file in the storage_options

import lance
ds = lance.dataset(
    "gs://my-bucket/my-dataset",
    storage_options={
        "service_account": "path/to/service-account.json",
    }
)

Note

By default, GCS uses HTTP/1 for communication, as opposed to HTTP/2. This improves maximum throughput significantly. However, if you wish to use HTTP/2 for some reason, you can set the environment variable HTTP1_ONLY to false.

The following keys can be used as both environment variables or keys in the storage_options parameter:

Key	Description
`google_service_account` / `service_account`	Path to the service account JSON file.
`google_service_account_key` / `service_account_key`	The serialized service account key.
`google_application_credentials` / `application_credentials`	Path to the application credentials.

Azure Blob Storage Configuration¶

Azure Blob Storage credentials can be configured by setting the AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY environment variables. Alternatively, you can pass the account name and key in the storage_options parameter:

import lance
ds = lance.dataset(
    "az://my-container/my-dataset",
    storage_options={
        "account_name": "some-account",
        "account_key": "some-key",
    }
)

These keys can be used as both environment variables or keys in the storage_options parameter:

Key	Description
`azure_storage_account_name` / `account_name`	The name of the azure storage account.
`azure_storage_account_key` / `account_key`	The serialized service account key.
`azure_client_id` / `client_id`	Service principal client id for authorizing requests.
`azure_client_secret` / `client_secret`	Service principal client secret for authorizing requests.
`azure_tenant_id` / `tenant_id`	Tenant id used in oauth flows.
`azure_storage_sas_key` / `azure_storage_sas_token` / `sas_key` / `sas_token`	Shared access signature. The signature is expected to be percent-encoded, much like they are provided in the azure storage explorer or azure portal.
`azure_storage_token` / `bearer_token` / `token`	Bearer token.
`azure_storage_use_emulator` / `object_store_use_emulator` / `use_emulator`	Use object store with azurite storage emulator.
`azure_endpoint` / `endpoint`	Override the endpoint used to communicate with blob storage.
`azure_use_fabric_endpoint` / `use_fabric_endpoint`	Use object store with url scheme account.dfs.fabric.microsoft.com.
`azure_msi_endpoint` / `azure_identity_endpoint` / `identity_endpoint` / `msi_endpoint`	Endpoint to request a imds managed identity token.
`azure_object_id` / `object_id`	Object id for use with managed identity authentication.
`azure_msi_resource_id` / `msi_resource_id`	Msi resource id for use with managed identity authentication.
`azure_federated_token_file` / `federated_token_file`	File containing token for Azure AD workload identity federation.
`azure_use_azure_cli` / `use_azure_cli`	Use azure cli for acquiring access token.
`azure_disable_tagging` / `disable_tagging`	Disables tagging objects. This can be desirable if not supported by the backing store.

AliCloud Object Storage Service Configuration¶

OSS credentials can be set in the environment variables OSS_ACCESS_KEY_ID, OSS_ACCESS_KEY_SECRET, OSS_REGION, and OSS_SECURITY_TOKEN. Alternatively, they can be passed as parameters to the storage_options parameter:

import lance
ds = lance.dataset(
    "oss://bucket/path",
    storage_options={
        "oss_region": "oss-region",
        "oss_endpoint": "oss-endpoint",
        "oss_access_key_id": "my-access-key",
        "oss_secret_access_key": "my-secret-key",
        "oss_security_token": "my-session-token",
    }
)

Key	Description
`oss_endpoint`	OSS endpoint. Required (for example, `https://oss-cn-hangzhou.aliyuncs.com`).
`oss_access_key_id`	Access key ID used for OSS authentication. Optional if credentials are provided by environment.
`oss_secret_access_key`	Access key secret used for OSS authentication. Optional if credentials are provided by environment.
`oss_region`	OSS region (for example, `cn-hangzhou`). Optional.
`oss_security_token`	Security token for temporary credentials (STS). Optional.

Volcengine TOS Configuration¶

TOS credentials can be set in the environment variables TOS_ACCESS_KEY_ID, TOS_SECRET_ACCESS_KEY, TOS_ENDPOINT, TOS_REGION, and TOS_SECURITY_TOKEN. Lance also accepts the corresponding VOLCENGINE_ environment variable prefix. Alternatively, credentials can be passed as parameters to the storage_options parameter; explicit storage_options override environment variables:

import lance
ds = lance.dataset(
    "tos://bucket/path",
    storage_options={
        "tos_endpoint": "https://tos-cn-beijing.volces.com",
        "tos_region": "cn-beijing",
        "tos_access_key_id": "my-access-key",
        "tos_secret_access_key": "my-secret-key",
        "tos_security_token": "my-session-token",
    }
)

Key	Description
`tos_endpoint`	TOS endpoint. Required (for example, `https://tos-cn-beijing.volces.com`).
`tos_region`	TOS signing region (for example, `cn-beijing`). Optional.
`tos_access_key_id`	Access key ID used for TOS authentication. Optional if credentials are provided by environment.
`tos_secret_access_key`	Secret access key used for TOS authentication. Optional if credentials are provided by environment.
`tos_security_token`	Security token for temporary credentials. Optional.

Tencent Cloud COS Configuration¶

COS (Cloud Object Storage) credentials can be set in environment variables prefixed with COS_ or TENCENTCLOUD_ (for example, COS_ENDPOINT, COS_SECRET_ID, COS_SECRET_KEY, TENCENTCLOUD_REGION, TENCENTCLOUD_SECURITY_TOKEN). Alternatively, credentials can be passed as parameters to the storage_options parameter; explicit storage_options override environment variables:

PythonRust

import lance
ds = lance.dataset(
    "cos://bucket/path",
    storage_options={
        "cos_endpoint": "https://cos.ap-guangzhou.myqcloud.com",
        "cos_secret_id": "my-secret-id",
        "cos_secret_key": "my-secret-key",
    }
)

In this Lance distribution, tencent is already part of the default features of the lance crate, so simply depending on lance is enough:

[dependencies]
lance = "*"

You only need to enable the tencent feature explicitly in the following cases:

You opted out of default features, e.g. lance = { version = "*", default-features = false, features = ["tencent", ...] }.
You depend on lance-io directly (without lance); tencent is not a default feature of lance-io: lance-io = { version = "*", features = ["tencent"] }.

Key	Description
`cos_endpoint`	COS endpoint. Required (for example, `https://cos.ap-guangzhou.myqcloud.com`). Can also be set via the `COS_ENDPOINT` environment variable.
`cos_secret_id`	Secret ID used for COS authentication. Optional if credentials are provided by environment.
`cos_secret_key`	Secret key used for COS authentication. Optional if credentials are provided by environment.
`cos_enable_versioning`	Whether to enable object versioning on the bucket. Optional.

Note

The OpenDAL CosConfig currently exposes a limited set of options. Additional settings such as the security token (TENCENTCLOUD_SECURITY_TOKEN) and region (TENCENTCLOUD_REGION) must be configured via environment variables.

GooseFS Configuration¶

GooseFS is a distributed caching filesystem. Lance accesses GooseFS through its Master gRPC service. The URL format is goosefs://host:port/path, where host:port is the GooseFS Master address (default port: 9200, may be omitted, e.g. goosefs://10.0.0.1/path) and /path is the filesystem path within GooseFS.

About the dataset path

/path is just an arbitrary directory inside GooseFS — Lance does not require the path to end with a .lance suffix. Any valid GooseFS directory works, for example:

goosefs://10.0.0.1:9200/data/my-dataset
goosefs://10.0.0.1:9200/data/my-dataset.lance
goosefs://10.0.0.1:9200/lance-test/lance-io

The .lance suffix used in the examples below is only a naming convention that makes it easy to recognize a Lance dataset directory at a glance; it has no special meaning to Lance itself. The only requirement is that the same path is used consistently for reads and writes of a given dataset.

PythonRustJava

import lance

ds = lance.dataset(
    "goosefs://10.0.0.1:9200/data/my-dataset.lance",
    storage_options={
        "goosefs_auth_type": "simple",
        "goosefs_auth_username": "lance",
    },
)

In this Lance distribution, goosefs is already part of the default features of the lance crate, so simply depending on lance is enough:

[dependencies]
lance = "*"

You only need to enable the goosefs feature explicitly in the following cases:

You opted out of default features, e.g. lance = { version = "*", default-features = false, features = ["goosefs", ...] }.
You depend on lance-io directly (without lance); goosefs is not a default feature of lance-io: lance-io = { version = "*", features = ["goosefs"] }.

Open the underlying lance_io::object_store::ObjectStore directly (mirrors the integration test in rust/lance-io/tests/goosefs_integration.rs):

use lance_io::object_store::ObjectStore;

let uri = "goosefs://10.0.0.1:9200/lance-test/lance-io";
let (store, path) = ObjectStore::from_uri(uri).await?;

// Read / write through the underlying `object_store::ObjectStore` API
store.inner.put(&path, (&b"hello"[..]).into()).await?;
let result = store.inner.get(&path).await?;
let bytes = result.bytes().await?;

Open a Lance dataset with custom storage options:

use std::collections::HashMap;
use lance::dataset::DatasetBuilder;

let mut storage_options = HashMap::new();
storage_options.insert("goosefs_master_addr".to_string(), "10.0.0.1:9200".to_string());
storage_options.insert("goosefs_auth_type".to_string(), "simple".to_string());
storage_options.insert("goosefs_auth_username".to_string(), "lance".to_string());

let dataset = DatasetBuilder::from_uri("goosefs://10.0.0.1:9200/data/my-dataset.lance")
    .with_storage_options(storage_options)
    .load()
    .await?;

Pass the GooseFS configuration through ReadOptions.setStorageOptions when opening the dataset:

import org.lance.Dataset;
import org.lance.ReadOptions;

import java.util.HashMap;
import java.util.Map;

Map<String, String> storageOptions = new HashMap<>();
storageOptions.put("goosefs_master_addr", "10.0.0.1:9200");
storageOptions.put("goosefs_auth_type", "simple");
storageOptions.put("goosefs_auth_username", "lance");

ReadOptions options = new ReadOptions.Builder()
    .setStorageOptions(storageOptions)
    .build();

try (Dataset dataset = Dataset.open()
        .uri("goosefs://10.0.0.1:9200/data/my-dataset.lance")
        .readOptions(options)
        .build()) {
    // ... use the dataset
}

For writes, the same storageOptions(...) setter is available on WriteDatasetBuilder and WriteFragmentBuilder.

The Master address can be resolved from (in priority order):

The goosefs_master_addr storage option (supports HA: "addr1:port,addr2:port").
The GOOSEFS_MASTER_ADDR environment variable.
The host and port from the URL authority.

The following keys can be used as both environment variables or keys in the storage_options parameter:

Key	Description
`goosefs_master_addr` / `GOOSEFS_MASTER_ADDR`	GooseFS Master address. Supports a single address (`host:port`) or comma-separated HA addresses (`addr1:port,addr2:port`). Optional if the address is provided in the URL.
`goosefs_write_type` / `GOOSEFS_WRITE_TYPE`	Write type, e.g. `MUST_CACHE`, `CACHE_THROUGH`, `THROUGH`, `ASYNC_THROUGH`. Optional.
`goosefs_block_size` / `GOOSEFS_BLOCK_SIZE`	GooseFS block size in bytes (this is the GooseFS-side block size, not Lance's I/O block size). Optional.
`goosefs_chunk_size` / `GOOSEFS_CHUNK_SIZE`	Chunk size in bytes used when reading or writing files. Optional.
`goosefs_auth_type` / `GOOSEFS_AUTH_TYPE`	Authentication type. Either `nosasl` or `simple` (case-insensitive; the value is passed through to OpenDAL). Optional.
`goosefs_auth_username` / `GOOSEFS_AUTH_USERNAME`	Username used in `simple` authentication mode. Optional.

Running the GooseFS integration tests

The Rust integration tests for GooseFS live at rust/lance-io/tests/goosefs_integration.rs and are gated behind feature flags. They require a reachable GooseFS cluster (configured via the GOOSEFS_MASTER_ADDR and GOOSEFS_AUTH_TYPE environment variables) and can be run with:

cargo test -p lance-io --features "goosefs goosefs-test" \
    --test goosefs_integration -- --ignored --nocapture --test-threads=1