Dataset Configuration
Overview
Amp uses a primary TOML configuration file to define settings for writing and serving datasets. A complete example is available in config.sample.toml.
Set the file path using the AMP_CONFIG environment variable.
Dataset extraction requires three storage directories:
- manifests_dir: Dataset definitions used as input for extraction.
- providers_dir: Provider definitions for external data sources such as Firehose.
- data_dir: Extracted Parquet tables (initially empty on first use).
Each directory is configured separately, so manifests, provider definitions, and extracted data can live in different locations from one environment to the next.
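As a rough sketch of the setup described above, the snippet below creates the three directories, writes a minimal config file that points at them, and exports AMP_CONFIG. The AMP_DEMO_DIR variable and the amp-demo location are hypothetical, and placing the three keys at the top level of the TOML file is an assumption; config.sample.toml is the reference for the actual layout.

```shell
# Hypothetical working directory for a local trial run.
AMP_DEMO_DIR="${AMP_DEMO_DIR:-$PWD/amp-demo}"

# Create the three storage directories (data_dir starts out empty).
mkdir -p "$AMP_DEMO_DIR/manifests" "$AMP_DEMO_DIR/providers" "$AMP_DEMO_DIR/data"

# Write a minimal config file. Top-level key placement is an assumption.
cat > "$AMP_DEMO_DIR/config.toml" <<EOF
manifests_dir = "$AMP_DEMO_DIR/manifests"
providers_dir = "$AMP_DEMO_DIR/providers"
data_dir = "$AMP_DEMO_DIR/data"
EOF

# Point Amp at the config file.
export AMP_CONFIG="$AMP_DEMO_DIR/config.toml"
```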
Environment Variable Overrides
All configuration values can be overridden using environment variables prefixed with AMP_CONFIG_.
Basic Override
    export AMP_CONFIG_DATA_DIR=/path/to/data
Nested Values
Use double underscores (__) to represent nested fields:
| Config Key | Environment Variable |
|---|---|
| metadata_db.url | AMP_CONFIG_METADATA_DB__URL |
| metadata_db.pool_size | AMP_CONFIG_METADATA_DB__POOL_SIZE |
| writer.compression | AMP_CONFIG_WRITER__COMPRESSION |
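The naming rule in the table can be expressed as a small shell function. This is only an illustration of the convention; config_key_to_env is a hypothetical helper, not part of Amp.

```shell
# Hypothetical helper: map a dotted config key to its override variable name.
# Dots become double underscores and letters are upper-cased.
config_key_to_env() {
  printf 'AMP_CONFIG_%s\n' "$(printf '%s' "$1" | sed 's/\./__/g' | tr '[:lower:]' '[:upper:]')"
}

config_key_to_env metadata_db.pool_size   # prints AMP_CONFIG_METADATA_DB__POOL_SIZE
```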
Service Addresses
Optional keys let you customize host and port bindings:
| Key | Service | Default |
|---|---|---|
| flight_addr | Arrow Flight RPC server | 0.0.0.0:1602 |
| jsonl_addr | JSON Lines server | 0.0.0.0:1603 |
| admin_api_addr | Admin API server | 0.0.0.0:1610 |
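For example, the servers could be restricted to the loopback interface while keeping the default ports. The variable names below are derived from the AMP_CONFIG_ prefix rule and assume these keys sit at the top level of the config file.

```shell
# Bind each server to localhost only, keeping the default ports.
# Variable names assume the keys are top-level config entries.
export AMP_CONFIG_FLIGHT_ADDR=127.0.0.1:1602
export AMP_CONFIG_JSONL_ADDR=127.0.0.1:1603
export AMP_CONFIG_ADMIN_API_ADDR=127.0.0.1:1610
```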
Logging
Log levels: error | warn | info | debug | trace
Default: debug.
For granular control, use RUST_LOG.
Object Store Configuration
Directory fields (*_dir) accept either local filesystem paths or object store URLs. Object stores are recommended for production workloads.
S3-Compatible Stores
URL Format
    s3://<bucket>
Environment Variables
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_DEFAULT_REGION
- AWS_ENDPOINT
- AWS_SESSION_TOKEN
- AWS_ALLOW_HTTP (enables non-TLS connections)
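A minimal sketch of pointing data_dir at an S3-compatible store running locally (for example MinIO). Every value is a placeholder, the bucket name and endpoint are hypothetical, and the "true" value for AWS_ALLOW_HTTP is an assumption about the accepted format.

```shell
# Placeholder credentials; replace every value with your own.
export AWS_ACCESS_KEY_ID="example-access-key"
export AWS_SECRET_ACCESS_KEY="example-secret-key"
export AWS_DEFAULT_REGION="us-east-1"

# Hypothetical local endpoint served over plain HTTP.
export AWS_ENDPOINT="http://localhost:9000"
export AWS_ALLOW_HTTP="true"   # assumed value format; needed for non-TLS endpoints

# Store extracted tables in a bucket instead of a local path.
export AMP_CONFIG_DATA_DIR="s3://example-bucket"
```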
Google Cloud Storage (GCS)
URL Format
    gs://<bucket>
Authentication Options
- GOOGLE_SERVICE_ACCOUNT_PATH
- GOOGLE_SERVICE_ACCOUNT_KEY
- Application Default Credentials (ADC)
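A comparable sketch for GCS, using a service account key file. The file path and bucket name are hypothetical placeholders.

```shell
# Hypothetical path to a downloaded service account key file.
export GOOGLE_SERVICE_ACCOUNT_PATH="$HOME/.config/amp/service-account.json"

# Store extracted tables in a GCS bucket instead of a local path.
export AMP_CONFIG_DATA_DIR="gs://example-bucket"
```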
Datasets
Identity and Versioning
A dataset identity consists of:
- Namespace (e.g., edgeandnode, my_org, or _)
- Name (e.g., eth_mainnet)
- Version/Revision (1.0.0, latest, dev, or a manifest hash)
Reference Format
    namespace/name@revision
Examples
    my_org/eth_mainnet@1.0.0
    my_org/eth_mainnet@latest
    my_org/eth_mainnet@dev
    _/eth_mainnet@latest
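To make the reference format concrete, here is a sketch that splits a reference into its three parts using POSIX parameter expansion. parse_dataset_ref and the ds_* variables are hypothetical names, and the helper assumes all three parts are present.

```shell
# Hypothetical helper: split "namespace/name@revision" into its parts.
# Assumes the reference contains both "/" and "@".
parse_dataset_ref() {
  ref="$1"
  ds_namespace="${ref%%/*}"   # text before the first "/"
  rest="${ref#*/}"            # text after the first "/"
  ds_name="${rest%%@*}"       # text before the "@"
  ds_revision="${rest#*@}"    # text after the "@"
}

parse_dataset_ref "my_org/eth_mainnet@1.0.0"
printf '%s %s %s\n' "$ds_namespace" "$ds_name" "$ds_revision"
# prints: my_org eth_mainnet 1.0.0
```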
SQL Schema Names
Datasets appear as quoted schemas in SQL:
    SELECT * FROM "namespace/name".table_name;
Examples
    "my_org/eth_mainnet".blocks
    "my_org/eth_mainnet".logs
Quoting is required because schema names contain a forward slash (/).
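When queries are assembled in a script, the quoting can be handled in one place. The helper below is a hypothetical illustration that builds the quoted, schema-qualified table name.

```shell
# Hypothetical helper: build a quoted, schema-qualified table name
# from a namespace, a dataset name, and a table name.
qualified_table() {
  printf '"%s/%s".%s\n' "$1" "$2" "$3"
}

query="SELECT * FROM $(qualified_table my_org eth_mainnet blocks);"
printf '%s\n' "$query"
# prints: SELECT * FROM "my_org/eth_mainnet".blocks;
```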
Dataset Categories
Amp supports:
- Raw datasets — Directly extracted from external data sources (Firehose, EVM RPC).
- Derived datasets — SQL transformations built on top of existing datasets.
Details for the raw datasets currently implemented:
- EVM RPC dataset docs
- Firehose dataset docs
Generating Raw Dataset Manifests
Use ampctl gen-manifest to generate JSON manifests defining schema and extraction configuration.
Examples
    # EVM RPC dataset
    ampctl gen-manifest --network mainnet --kind evm-rpc --name eth_mainnet
    # With custom start block
    ampctl gen-manifest --network mainnet --kind evm-rpc --name eth_mainnet --start-block 1000000
    # Firehose dataset
    ampctl gen-manifest --network mainnet --kind firehose --name eth_firehose
    # Write to a specific file
    ampctl gen-manifest --network mainnet --kind evm-rpc --name eth_mainnet -o ./manifests_dir/eth_mainnet.json
    # Write to a directory (creates ./manifests/evm-rpc.json)
    ampctl gen-manifest --network mainnet --kind evm-rpc --name eth_mainnet -o ./manifests/
    # Finalized blocks only
    ampctl gen-manifest --network mainnet --kind evm-rpc --name eth_mainnet --finalized-blocks-only
Parameters
| Flag | Description |
|---|---|
| --network | Target network (mainnet, goerli, polygon, anvil) |
| --kind | Dataset type (evm-rpc, firehose, eth-beacon) |
| --name | Dataset name |
| --out, -o | File or directory to write output |
| --start-block | Start block (default: 0) |
| --finalized-blocks-only | Include only finalized blocks |
The output manifest includes the complete table and column schema.
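When several manifests are generated in a script, the documented flags can be wrapped in a helper. The sketch below is hypothetical: gen_manifest and DRY_RUN are illustrative names, the network is fixed to mainnet for brevity, and with DRY_RUN=1 the command is printed rather than executed.

```shell
# Hypothetical wrapper around `ampctl gen-manifest`.
# Arguments: dataset kind, dataset name, output directory.
# With DRY_RUN=1 the command line is printed instead of being run.
gen_manifest() {
  kind="$1"; name="$2"; out_dir="$3"
  set -- ampctl gen-manifest --network mainnet --kind "$kind" --name "$name" -o "$out_dir/$name.json"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

DRY_RUN=1
gen_manifest evm-rpc eth_mainnet ./manifests
# prints: ampctl gen-manifest --network mainnet --kind evm-rpc --name eth_mainnet -o ./manifests/eth_mainnet.json
```

Unset DRY_RUN (or set it to 0) to run the commands for real.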
Registering and Deploying Datasets
    # Register dataset (updates dev tag)
    ampctl dataset register my_namespace/eth_mainnet ./manifest.json
    # Register with a specific version
    ampctl dataset register my_namespace/eth_mainnet ./manifest.json --tag 1.0.0
    # Deploy for extraction
    ampctl dataset deploy my_namespace/eth_mainnet@1.0.0
Providers
Providers define external data sources used to extract raw blockchain data. Each provider is defined in a TOML file stored inside providers_dir.
Environment variable substitution is supported using ${VAR_NAME}.
Provider Types
Valid values for the kind field:
- evm-rpc: Ethereum JSON-RPC (HTTP/WebSocket/IPC)
- firehose: Firehose gRPC
- eth-beacon: Ethereum Beacon Chain REST API
Base Structure
Every provider configuration must include:
- kind: Provider type
- network: Network identifier (e.g., mainnet, goerli)
The provider name defaults to the filename without the .toml extension unless overridden with the name field.
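Putting the pieces above together, a provider file might be created as follows. This is a sketch under stated assumptions: PROVIDERS_DIR, the filename, and ETH_RPC_URL are hypothetical, and url is an assumed field name (the sample files list the actual fields).

```shell
# Hypothetical providers directory; use the providers_dir from your config.
PROVIDERS_DIR="${PROVIDERS_DIR:-$PWD/providers}"
mkdir -p "$PROVIDERS_DIR"

# The quoted 'EOF' stops the shell from expanding ${ETH_RPC_URL}, so the
# placeholder is written literally and left for substitution at load time.
# "url" is an assumed field name; kind and network are the required fields.
cat > "$PROVIDERS_DIR/eth_mainnet_rpc.toml" <<'EOF'
kind = "evm-rpc"
network = "mainnet"
url = "${ETH_RPC_URL}"
EOF

# With no name field set, the provider name is the filename minus .toml.
provider_name="$(basename "$PROVIDERS_DIR/eth_mainnet_rpc.toml" .toml)"
printf '%s\n' "$provider_name"   # prints: eth_mainnet_rpc
```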
Sample Provider Configurations
Sample files available in the sample directory:
- evm-rpc.sample.toml: Configuration for Ethereum-compatible JSON-RPC endpoints. Includes fields for URL (HTTP/WebSocket/IPC), concurrent request limits, RPC batching, rate limiting, and receipt fetching options.
- firehose.sample.toml: Configuration for StreamingFast Firehose gRPC endpoints. Includes fields for gRPC URL and authentication token.
- eth-beacon.sample.toml: Configuration for Ethereum Beacon Chain REST API endpoints. Includes fields for API URL, concurrent request limits, and rate limiting.
Each sample documents required and optional fields along with defaults.