6 Data Access
CalCOFI database releases are published as Parquet files on a public Google Cloud Storage (GCS) bucket. You can query them directly with DuckDB — from R, Python, the DuckDB CLI, or any DuckDB client — with no credentials, no API server, and no full download. DuckDB reads only the columns and row groups a query actually touches, straight over HTTPS.
This page covers querying the release Parquet directly. For convenience-wrapped biological ↔︎ environmental matching, see Matching Helpers.
6.1 Where the data lives
Each release is a versioned folder:
gs://calcofi-db/ducklake/releases/{version}/
├── catalog.json        # table list, row counts, "partitioned" flag
├── relationships.json  # primary keys + foreign keys
├── RELEASE_NOTES.md
└── parquet/
    ├── {table}.parquet                   # single-file tables
    └── {table}/cruise_key=.../*.parquet  # hive-partitioned tables
- Public HTTPS base: https://storage.googleapis.com/calcofi-db/ducklake/releases/{version}/parquet, where {version} is e.g. v2026.05.14. The current release pointer is at releases/latest.txt and the table list at releases/{version}/catalog.json (both can be resolved programmatically; see the sketch after this list).
- Most tables are single-file ({table}.parquet). The large CTD tables ctd_thin and ctd_summary are hive-partitioned by cruise_key (catalog.json flags these with "partitioned": true).
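A minimal R sketch of resolving the current version and listing its tables; the exact JSON layout of catalog.json is an assumption here (the field names are whatever the release writes, so inspect the parsed object):
# Resolve the current release version from the latest.txt pointer,
# then read the table catalog (names, row counts, "partitioned" flags).
releases <- "https://storage.googleapis.com/calcofi-db/ducklake/releases"
version  <- trimws(readLines(paste0(releases, "/latest.txt"), warn = FALSE))
catalog  <- jsonlite::fromJSON(paste0(releases, "/", version, "/catalog.json"))
version          # e.g. "v2026.05.14"
str(catalog, 1)  # field names depend on the catalog's actual layout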
6.2 Setup: the httpfs extension
DuckDB reads remote Parquet through its httpfs extension. Spatial queries (e.g. distances between casts and tows) also need the spatial extension. Both are one-time installs, then loaded once per session:
INSTALL httpfs; LOAD httpfs;
INSTALL spatial; LOAD spatial; -- only if you use ST_* functions
6.3 Single-file tables
A single-file table is just an HTTPS URL handed to read_parquet():
SELECT species_id, scientific_name, common_name, worms_id
FROM read_parquet('https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.05.14/parquet/species.parquet')
WHERE scientific_name = 'Sardinops sagax';
Joins work the same way — name each table’s URL. The biological hierarchy is ichthyo → net → tow → site (and ichthyo → species); the environmental hierarchy is bottle_measurement → bottle → casts:
SELECT i.ichthyo_uuid, i.life_stage, i.tally, sp.scientific_name, t.time_start
FROM read_parquet('https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.05.14/parquet/ichthyo.parquet') i
JOIN read_parquet('https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.05.14/parquet/species.parquet') sp ON i.species_id = sp.species_id
JOIN read_parquet('https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.05.14/parquet/net.parquet') n ON i.net_uuid = n.net_uuid
JOIN read_parquet('https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.05.14/parquet/tow.parquet') t ON n.tow_uuid = t.tow_uuid
WHERE sp.scientific_name = 'Sardinops sagax'
LIMIT 10;
6.4 Hive-partitioned tables
ctd_thin and ctd_summary are partitioned into one folder per cruise_key. Reading them needs an s3-style glob (plain HTTPS URLs can’t glob), so configure DuckDB’s anonymous s3 access — GCS is s3-compatible:
INSTALL httpfs; LOAD httpfs;
SET s3_region = 'auto';
SET s3_endpoint = 'storage.googleapis.com';
SET s3_url_style = 'path';
SET s3_access_key_id = '';
SET s3_secret_access_key = '';
SELECT cruise_key, count(*) AS n
FROM read_parquet(
's3://calcofi-db/ducklake/releases/v2026.05.14/parquet/ctd_thin/**/*.parquet',
hive_partitioning = true)
GROUP BY cruise_key
ORDER BY cruise_key;
Because cruise_key is the partition column, filtering on it lets DuckDB prune whole partitions — a single-cruise query never opens the other ~96 folders.
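The same pruning applies from R (DBI + duckdb, as in the next section); a sketch, where the cruise_key value is a placeholder rather than a known key (take real keys from the GROUP BY query above or from catalog.json):
library(DBI)
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "INSTALL httpfs; LOAD httpfs;")
dbExecute(con, "SET s3_region = 'auto';
                SET s3_endpoint = 'storage.googleapis.com';
                SET s3_url_style = 'path';
                SET s3_access_key_id = '';
                SET s3_secret_access_key = '';")
# Filtering on the partition column prunes to the matching cruise_key=... folder;
# '2018-01' is illustrative only.
dbGetQuery(con, "
  SELECT count(*) AS n_rows
  FROM read_parquet(
    's3://calcofi-db/ducklake/releases/v2026.05.14/parquet/ctd_thin/**/*.parquet',
    hive_partitioning = true)
  WHERE cruise_key = '2018-01'")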
6.5 From R
Use DBI + duckdb directly:
library(DBI)
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "INSTALL httpfs; LOAD httpfs;")
base <- "https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.05.14/parquet"
d <- dbGetQuery(con, sprintf(
"SELECT scientific_name, common_name, worms_id
FROM read_parquet('%s/species.parquet')
WHERE common_name ILIKE '%%sardine%%'", base))
Or let the calcofi4r package register every release table as a view for you — it handles httpfs, the partitioned tables, and local caching:
# remotes::install_github("calcofi/calcofi4r")
library(calcofi4r)
con <- cc_get_db() # latest release, tables as views
DBI::dbListTables(con)
# lazy dbplyr against the remote parquet
library(dplyr)
tbl(con, "species") |> filter(scientific_name == "Sardinops sagax")
# or a one-off SQL query
cc_query("SELECT count(*) FROM ichthyo")
6.6 From Python
import duckdb
con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
base = "https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.05.14/parquet"
df = con.sql(f"""
SELECT scientific_name, common_name, worms_id
FROM read_parquet('{base}/species.parquet')
WHERE common_name ILIKE '%sardine%'
""").df()6.7 Reproducibility
Because every query is plain SQL against immutable, versioned, public Parquet, a CalCOFI result is reproducible by anyone — pin the {version} and re-run the SQL.
The Integrated App builds on this: its data download bundle ships a query/ folder alongside the data:
data/original/{bio,env}.csv       ← query/{bio,env}.sql
data/integrated/integrated_*.csv  ← query/integrated_*.sql
query/manifest.json               release version, filters, GCS source URLs,
                                  per-file row counts + md5 checksums
query/REPRODUCE.md                DuckDB-CLI / Python / R re-run snippets
Each *.sql file is fully interpolated, GCS-URL-based, and copy-paste runnable (prefixed with the INSTALL/LOAD it needs). Re-running query/integrated_*.sql in DuckDB returns exactly the rows in the matching .csv, and the md5 checksums in manifest.json let you confirm it. The same SQL is what calcofi4r::cc_match_bio_env() executes and attaches as attr(x, "sql"), so the app, the package, and a hand-written query all share a single source of truth.
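The confirmation step can be scripted; a minimal R sketch, assuming manifest.json carries a files table with path and md5 columns (those field names are illustrative, so check your bundle's actual layout):
# Recompute each downloaded file's md5 and compare against manifest.json.
manifest <- jsonlite::fromJSON("query/manifest.json")
files    <- manifest$files                      # assumed: data frame with path + md5
files$md5_local <- unname(tools::md5sum(files$path))
all(files$md5_local == files$md5)               # TRUE if the bundle matches the manifest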
6.8 Worked example: sardine larvae + temperature
The recurring example through these pages is Pacific sardine (Sardinops sagax) larvae matched to CTD-bottle temperature, Q1 2018, with relaxed (5 km / 72 hr) matching.
Why Q1 2018 rather than a more recent year? CTD-bottle environmental data in the current release ends 2021-05, while net-tow biological data runs later; Q1 2018 has ample overlap of both.
Done as direct SQL, the query is a temporal interval join plus a spatial ST_Distance_Sphere filter. After the INSTALL/LOAD setup above, run the query below. This block is character-for-character what calcofi4r::cc_match_ichthyo_by_name() returns as attr(d, "sql"); the two produce the same 13 rows. The Integrated App download bundle builds query/integrated_*.sql with the same cc_match_bio_env() engine (its filter set is taxa + quarters + dates rather than life_stage, so a sardine / Q1 2018 bundle returns the egg + larva superset: same matching mechanics, same reproducibility):
WITH bio AS (
SELECT
i.ichthyo_uuid::VARCHAR AS bio_id,
t.time_start AS bio_datetime,
s.longitude AS bio_lon,
s.latitude AS bio_lat,
n.std_haul_factor * i.tally / nullif(n.prop_sorted, 0) AS bio_value,
sp.scientific_name,
sp.common_name,
sp.worms_id,
i.life_stage,
i.tally
FROM read_parquet('https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.05.14/parquet/ichthyo.parquet') i
JOIN read_parquet('https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.05.14/parquet/species.parquet') sp ON i.species_id = sp.species_id
JOIN read_parquet('https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.05.14/parquet/net.parquet') n ON i.net_uuid = n.net_uuid
JOIN read_parquet('https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.05.14/parquet/tow.parquet') t ON n.tow_uuid = t.tow_uuid
JOIN read_parquet('https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.05.14/parquet/site.parquet') s ON t.site_uuid = s.site_uuid
WHERE i.tally IS NOT NULL
AND i.measurement_type IS NULL
AND t.time_start IS NOT NULL
AND s.longitude IS NOT NULL
AND s.latitude IS NOT NULL
AND sp.scientific_name IN ('Sardinops sagax')
AND i.life_stage IN ('larva')
AND t.time_start >= TIMESTAMP '2018-01-01'
AND t.time_start <= TIMESTAMP '2018-03-31'
),
env AS (
SELECT
bm.bottle_measurement_id AS env_id,
c.datetime_utc AS env_datetime,
c.lon_dec AS env_lon,
c.lat_dec AS env_lat,
bm.measurement_value AS env_value,
b.depth_m AS env_depth_m,
bm.measurement_type AS measurement_type
FROM read_parquet('https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.05.14/parquet/bottle_measurement.parquet') bm
JOIN read_parquet('https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.05.14/parquet/bottle.parquet') b ON bm.bottle_id = b.bottle_id
JOIN read_parquet('https://storage.googleapis.com/calcofi-db/ducklake/releases/v2026.05.14/parquet/casts.parquet') c ON b.cast_id = c.cast_id
WHERE bm.measurement_type = 'temperature'
AND bm.measurement_value IS NOT NULL
AND c.datetime_utc IS NOT NULL
AND c.lon_dec IS NOT NULL
AND c.lat_dec IS NOT NULL
AND c.datetime_utc >= TIMESTAMP '2018-01-01' - INTERVAL '72 hours'
AND c.datetime_utc <= TIMESTAMP '2018-03-31' + INTERVAL '72 hours'
),
matched AS (
-- temporal interval join: every env observation within ± max_time_hr
SELECT
bio.*,
env.* EXCLUDE (env_lon, env_lat),
abs(epoch(bio.bio_datetime) - epoch(env.env_datetime)) / 3600.0 AS time_diff_hr,
ST_Distance_Sphere(
ST_Point(bio.bio_lon, bio.bio_lat),
ST_Point(env.env_lon, env.env_lat)) / 1000.0 AS dist_km
FROM bio
JOIN env
ON env.env_datetime BETWEEN bio.bio_datetime - INTERVAL '72 hours'
AND bio.bio_datetime + INTERVAL '72 hours'
),
within AS (
-- spatial filter: keep pairs within max_dist_km
SELECT * FROM matched
WHERE dist_km <= 5
),
ranked AS (
SELECT
*,
min(time_diff_hr) OVER (PARTITION BY bio_id) AS mn_time_diff_hr,
min(dist_km) OVER (PARTITION BY bio_id) AS mn_dist_km
FROM within
)
-- one row per bio observation (× measurement_type): env values aggregated
SELECT
* EXCLUDE (
env_id, env_value, env_datetime, env_depth_m,
time_diff_hr, dist_km, mn_time_diff_hr, mn_dist_km),
count(*) AS n_env,
avg(env_value) AS env_value,
CASE WHEN count(*) = 1 THEN 0
ELSE coalesce(stddev_samp(env_value), 0) END AS env_value_sd,
avg(env_depth_m) AS env_depth_m,
min(env_datetime) AS env_datetime_min,
max(env_datetime) AS env_datetime_max,
avg(dist_km) AS dist_km,
avg(time_diff_hr) AS time_diff_hr
FROM ranked
WHERE time_diff_hr = mn_time_diff_hr
GROUP BY ALL
ORDER BY bio_id;
The next page, Matching Helpers, shows this same query as a one-liner.