Core engine relating biological observations to environmental observations by
matching them on time and space. Builds a single DuckDB SQL string — a
temporal interval join (within max_time_hr) plus a spatial filter
(ST_Distance_Sphere within max_dist_km) — that reads directly from the
Parquet files of a frozen CalCOFI release on Google Cloud Storage, runs it,
and attaches the fully-interpolated SQL and query metadata as attributes.
Usage
cc_match_bio_env(
bio,
env,
max_dist_km = 2,
max_time_hr = 6,
join_method = c("nearest_time", "nearest_dist", "average"),
con = NULL,
version = "latest",
collect = TRUE,
return_sql = FALSE
)Arguments
- bio
SQL
SELECTstring producing the biological side. Must yield columnsbio_id(unique per observation),bio_datetime(TIMESTAMP),bio_lon,bio_lat(decimal degrees) andbio_value(DOUBLE). Any additional columns (e.g.scientific_name) are carried through to the output as grouping keys.- env
SQL
SELECTstring producing the environmental side. Must yield exactlyenv_id,env_datetime(TIMESTAMP),env_lon,env_lat,env_value(DOUBLE),env_depth_m(DOUBLE) andmeasurement_type.- max_dist_km
Maximum match distance in kilometers (default: 2).
- max_time_hr
Maximum match time difference in hours (default: 6).
- join_method
One of
"nearest_time"(default — keep the env observation(s) closest in time, averaging ties),"nearest_dist"(closest in space) or"average"(average every env observation in the window).- con
Optional DuckDB connection. If
NULL(default) a temporary in-memory connection is created (and closed again whencollect = TRUE).- version
Release version string (e.g.
"v2026.05.14") or"latest"(default). Recorded in the query metadata; the actual table URLs come frombio/env.- collect
If
TRUE(default) execute and return a tibble; ifFALSEreturn a lazydplyr::tbl()reference.- return_sql
If
TRUE, return the interpolated SQL string (withquery_metaattached) without executing. Default:FALSE.
Value
When return_sql = TRUE, a length-1 character vector of SQL with
attr(., "query_meta"). Otherwise a tibble (or lazy tbl) of one row per
biological observation per measurement_type, with the matched
env_value, env_value_sd, env_depth_m, n_env, dist_km and
time_diff_hr, plus all bio columns. The result carries attr(., "sql")
and attr(., "query_meta") (package + release version, parameters and GCS
source URLs).
Details
Most users want a wrapper (cc_match_ichthyo_by_name(),
cc_match_ichthyo_by_taxon() or cc_match_zooplankton_biomass()) rather
than calling cc_match_bio_env() directly.
Examples
if (FALSE) { # \dontrun{
# build bio + env subqueries yourself, or use a wrapper
d <- cc_match_ichthyo_by_name("Sardinops sagax", env_var = "temperature")
# the exact, portable SQL that produced it
cat(attr(d, "sql"))
str(attr(d, "query_meta"))
} # }