Exports all tables from a DuckDB connection to parquet files. Optionally strips provenance columns for public releases.
Usage
write_parquet_outputs(
con,
output_dir,
tables = NULL,
partition_by = NULL,
sort_by = NULL,
strip_provenance = TRUE,
compression = "snappy",
mismatches = NULL,
supplemental = NULL
)

Arguments
- con
DuckDB connection
- output_dir
Directory for parquet files
- tables
Optional vector of table names to export. If NULL, exports all tables.
- partition_by
Named list mapping table names to partition column(s), e.g.
list(ctd_measurement = "cruise_key"). Creates hive-partitioned subdirectories.
- sort_by
Named list mapping table names to sort column(s) for optimized row-group statistics. For Hilbert-curve spatial sorting, use
"hilbert:lon_col,lat_col". Example: list(ctd_measurement = c("measurement_type", "depth_m"), site = "hilbert:longitude,latitude").
- strip_provenance
Remove provenance columns (source*, _ingested_at) (default: TRUE)
- compression
Parquet compression codec (default: "snappy")
- mismatches
Optional named list of mismatch tibbles to include in manifest.json. Expected names:
ships, measurement_types, cruise_keys. Each element should be a tibble (or data frame) describing unresolved entities. Zero-row tibbles are recorded as empty arrays.
- supplemental
Character vector of table names that are supplemental outputs (e.g. wide-format tables for ERDDAP). These are written to parquet and listed in manifest.json under
"supplemental", but are excluded by default from downstream database loading via load_prior_tables().
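For instance, partitioning and sorting can be combined in a single call. This is a sketch only: the table and column names (ctd_measurement, site, cruise_key, measurement_type, depth_m, longitude, latitude) are assumptions drawn from the examples above, not a fixed schema.

```r
# Sketch: hive-partition measurements by cruise and spatially sort sites.
# Table/column names here are illustrative assumptions.
stats <- write_parquet_outputs(
  con,
  output_dir   = "output/parquet",
  partition_by = list(ctd_measurement = "cruise_key"),
  sort_by      = list(
    ctd_measurement = c("measurement_type", "depth_m"),
    site            = "hilbert:longitude,latitude"   # Hilbert-curve spatial sort
  )
)
```

Hilbert-curve sorting clusters spatially adjacent rows into the same row groups, which tightens row-group min/max statistics for bounding-box queries.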
Examples
if (FALSE) { # \dontrun{
con <- get_duckdb_con("calcofi.duckdb")
stats <- write_parquet_outputs(con, "output/parquet")
} # }
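A fuller sketch for a public release, assuming mismatch tibbles were produced earlier in the pipeline; the tibble contents and the supplemental table name "ctd_wide" are hypothetical.

```r
if (FALSE) { # \dontrun{
con <- get_duckdb_con("calcofi.duckdb")
# Record unresolved ships in manifest.json, and mark a hypothetical
# wide-format ERDDAP table as supplemental so that load_prior_tables()
# excludes it from downstream database loading by default.
stats <- write_parquet_outputs(
  con,
  "output/parquet",
  mismatches   = list(ships = tibble::tibble(ship_name = "UNKNOWN VESSEL")),
  supplemental = "ctd_wide"
)
} # }
```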