Reads CSV files from a directory and prepares them for ingestion into a database. This function is the primary entry point for the CalCOFI data ingestion workflow. The steps it performs are listed under Details.
Usage
read_csv_files(
  provider,
  dataset,
  dir_data = "~/My Drive/projects/calcofi/data",
  url_gdata =
    "https://drive.google.com/drive/u/0/folders/1xxdWa4mWkmfkJUQsHxERTp9eBBXBMbV7",
  use_gdrive = TRUE,
  email = "ben@ecoquants.com"
)
Arguments
- provider
Data provider (e.g., "swfsc.noaa.gov")
- dataset
Dataset name (e.g., "calcofi-db")
- dir_data
Path to the local CalCOFI base data folder, with CSVs under the provider/dataset directory. Default: "~/My Drive/projects/calcofi/data"
- url_gdata
URL of the CalCOFI base data folder in Google Drive (with CSVs under the provider/dataset directory), used to retrieve metadata about the CSVs. Default: "https://drive.google.com/drive/u/0/folders/1xxdWa4mWkmfkJUQsHxERTp9eBBXBMbV7"
- use_gdrive
Whether to query Google Drive for metadata. Default: TRUE
- email
Google Drive authentication email (used if use_gdrive=TRUE). Default: "ben@ecoquants.com"
Value
A list containing:
- d_csv
List with CSV data including:
  - data: tibble with columns (tbl, csv, data, nrow, ncol, flds)
  - tables: summary of tables (tbl, nrow, ncol)
  - fields: summary of fields (tbl, fld, type)
- d_gdata
Google Drive metadata (if use_gdrive=TRUE) including file names, IDs, modification times, and web links
- d_tbls_rd
Table redefinition data frame with columns: tbl_old, tbl_new, tbl_description
- d_flds_rd
Field redefinition data frame with columns: tbl_old, tbl_new, fld_old, fld_new, order_old, order_new, type_old, type_new, fld_description, notes, mutation
- workflow_info
Information about the workflow including workflow name, QMD file path, and URL
- paths
List of file paths used in the workflow
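The column names of d_flds_rd suggest how a redefinition could be applied to a raw table. The following is only a sketch under assumptions, not the package's implementation: the helper `apply_field_redefinitions()` is hypothetical, the mapping from type_new values to R coercion functions is assumed, and the mutation column is ignored.

```r
# Hypothetical helper: rename, reorder, and retype the fields of one table
# using a field redefinition data frame shaped like d_flds_rd.
apply_field_redefinitions <- function(d, d_flds_rd, tbl) {
  rd <- d_flds_rd[d_flds_rd$tbl_old == tbl, ]
  rd <- rd[order(rd$order_new), ]
  stopifnot(all(rd$fld_old %in% names(d)))

  # select the old fields in their new order, then rename them
  out <- d[, rd$fld_old, drop = FALSE]
  names(out) <- rd$fld_new

  # assumed mapping from type_new values to coercion functions
  casters <- list(
    integer   = as.integer,
    numeric   = as.numeric,
    character = as.character)
  for (i in seq_len(nrow(rd))) {
    cast <- casters[[rd$type_new[i]]]
    if (!is.null(cast))
      out[[rd$fld_new[i]]] <- cast(out[[rd$fld_new[i]]])
  }
  out
}

# Made-up example data (not from the CalCOFI database)
d_old <- data.frame(
  CruiseID = c("1", "2"),
  ShipName = c("x", "y"),
  stringsAsFactors = FALSE)
d_flds_rd <- data.frame(
  tbl_old   = "Cruise",
  tbl_new   = "cruise",
  fld_old   = c("CruiseID", "ShipName"),
  fld_new   = c("cruise_id", "ship_name"),
  order_old = 1:2,
  order_new = c(2L, 1L),
  type_old  = "character",
  type_new  = c("integer", "character"),
  stringsAsFactors = FALSE)

d_new <- apply_field_redefinitions(d_old, d_flds_rd, "Cruise")
str(d_new)
```

Keeping the renaming rules in a data frame rather than in code means they can be reviewed and edited as a CSV without touching the ingestion script.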
Details
1. Reads all CSV files from the specified provider/dataset directory
2. Extracts metadata about tables and fields from the CSV files
3. Creates or reads redefinition files for table and field transformations
4. Optionally queries Google Drive for file metadata (creation dates, etc.)
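The first two steps above can be sketched in base R. This is a minimal illustration rather than the function's actual code: the helper `summarize_csv_dir()` and the temporary example files are assumptions, and the real function returns tibbles with more columns than shown here.

```r
# Hypothetical helper: read every CSV in a directory, then summarize
# its tables (tbl, nrow, ncol) and fields (tbl, fld, type).
summarize_csv_dir <- function(dir_csv) {
  csvs <- list.files(dir_csv, pattern = "\\.csv$", full.names = TRUE)
  tbls <- tools::file_path_sans_ext(basename(csvs))
  data <- lapply(csvs, read.csv, stringsAsFactors = FALSE)
  names(data) <- tbls

  # one row per table
  tables <- data.frame(
    tbl  = tbls,
    nrow = vapply(data, nrow, integer(1)),
    ncol = vapply(data, ncol, integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE)

  # one row per field, with the class R inferred on read
  fields <- do.call(rbind, lapply(tbls, function(tbl) {
    d <- data[[tbl]]
    data.frame(
      tbl  = tbl,
      fld  = names(d),
      type = vapply(d, function(x) class(x)[1], character(1)),
      row.names = NULL,
      stringsAsFactors = FALSE)
  }))

  list(data = data, tables = tables, fields = fields)
}

# Made-up example files in a fresh temporary directory
dir_tmp <- tempfile("csv_sketch_")
dir.create(dir_tmp)
write.csv(
  data.frame(cruise_id = 1:3, ship = c("a", "b", "c")),
  file.path(dir_tmp, "cruise.csv"), row.names = FALSE)
write.csv(
  data.frame(net_id = 1:2, depth_m = c(10.5, 20)),
  file.path(dir_tmp, "net.csv"), row.names = FALSE)

s <- summarize_csv_dir(dir_tmp)
s$tables
s$fields
```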
The function returns a list containing:
- Raw CSV data and metadata (d_csv)
- Table redefinitions (d_tbls_rd) for renaming/describing tables
- Field redefinitions (d_flds_rd) for renaming/typing/transforming fields
- Google Drive metadata if requested (d_gdata)
- Workflow information and file paths
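The "creates or reads" behavior for redefinition files might follow a pattern like the one below. This is a sketch only: the helper `get_tbls_rd()`, the file location, and the choice of lower-cased names as the starting tbl_new values are all assumptions, not documented behavior.

```r
# Hypothetical helper: read a table redefinition CSV if it exists,
# otherwise create a starter file that can be edited by hand.
get_tbls_rd <- function(path_csv, tbls) {
  if (file.exists(path_csv)) {
    # read everything as character so empty descriptions stay ""
    return(read.csv(
      path_csv, colClasses = "character", stringsAsFactors = FALSE))
  }
  d <- data.frame(
    tbl_old         = tbls,
    tbl_new         = tolower(tbls),  # assumed default naming
    tbl_description = "",
    stringsAsFactors = FALSE)
  write.csv(d, path_csv, row.names = FALSE)
  d
}

path_rd <- tempfile(fileext = ".csv")
d1 <- get_tbls_rd(path_rd, c("Cruise", "Net"))  # first call creates the file
d2 <- get_tbls_rd(path_rd, c("Cruise", "Net"))  # second call reads it back
```

With this pattern, manual edits to the CSV persist across runs because an existing file is never overwritten.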
Examples
if (FALSE) { # \dontrun{
  # Basic usage
  d <- read_csv_files(
    provider = "swfsc.noaa.gov",
    dataset  = "calcofi-db")

  # Access the raw CSV data
  d$d_csv$data

  # Check table redefinitions
  d$d_tbls_rd

  # Check field redefinitions
  d$d_flds_rd

  # Without Google Drive metadata
  d <- read_csv_files(
    provider   = "swfsc.noaa.gov",
    dataset    = "calcofi-db",
    use_gdrive = FALSE)
} # }