---
title: "Scrape & Ingest CalCOFI CTD Data into DuckDB"
---
## Overview
This notebook scrapes CalCOFI CTD data from <https://calcofi.org/data/oceanographic-data/ctd-cast-files/>, downloads all CTD .zip files, unzips only the **final** and **preliminary** ones, and ingests them into a DuckDB database.
### Key Features
1. **Web Scraping**: Scrapes the CTD cast files page for all .zip download links
2. **Smart Filtering**:
- Downloads all .zip files for archival completeness
- Only unzips `final` (e.g., 20-2105SH_CTDFinalQC.zip) and `preliminary` (e.g., 20-2507SR_CTDPrelim.zip) files
- Skips unzipping raw/cast/test files to save disk space
3. **Download & Unzip**: Downloads all zip files to `~/My Drive/projects/calcofi/data/calcofi.org/ctd-cast/download` ([calcofi/data](https://drive.google.com/drive/u/0/folders/1xxdWa4mWkmfkJUQsHxERTp9eBBXBMbV7)) but only extracts final and preliminary data
4. **Priority-based Selection** (see the sketch after this list):
- For each cruise (identified by yy-yymm + ship code), selects final if available
- Falls back to preliminary if final not available for that cruise
- Completely excludes raw/test/prodo cast files
5. **Union Strategy**:
   - Merges upcast (U) and downcast (D) CSV files into a single `ctd` table
   - Adds `data_stage` flag: "final" or "preliminary"
- Adds `cast_dir` flag: "U" for upcast, "D" for downcast
6. **Data Standardization**:
- Converts all column names to `lower_snake_case` using `janitor::clean_names()`
- Creates spatial index on `lat_dec` and `lon_dec` fields
- Preserves `cruise_id` and `source_file` metadata
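A minimal sketch of the per-cruise priority logic described in Key Feature 4, using a toy file inventory rather than the real one (illustrative only):

```r
library(dplyr)
library(tibble)

# toy inventory: one cruise with both stages, one with preliminary only
files <- tribble(
  ~cruise_id, ~data_stage,
  "2105SH",   "final",
  "2105SH",   "preliminary",
  "2507SR",   "preliminary"
)

files |>
  mutate(priority = if_else(data_stage == "final", 1, 2)) |>
  group_by(cruise_id) |>
  filter(priority == min(priority)) |> # keep final when present, else preliminary
  ungroup()
# -> 2105SH keeps only its "final" row; 2507SR keeps its "preliminary" row
```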
### Usage
To run the notebook:
- R: `quarto::quarto_render("scrape_ctd.qmd")`
- Terminal: `quarto render scrape_ctd.qmd`
### Notes
- All .zip files are downloaded for archival purposes, but only final/prelim are unzipped
- Only processes CSV files from final (`db-csvs/*.csv`) and preliminary directories
- Downloads can be resumed (already-downloaded files are skipped)
- Both upcast (U) and downcast (D) profiles are included in the database
- The priority system ensures final data are used when available, falling back to preliminary otherwise
- Spatial index enables efficient geographic queries on `lat_dec` and `lon_dec` (see the example query below)
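For example, once the `ctd` table is built and spatially indexed (see the ingest section below), a bounding-box query can hit the geometry column directly. A sketch, assuming the DuckDB connection `con` created in the setup chunk:

```r
# count CTD records within a lon/lat bounding box (illustrative)
DBI::dbGetQuery(con, "
  SELECT COUNT(*) AS n
  FROM ctd
  WHERE ST_Within(point_geom, ST_MakeEnvelope(-125, 32, -117, 35));
")
```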
## Setup
```{r}
#| label: setup
librarian::shelf(
"CalCOFI/calcofi4r",
dplyr,
  DBI,
  dbplyr,
DT,
duckdb,
fs,
glue,
here,
htmltools,
httr,
janitor,
litedown,
lubridate,
mapview,
plotly,
purrr,
readr,
rvest,
sf,
stringr,
tibble,
tidyr,
zip,
quiet = T
)
# helper functions
dt <- function(d, fname, n_page = 10, ...) {
# d <- data.frame(x = 1:9)
lengths <- c(5, 10, 25, 50, 100, 1000)
lengths <- c(lengths[lengths < nrow(d)], nrow(d)) |> unique()
DT::datatable(
d,
extensions = 'Buttons',
options = list(
dom = 'Blfrtpi',
buttons = list(
list(
extend = 'csv',
filename = fname
)
),
pageLength = min(n_page, nrow(d)),
lengthMenu = lengths,
scrollX = T
),
filter = 'top',
# rownames = FALSE,
...
)
}
# variables
url <- "https://calcofi.org/data/oceanographic-data/ctd-cast-files/"
dir_dl <- "~/My Drive/projects/calcofi/data/calcofi.org/ctd-cast/download"
dir_dl <- path_expand(dir_dl)
bind_prelonlat_rds <- glue("{dir_dl}/bind_pre-lon-lat-clean.rds")
pts_filt_rds <- glue("{dir_dl}/pts_filt.rds")
ctd_pq <- glue("{dir_dl}/ctd.parquet")
db_path <- glue("{dir_dl}/calcofi-ctd.duckdb")
db_redo <- T # set to TRUE to delete any existing DuckDB database and re-ingest (downloads are still skipped if already present)
# create directory if it doesn't exist
dir_create(dir_dl, recurse = TRUE)
# delete if exists (for fresh run)
if (file_exists(db_path) & db_redo) {
file_delete(db_path)
}
# setup duckdb with spatial extension
# dbDisconnect(con, shutdown = TRUE); duckdb_shutdown(duckdb()); rm(con)
con <- dbConnect(duckdb::duckdb(read_only = F), dbdir = db_path) # dbListTables(con)
dbExecute(con, "INSTALL spatial;")
dbExecute(con, "LOAD spatial;")
```
- `url`: <`r url`>
## Scrape CTD Download Links
```{r}
#| label: d_zips
# read web page, extract all .zip download links
d_zips <- read_html(url) |>
html_nodes("a[href$='.zip']") |>
html_attr("href") |>
tibble(url = _) |>
mutate(
# ensure full URLs
url = if_else(
str_starts(url, "http"),
url,
paste0("https://calcofi.org", url)
),
# extract file_zip
file_zip = basename(url),
# extract year from URL path
year = str_extract(url, "/(\\d{4})/", group = 1) |> as.integer(),
# extract month from URL path
month = str_extract(url, "-\\d{2}(\\d{2})", group = 1) |> as.integer(),
# extract cruise identifier (yymm + ship code)
cruise_id = str_extract(
file_zip,
"\\d{2}-(\\d{4}[A-Z0-9]{2,4})",
group = 1
),
# TODO: confirm oddball ship codes (normally 2 letters): 0404NHJD (4 letters), 0907M2 (letter + number)
# classify data type based on file_zip patterns
zip_type = case_when(
str_detect(file_zip, "CTDFinal") ~ "final",
str_detect(file_zip, "CTDPrelim") ~ "preliminary",
str_detect(file_zip, "CTDCast|CTD_Cast") ~ "raw",
str_detect(file_zip, "CTDTest") ~ "test",
TRUE ~ "unknown"
)
)
# stop if other detected
if (any(d_zips$zip_type == "unknown")) {
stop("Unknown zip_type detected in file_zips")
}
# show summary (all files)
d_zips |>
mutate(
file_zip = glue("<a href={url}>{file_zip}</a>")
) |>
select(-url) |>
dt(
caption = "All zip files to download",
fname = "ctd_zip_files",
escape = F
)
```
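As a quick illustrative check (not part of the pipeline), the `cruise_id` pattern applied to a typical file name plus the two oddball ship codes flagged in the TODO above (the third file name is hypothetical):

```r
library(stringr)

str_extract(
  c(
    "20-2105SH_CTDFinalQC.zip",   # typical 2-letter ship code
    "20-0404NHJD_CTDFinalQC.zip", # 4-letter ship code
    "20-0907M2_CTDCast.zip"       # letter + number (hypothetical file name)
  ),
  "\\d{2}-(\\d{4}[A-Z0-9]{2,4})",
  group = 1
)
# [1] "2105SH"   "0404NHJD" "0907M2"
```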
## Download and Unzip Files
```{r}
#| label: download_and_unzip
# function to download and optionally unzip a file
download_and_unzip <- function(url, dest_dir, zip_type, unzip = TRUE) {
file_zip <- basename(url)
dest_file <- file.path(dest_dir, file_zip)
dir_unzip <- file.path(dest_dir, str_remove(file_zip, "\\.zip$"))
# skip if already downloaded
if (file_exists(dest_file)) {
message(glue("Already exists: {file_zip}"))
} else {
message(glue("Downloading: {file_zip}"))
download.file(url, dest_file, mode = "wb")
}
# skip if already unzipped
if (unzip && dir_exists(dir_unzip)) {
message(glue("Already unzipped: {file_zip}"))
  } else if (unzip) {
    # only unzip final and preliminary zips
    if (zip_type %in% c("final", "preliminary")) {
      message(glue("Unzipping: {file_zip}"))
      dir_create(dir_unzip, recurse = TRUE)
      unzip(dest_file, exdir = dir_unzip)
    } else {
      message(glue("Skipping unzip (not final/preliminary): {file_zip}"))
    }
}
}
# download all files, but only unzip final and preliminary
d_zips |>
pwalk(function(url, file_zip, zip_type, ...) {
download_and_unzip(url, dir_dl, zip_type, unzip = TRUE)
})
message("Download and extraction complete!")
```
## Find and Prioritize CTD Data Files
```{r}
#| label: d_csv
# create file inventory
d_csv <- tibble(
path = dir_ls(dir_dl, recurse = TRUE, regexp = "\\.csv$")
) |>
mutate(
file_csv = basename(path),
path_unzip = str_replace(path, glue("{dir_dl}/"), ""),
# get first directory in path_unzip
dir_unzip = str_extract(path_unzip, "^[^/]+"),
cruise_id = str_extract(
path_unzip,
"\\d{2}-(\\d{4}[A-Z0-9]{2,4})_.*",
group = 1
),
# classify data type from parent directory
data_stage = case_when(
str_detect(path_unzip, "Final.*db[_|-]csv") ~ "final",
str_detect(path_unzip, "Prelim.*db[_|-]csv") ~ "preliminary",
# TODO: note these strange exceptions
cruise_id == "2507SR" &
str_detect(path_unzip, "Prelim.*csv") ~ "preliminary",
cruise_id == "2111SR" &
str_detect(
path_unzip,
"Prelim.*csvs-plots.*2111SR.*csv"
) ~ "preliminary",
.default = NA_character_
),
# detect cast direction from file_csv (U=upcast, D=downcast)
cast_dir = case_when(
str_detect(file_csv, regex("U\\.csv$", ignore_case = T)) ~ "U",
str_detect(file_csv, regex("D\\.csv$", ignore_case = T)) ~ "D"
),
# prioritize: final=1, preliminary=2
priority = case_when(
data_stage == "final" ~ 1,
data_stage == "preliminary" ~ 2,
TRUE ~ 3
)
) |>
relocate(cruise_id, path_unzip) |>
arrange(cruise_id, path_unzip) |>
# keep only final and preliminary files
filter(data_stage %in% c("final", "preliminary"))
# for each cruise, keep only final if available, otherwise preliminary
d_priority <- d_csv |>
group_by(cruise_id) |>
summarize(
best_priority = min(priority),
.groups = "drop"
)
d_csv <- d_csv |>
inner_join(d_priority, by = "cruise_id") |>
filter(priority == best_priority) |>
select(-best_priority)
cruises_csv_notzip <- setdiff(
unique(d_csv$cruise_id),
unique(d_zips$cruise_id)
) |>
sort()
stopifnot(length(cruises_csv_notzip) == 0)
# TODO: cruises_zip_notcsv: find workable form of raw data
cruises_zip_notcsv <- setdiff(
unique(d_zips$cruise_id),
unique(d_csv$cruise_id)
) |>
sort()
# [1] "0001NH" "0004JD" "0007NH" "0010NH" "0101JD" "0104JD" "0107NH" "0110NH" "0201JD" "0204JD" "0207NH"
# [12] "0211NH" "0304RR" "0404NH" "1203NH" "9003JD" "9011NH" "9101JD" "9103JD" "9108JD" "9110NH" "9202JD"
# [23] "9204JD" "9207NH" "9210NH" "9301JD" "9304JD" "9308NH" "9310NH" "9401JD" "9403JD" "9408NH" "9410NH"
# [34] "9501JD" "9504NH" "9507JD" "9510NH" "9602JD" "9604JD" "9608NH" "9610RR" "9702JD" "9704NH" "9707JD"
# [45] "9709NH" "9712SP" "9803SP" "9805SP" "9808SP" "9809NH" "9810SP" "9811SP" "9812SP" "9901RR" "9904JD"
# [56] "9908NH" "9910NH"
# TODO: plot(ctdDecimate(ctdTrim(read.ctd("stn123.cnv")))) per [oce](https://dankelley.github.io/oce/articles/B_ctd.html#raw-data)
# librarian::shelf(oce, quiet = T)
# dir_raw <- "/Users/bbest/Library/CloudStorage/GoogleDrive-ben@ecoquants.com/My Drive/projects/calcofi/data/calcofi.org/ctd-cast/download/19-9003JD_CTDTest"
# raw_9003JD_dat <- '/STA3-2.DAT'
# d_ctd <- read.ctd(dir_ls(dir_raw)) |> plot(ctdDecimate(ctdTrim(.)))
# if (length(missing_cruises) > 0) {
# warning(glue("Warning: Missing cruises with no final/preliminary files: {paste(missing_cruises, collapse = ', ')}"))
# # 0404JD
# }
# summary
d_csv |>
select(-any_of(c("path", "data", "col_empties", "col_types"))) |>
relocate(cruise_id, path_unzip) |>
arrange(cruise_id, path_unzip) |>
dt(
caption = "Files to Ingest",
fname = "ctd_files_to_ingest"
)
```
## Read and Standardize CTD Files
- Remove repeated header rows that can occur within files, e.g., line 69 of '20-0711NH_CTDFinalQC/db-csvs/20-0711_CTDBTL_001-067Dno001b.csv'
```{r}
#| label: read_csv
# read and combine all files (both upcast and downcast)
d_csv <- d_csv |>
arrange(basename(path)) |>
mutate(
data = map2(path, seq_along(path), \(path, idx) {
message(glue("Reading {idx}/{nrow(d_csv)}: {basename(path)}"))
# detect repeat header rows and skip them
# eg line 69 of '20-0711NH_CTDFinalQC/db-csvs/20-0711_CTDBTL_001-067Dno001b.csv'
# Read all lines to detect repeat headers
all_lines <- read_lines(path)
header_line <- all_lines[1]
# Find all rows that exactly match the header (excluding the first row)
repeat_header_rows <- which(all_lines[-1] == header_line)
if (length(repeat_header_rows) > 0) {
# Remove repeat header rows from the data
all_lines <- all_lines[-(repeat_header_rows + 1)]
# Write cleaned data to temporary file
tmp_file <- tempfile(fileext = ".csv")
write_lines(all_lines, tmp_file)
# Read from cleaned file
data <- read_csv(tmp_file, guess_max = Inf, show_col_types = F) |>
clean_names()
# Clean up
file_delete(tmp_file)
} else {
# No repeat headers, read normally
data <- read_csv(path, guess_max = Inf, show_col_types = F) |>
clean_names()
}
data
}),
# add nrows per file to ctd_data
nrows = map_int(data, ~ if (is.null(.x)) 0 else nrow(.x))
)
d_csv |>
arrange(cruise_id, file_csv) |>
select(cruise_id, file_csv, nrows) |>
dt(
caption = "Number of rows read per file",
fname = "ctd_rows_per_file"
) |>
# big marks for thousands
formatCurrency("nrows", currency = "", digits = 0, mark = ",")
```
## Transform Data: Detect and Correct Column Type Mismatches
```{r}
#| label: d_bind
# Detect expected types from the most common type across all files
d_csv <- d_csv |>
mutate(
# Identify completely empty columns in each data frame
col_empties = map(data, \(x) {
tibble(
col_name = names(x),
n_empty = map_int(x, \(col) sum(is.na(col)))
) |>
filter(n_empty == nrow(x)) |>
pull(col_name)
}),
# Extract column types from each data frame
col_types = map2(data, col_empties, \(x, y) {
tibble(
col_name = names(x),
col_type = map_chr(x, \(col) class(col)[1])
) |>
filter(!col_name %in% y)
})
)
# Find the most common type for each column across all files
d_types <- d_csv |>
select(path, col_types) |>
unnest(col_types) |>
count(col_name, col_type) |>
group_by(col_name) |>
slice_max(n, n = 1, with_ties = FALSE) |>
ungroup() |>
select(col_name, expected_type = col_type)
# Identify files with type mismatches
d_mismatches <- d_csv |>
select(cruise_id, path, col_types) |>
unnest(col_types) |>
left_join(d_types, by = "col_name") |>
filter(col_type != expected_type) |>
arrange(col_name, path)
# Log mismatches if any exist
if (nrow(d_mismatches) > 0) {
message("Type mismatches detected - converting columns...")
}
# Bind data, converting mismatched columns to expected type
d_bind <- d_csv |>
mutate(
data = map2(data, path, \(x, p) {
# Get columns that need type conversion for this file
x_mismatches <- d_mismatches |>
filter(path == p)
if (nrow(x_mismatches) > 0) {
for (i in 1:nrow(x_mismatches)) {
col <- x_mismatches$col_name[i]
expected <- x_mismatches$expected_type[i]
# Count NAs before conversion
na_before <- sum(is.na(x[[col]]))
# Convert to expected type
suppressWarnings({
x[[col]] <- switch(
expected,
"numeric" = as.numeric(x[[col]]),
"integer" = as.integer(x[[col]]),
"logical" = as.logical(x[[col]]),
"character" = as.character(x[[col]]),
x[[col]]
)
})
# Count NAs after conversion
na_after <- sum(is.na(x[[col]]))
na_generated <- na_after - na_before
if (na_generated > 0) {
message(glue(
" {basename(p)}: {col} ({x_mismatches$col_type[i]} → {expected}) generated {na_generated} NAs"
))
}
# Store NA count in d_mismatches for reporting
d_mismatches[
d_mismatches$path == p & d_mismatches$col_name == col,
"nas_generated"
] <<- na_generated
}
}
x
})
) |>
unnest(data)
# Report on type mismatches with NA generation
if (nrow(d_mismatches) > 0) {
# Summary by column
d_mismatches |>
group_by(col_name, expected_type, col_type) |>
summarize(
n_files = n(),
total_nas = sum(nas_generated, na.rm = TRUE),
files = paste(basename(path), collapse = "; "),
.groups = "drop"
) |>
arrange(desc(total_nas)) |>
dt(
caption = "Type mismatches by column",
fname = "ctd_type_mismatches_by_column"
) |>
formatCurrency(
c("n_files", "total_nas"),
currency = "",
digits = 0,
mark = ","
)
# Summary by file
d_mismatches |>
mutate(
path_csv = basename(path),
col_expr = glue(
"{col_name}: {col_type} → {expected_type} ({nas_generated} NAs)<br>"
)
) |>
arrange(path_csv, col_name) |>
group_by(path_csv) |>
summarize(
n_columns = n(),
total_nas = sum(nas_generated, na.rm = TRUE),
columns = paste(col_expr, collapse = "\n"),
.groups = "drop"
) |>
arrange(desc(total_nas), path_csv) |>
dt(
caption = "Type mismatches by file",
fname = "ctd_type_mismatches_by_file",
escape = FALSE
) |>
formatCurrency(
c("n_columns", "total_nas"),
currency = "",
digits = 0,
mark = ","
)
}
# TODO: drop study / replace with cruise_id
# d_bind$study |> table(useNA = "ifany")
# d_bind$cruise_id |> table(useNA = "ifany") # NAs: 10,9782
d_bind |>
filter(
!is.na(study),
!is.na(cruise_id),
cruise_id != study
) |>
distinct(study, cruise_id, path_unzip) |>
dt(
caption = "Mismatch between study and cruise_id",
fname = "ctd_study_cruise_mismatch"
)
# study cruise_id path_unzip
# <chr> <chr> <chr>
# 1 0404JD 0404NHJD 20-0404NHJD_CTDFinalQC/db_csvs/20-0404JD_CTDBTL_067-094ID.csv
# 2 0404JD 0404NHJD 20-0404NHJD_CTDFinalQC/db_csvs/20-0404JD_CTDBTL_067-094IU.csv
# 3 0404NH 0404NHJD 20-0404NHJD_CTDFinalQC/db_csvs/20-0404NH_CTDBTL_001-066ID.csv
# 4 0404NH 0404NHJD 20-0404NHJD_CTDFinalQC/db_csvs/20-0404NH_CTDBTL_001-066IU.csv
# 5 Study 0711NH 20-0711NH_CTDFinalQC/db-csvs/20-0711_CTDBTL_001-067Dno001b.csv
#
# update cruise_id to study where mismatched, except if study is "Study"
d_bind <- d_bind |>
mutate(
cruise_id = if_else(
!is.na(study) &
cruise_id != study &
study != "Study",
study,
cruise_id
),
source_file = path_unzip
) |>
relocate(source_file, .after = cruise_id) |>
select(
-path_unzip,
-path,
-file_csv,
-dir_unzip,
-priority,
-project,
-study,
-nrows,
-col_empties,
-col_types
)
# d_bind |>
# filter(is.na(ord_occ)) |>
# pull(cast_id) |>
# table()
# 0707_037d 0707_037u 0707_072d 0707_072u 0907_026d 0907_026u 0907_027d 0907_027u
# 1024 1024 460 460 518 518 277 277
# update ord_occ from cast_id where missing
d_bind <- d_bind |>
mutate(
ord_occ = if_else(
is.na(ord_occ) & !is.na(cast_id),
str_extract(cast_id, "_(\\d{3})", group = 1),
ord_occ
)
)
# is.na(d_bind$event_num) |> table()
# FALSE TRUE
# 3,927,815 2,881,340
# TODO: figure out missing ord_occ
```
## Format Date-Time Column
```{r}
#| label: format_datetime
# apply formats depending on pattern
d_bind <- d_bind |>
mutate(
    date_time_utc = str_trim(date_time_utc),
date_time_format = case_when(
str_detect(
date_time_utc, # eg "25-Jan-1998 13:01:20"
"^\\d{1,2}-[A-z]{3}-\\d{4} \\d{1,2}:\\d{1,2}:\\d{1,2}$"
) ~ "dmy_hms",
str_detect(
date_time_utc, # eg "01/25/1998 13:01", "1/25/1998 13:1"
"^\\d{1,2}/\\d{1,2}/\\d{4} \\d{1,2}:\\d{1,2}$"
) ~ "mdy_hm",
TRUE ~ "unknown"
),
    dtime_utc = as.POSIXct(NA, tz = "UTC"), # initialize as POSIXct; 163,181 | 6,596,130 failed to parse.
    # TODO: add summary of datetime formats by dataset
dtime_utc = if_else(
str_detect(
date_time_utc, # eg "25-Jan-1998 13:01:20"
"^\\d{1,2}-[A-z]{3}-\\d{4} \\d{1,2}:\\d{1,2}:\\d{1,2}$"
),
suppressWarnings(dmy_hms(date_time_utc, tz = "UTC")),
dtime_utc
),
dtime_utc = if_else(
str_detect(
date_time_utc, # eg "01/25/1998 13:01", "1/25/1998 13:1"
"^\\d{1,2}/\\d{1,2}/\\d{4} \\d{1,2}:\\d{1,2}$"
),
suppressWarnings(mdy_hm(date_time_utc, tz = "UTC")),
dtime_utc
)
) |>
relocate(date_time_format, dtime_utc, .after = date_time_utc) |>
# select(-date_time_utc, -date_time_pst) |>
arrange(dtime_utc, depth)
stopifnot(all(!is.na(d_bind$dtime_utc)))
write_rds(d_bind, bind_prelonlat_rds, compress = "gz")
file_info(bind_prelonlat_rds) |> pull(size) # 657M
d_bind |>
group_by(
cruise_id,
source_file,
date_time_format,
) |>
summarize(
n_records = n(),
dtime_utc_min = min(dtime_utc),
dtime_utc_max = max(dtime_utc),
.groups = "drop"
) |>
arrange(cruise_id, source_file, date_time_format) |>
# pivot date_time_format wider
pivot_wider(
names_from = date_time_format,
values_from = n_records,
values_fill = NA_real_
) |>
relocate(dmy_hms, mdy_hm, .after = cruise_id) |>
dt(
caption = "Date-Time formats detected by cruise_id and source_file",
fname = "ctd_datetime_formats"
) |>
formatDate(c("dtime_utc_min", "dtime_utc_max"), method = "toLocaleString")
```
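As a sanity check on the two date-time patterns handled above, the corresponding lubridate parsers applied to the example strings from the comments (illustrative, not part of the pipeline):

```r
library(lubridate)

dmy_hms("25-Jan-1998 13:01:20", tz = "UTC") # "dmy_hms" pattern -> 1998-01-25 13:01:20 UTC
mdy_hm("1/25/1998 13:1", tz = "UTC")        # "mdy_hm" pattern  -> 1998-01-25 13:01:00 UTC
```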
## Assign Pseudo-NA values to Longitude and Latitude and Fill Missing Coordinates from Line/Station
```{r}
#| label: fill_lonlat
d_bind <- read_rds(bind_prelonlat_rds)
# summary(d_bind_cr$lon_dec)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -124.3 -122.5 -120.9 -114.1 -119.0 0.0
# d_bind_cr |>
# filter(lon_dec > -100) |>
# pull(lon_dec) |>
# unique() |>
# print() # -9.99e-29
# d_bind_cr |>
# filter(lon_dec == -99) |>
# select(cruise_id, source_file, line, sta, lon_dec, lat_dec)
pseudoNA_values <- c(-9.99e-29, -99)
d_bind_sp <- d_bind |>
mutate(
is_lnsta_pseudoNA = ifelse(
as.numeric(line) == 0 & as.numeric(sta) == 0,
TRUE,
FALSE
),
is_lon_pseudoNA = map_lgl(lon_dec, ~ some(pseudoNA_values, near, .x)),
is_lat_pseudoNA = map_lgl(lat_dec, ~ some(pseudoNA_values, near, .x)),
lon_dec = if_else(
is_lon_pseudoNA | is_lat_pseudoNA,
NA_real_,
lon_dec
),
lat_dec = if_else(
is_lon_pseudoNA | is_lat_pseudoNA,
NA_real_,
lat_dec
),
lon_lnst = if_else(
!is.na(line) & !is.na(sta) & !is_lnsta_pseudoNA,
as.numeric(sf_project(
from = "+proj=calcofi",
to = "+proj=longlat +datum=WGS84",
pts = cbind(x = as.numeric(line), y = as.numeric(sta))
)[, 1]),
NA_real_
),
lat_lnst = if_else(
!is.na(line) & !is.na(sta) & !is_lnsta_pseudoNA,
as.numeric(sf_project(
from = "+proj=calcofi",
to = "+proj=longlat +datum=WGS84",
pts = cbind(x = as.numeric(line), y = as.numeric(sta))
)[, 2]),
NA_real_
),
lon_dec = if_else(is.na(lon_dec), lon_lnst, lon_dec),
lat_dec = if_else(is.na(lat_dec), lat_lnst, lat_dec)
)
stopifnot(sum(is.na(d_bind_sp$lon_dec)) == 0)
stopifnot(sum(is.na(d_bind_sp$lat_dec)) == 0)
d_bind_sp |>
filter(
is_lon_pseudoNA | is_lat_pseudoNA
) |>
group_by(cruise_id) |>
summarize(
n_lon_pseudoNA = sum(is_lon_pseudoNA),
n_lat_pseudoNA = sum(is_lat_pseudoNA)
) |>
dt(
caption = glue(
"Cruises with pseudo-NAs ({paste(pseudoNA_values, collapse = ',')}) explicitly set to NA before applying longitude and latitude converted from line, station coordinates (except where line & station are 0, which is also treated as a pseudo-NA)."
),
fname = "ctd_cruises_pseudoNA_lonlat"
) |>
formatCurrency(
c("n_lon_pseudoNA", "n_lat_pseudoNA"),
currency = "",
digits = 0,
mark = ","
)
```
## View Cruise with Excess Distance Points
```{r}
#| label: pts_distance_from_lnst
max_dist_dec_lnst_km <- 10
pts <- d_bind_sp |>
st_as_sf(coords = c("lon_dec", "lat_dec"), remove = F, crs = 4326) |>
mutate(
geom_lnst = purrr::map2(lon_lnst, lat_lnst, \(x, y) st_point(c(x, y))) |>
st_sfc(crs = 4326),
dist_dec_lnst_km = st_distance(geometry, geom_lnst, by_element = T) |>
units::set_units(km) |>
units::drop_units(),
is_dist_dec_lnst_within_max = dist_dec_lnst_km <= max_dist_dec_lnst_km,
) |>
st_join(
calcofi4r::cc_grid |>
select(grid_site = sta_key),
join = st_intersects
)
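# declare attributes spatially constant to avoid warnings in later geometric operations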
st_agr(pts) <- "constant"
# visualize a cruise with points exceeding max distance
badcr_id <- "1507OC"
pts_badcr <- pts |>
filter(cruise_id == badcr_id)
bb_badcr <- st_bbox(pts_badcr) |>
st_as_sfc()
mapView(bb_badcr) +
mapView(
pts_badcr |>
filter(is_dist_dec_lnst_within_max) |>
      slice_sample(n = 1000) |> # sample up to 1,000 of the within-cutoff points
bind_rows(
pts_badcr |>
filter(!is_dist_dec_lnst_within_max)
) |>
select(
cruise_id,
dtime_utc,
lon_dec,
lat_dec,
sta_id,
line,
sta,
dist_dec_lnst_km,
is_dist_dec_lnst_within_max
),
layer.name = glue(
"Cruise {badcr_id}<br>
distance (km) <br>
lon/lat to line/station"
),
# zcol = "cruise_id",
zcol = "dist_dec_lnst_km",
cex = 5,
alpha = 0.5
)
```
## Filter Points by Distance from Line/Station Locations
```{r}
#| label: filter_pts
# generate a table by cruise_id for n_all, n_within_max, n_exceeding_max, and pct_exceeding_max
d_pts_cruise_filt_smry <- pts |>
st_drop_geometry() |>
group_by(cruise_id) |>
# filter groups without any points exceeding max distance
filter(
any(!is_dist_dec_lnst_within_max)
) |>
summarize(
n_all = n(),
n_outside_grid = sum(is.na(grid_site)),
pct_outside_grid = n_outside_grid / n_all,
n_gt_cutoff = sum(!is_dist_dec_lnst_within_max, na.rm = T),
avg_dist_gt_cutoff_km = if_else(
n_gt_cutoff > 0,
mean(dist_dec_lnst_km[!is_dist_dec_lnst_within_max], na.rm = T),
NA_real_
),
# min_dist_gt_cutoff_km = if_else(
# n_gt_cutoff > 0,
# min(dist_dec_lnst_km[!is_dist_dec_lnst_within_max], na.rm = T),
# NA_real_
# ),
# max_dist_gt_cutoff_km = if_else(
# n_gt_cutoff > 0,
# max(dist_dec_lnst_km[!is_dist_dec_lnst_within_max], na.rm = T),
# NA_real_
# ),
pct_gt_cutoff = n_gt_cutoff / n_all,
n_rm = sum(!is_dist_dec_lnst_within_max | is.na(grid_site), na.rm = T),
pct_rm = n_rm / n_all,
.groups = "drop"
) |>
filter(n_rm > 0) |>
arrange(desc(pct_rm))
d_pts_cruise_filt_smry |> head() |> names()
# OLD: swap points exceeding max distance ----
# pts_swap <- pts |>
# # for points exceeding max distance, swap lon/lat with line/station lon/lat and update geometry to geom_lnst
# mutate(
# lon_dec = if_else(
# !is_dist_dec_lnst_within_max,
# lon_lnst,
# lon_dec
# ),
# lat_dec = if_else(
# !is_dist_dec_lnst_within_max,
# lat_lnst,
# lat_dec
# ),
# geometry = if_else(
# !is_dist_dec_lnst_within_max,
# geom_lnst,
# geometry
# )
# )
# filter points ----
pts_filt <- pts |>
filter(
is_dist_dec_lnst_within_max,
!is.na(grid_site)
)
# write_rds(pts_filt, pts_filt_rds, compress = "gz")
# file_info(pts_filt_rds) |> pull(size) # 684M
# show cruises that had points exceeding max distance
d_pts_cruise_filt_smry |>
dt(
caption = HTML(mark(
text = glue(
"Cruises with rows filtered out that fell outside the CalCOFI grid (`n_outside_grid`) given by [`calcofi4r::cc_grid`](https://calcofi.io/calcofi4r/articles/calcofi4r.html#grid-by-zone) or exceeded a {max_dist_dec_lnst_km} km cutoff from line/station coordinates (`n_gt_cutoff`), sorted by percentage of rows removed (`pct_rm`)."
)
)),
escape = F,
fname = "ctd_cruises_distance_from_lnst"
) |>
formatCurrency(
c(
"n_all",
"n_gt_cutoff",
"avg_dist_gt_cutoff_km",
"n_outside_grid",
"n_rm"
# "min_dist_gt_cutoff_km",
# "max_dist_gt_cutoff_km",
),
currency = "",
digits = 0,
mark = ","
) |>
formatPercentage(
c(
"pct_gt_cutoff",
"pct_outside_grid",
"pct_rm"
),
digits = 2
)
```
## View Filtered Points
```{r}
#| label: view_pts_filt
# pts_filt <- read_rds(pts_filt_rds)
# st_bbox(pts_swap)
# xmin ymin xmax ymax
# -126.52170 29.80726 -115.67477 50.00641
# st_bbox(pts_filt)
# xmin ymin xmax ymax
# -126.48430 29.80726 -117.27340 37.85147
bb_cr <- st_bbox(pts_filt) |>
st_as_sfc()
mapView(bb_cr) +
mapView(
pts_filt |>
group_by(cruise_id) |>
      slice_sample(n = 100) |> # sample up to 100 points per cruise
ungroup() |>
select(
cruise_id,
lon_dec,
lat_dec,
sta_id,
line,
sta,
dist_dec_lnst_km,
dtime_utc
),
zcol = "cruise_id",
# zcol = "dist_dec_lnst",
cex = 5,
alpha = 0.5
)
```
```{r}
#| label: show_pts_filt_attr
# show first and last 10 rows
pts_filt |>
st_drop_geometry() |>
select(-geom_lnst) |>
slice(c(1:10, (n() - 9):n())) |>
dt(
caption = glue(
"First and last 10 rows of filter CTD points data (nrows: {format(nrow(pts_filt), big.mark = ',')})"
),
fname = "ctd_filtered_points_sample20"
)
```
### Check for duplicates by datetime and depth
```{r}
#| label: check-dupes
# TODO: look for duplicates by dtime_utc, depth
d_dupes <- pts_filt |>
st_drop_geometry() |>
select(dtime_utc, depth) |>
group_by(dtime_utc, depth) |>
summarize(n = n(), .groups = "drop") |>
filter(n > 1) |>
arrange(desc(n)) # 741,840 × 3
# get number of distinct dtime_utc, depth
n_distinct_dtime_depth <- pts_filt |>
st_drop_geometry() |>
select(dtime_utc, depth) |>
n_distinct() # 5,931,948
d_dupes |>
slice(c(1:10, (n() - 9):n())) |>
dt(
caption = glue(
"First and last 10 rows of duplicates by dtime_utc, depth (nrows = {format(nrow(d_dupes), big.mark = ',')}; n_distinct_dtime_depth = {format(n_distinct_dtime_depth, big.mark = ',')})"
),
fname = "ctd_duplicates_by_datetime_depth_n20"
)
```
## Ingest into DuckDB, Write to `ctd.parquet`
```{r}
#| label: ingest-duckdb
# TODO: pts_filt → parquet
# write to parquet
d_pts_filt <- pts_filt |>
select(-geom_lnst) |>
st_drop_geometry()
dbWriteTable(con, "ctd", d_pts_filt, overwrite = T)
# dbListTables(con)
message(glue("Ingested {nrow(d_pts_filt)} rows into DuckDB table 'ctd'"))
# old: 6,813,712; new: 6,727,991
# write point_geom as geometry column
message(glue(
"Adding spatially indexed column point_geom from lon_dec and lat_dec..."
))
dbExecute(con, "ALTER TABLE ctd ADD COLUMN point_geom GEOMETRY;")
dbExecute(con, "UPDATE ctd SET point_geom = ST_Point(lon_dec, lat_dec);")
dbExecute(
con,
"CREATE INDEX idx_ctd_point_geom ON ctd USING RTREE(point_geom);"
)
# TODO: write ctd as parquet
message(glue(
"Writing DuckDB table 'ctd' to Parquet at {ctd_pq}..."
))
dbExecute(
con,
glue::glue(
"COPY ctd TO '{ctd_pq}' (FORMAT 'parquet', COMPRESSION zstd, OVERWRITE_OR_IGNORE);"
)
)
# PARTITION_BY (cruise_id)
# READ: 'orders/*/*/*.parquet', hive_partitioning = true
# TODO: add key-value metadata
# # KV_METADATA {
# number: 'Answer to life, universe, and everything',
# is_even: 'not ''odd''' -- single quotes in values must be escaped
# }
# SELECT * FROM read_parquet('test.parq')
# disconnect
dbDisconnect(con, shutdown = TRUE)
rm(con)
```
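One way to tackle the `PARTITION_BY` TODO above is a Hive-partitioned Parquet export by `cruise_id`. A sketch only, assuming a live connection `con` with the `ctd` table (the output directory name is hypothetical):

```r
# write one Parquet file per cruise_id under a Hive-style directory tree
dbExecute(con, glue(
  "COPY ctd TO '{dir_dl}/ctd_by_cruise'
   (FORMAT parquet, PARTITION_BY (cruise_id), OVERWRITE_OR_IGNORE);"
))

# read back with hive partitioning so cruise_id is recovered from the paths
dbGetQuery(con, glue(
  "SELECT cruise_id, COUNT(*) AS n
   FROM read_parquet('{dir_dl}/ctd_by_cruise/*/*.parquet', hive_partitioning = true)
   GROUP BY cruise_id
   LIMIT 5;"
))
```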
## Check `ctd.parquet` for spatial functionality in DuckDB
```{r}
#| label: check-parquet
# Create new database connection
con <- dbConnect(duckdb::duckdb()) # dbListTables(con)
# Load spatial extension
dbExecute(con, "INSTALL spatial;")
dbExecute(con, "LOAD spatial;")
# Read parquet file into DuckDB
message(glue("Reading parquet file from {ctd_pq}..."))
dbExecute(
con,
glue("CREATE VIEW ctd AS SELECT * FROM read_parquet('{ctd_pq}');")
)
# Verify table structure and row count
row_count <- dbGetQuery(con, "SELECT COUNT(*) as n FROM ctd")$n
message(glue("Reading {row_count} rows from parquet"))
# Check column names and types
cols <- dbGetQuery(con, "DESCRIBE ctd;")
print(cols)
# Test 1: Verify geometry column exists and is valid
message("\n=== Testing geometry column ===")
geom_test <- dbGetQuery(
con,
"
SELECT
COUNT(*) as total_rows,
COUNT(point_geom) as non_null_geoms,
COUNT(CASE WHEN ST_IsValid(point_geom) THEN 1 END) as valid_geoms
FROM ctd;
"
)
print(geom_test)
# Test 2: Get spatial extent for all rows
message("\n=== Getting spatial extent ===")
extent <- dbGetQuery(
con,
"
SELECT
MIN(ST_X(point_geom)) as min_lon,
MAX(ST_X(point_geom)) as max_lon,
MIN(ST_Y(point_geom)) as min_lat,
MAX(ST_Y(point_geom)) as max_lat,
COUNT(*) as total_points
FROM ctd;
"
)
print(extent)
# Test 3: Sample some geometries and extract coordinates
message("\n=== Sampling geometries ===")
sample_geoms <- dbGetQuery(
con,
"
SELECT
ST_X(point_geom) as lon,
ST_Y(point_geom) as lat,
ST_AsText(point_geom) as wkt
FROM ctd
LIMIT 5;
"
)
print(sample_geoms)
# Test 4: Spatial query - find points within a bounding box
# (adjust coordinates to match your data extent)
message("\n=== Testing spatial filter ===")
bbox_query <- dbGetQuery(
con,
"
SELECT COUNT(*) as points_in_bbox
FROM ctd
WHERE ST_Within(
point_geom,
ST_MakeEnvelope(-125.0, 32.0, -117.0, 35.0)
);
"
)
print(bbox_query)
# Test 5: Distance calculation - find points near a location
# (adjust reference point coordinates as needed)
message("\n=== Testing distance calculation ===")
distance_test <- dbGetQuery(
con,
"
SELECT
ST_X(point_geom) as lon,
ST_Y(point_geom) as lat,
ST_Distance_Sphere(
point_geom,
ST_Point(-120.0, 34.0)::GEOMETRY
) as distance_m
FROM ctd
ORDER BY distance_m
LIMIT 5;
"
)
print(distance_test)
# Test 6: Verify spatial index exists (if recreating)
message("\n=== Checking indexes ===")
indexes <- dbGetQuery(
con,
"
SELECT * FROM duckdb_indexes()
WHERE table_name = 'ctd';
"
)
print(indexes)
# Optional: Recreate spatial index if needed
# dbExecute(con, "DROP INDEX IF EXISTS idx_ctd_point_geom;")
# dbExecute(con, "CREATE INDEX idx_ctd_point_geom ON ctd USING RTREE(point_geom);")
# Test 7: Read into R as sf object
message("\n=== Reading as sf object ===")
sf_ctd <- st_read(
con,
query = "SELECT * EXCLUDE (point_geom), ST_AsWKB(point_geom) AS geom FROM ctd USING SAMPLE 50;",
geometry_column = "geom"
) |>
st_set_crs(4326)
mapView(sf_ctd)
print(sf_ctd)
# Cleanup
# dbDisconnect(con, shutdown = TRUE)
```
## Interactive Gantt Chart
```{r}
#| label: gantt-chart
# reconnect to database
# con <- dbConnect(duckdb::duckdb(), dbdir = db_path, read_only = T)
# tbl(con, "ctd") |>
# summarize(n = n())
# 210,507
# tbl(con, "ctd") |>
# filter(study == "2301RL") |>
# distinct(study) |>
# pull(study) |>
# sort()
# 20-2401RL_CTDBTL_017-033U.csv
# tbl(con, "ctd") |>
# pull(study) |>
# table(useNA = "ifany")
# 9802JD 9804JD
# 59,005 151,502
# get cruise date spans by cruise and stage
cruise_spans <- tbl(con, "ctd") |>
# rename(data_stage = data_type) |> # DEBUG
group_by(cruise_id, data_stage) |>
summarize(
beg_date = min(dtime_utc, na.rm = TRUE) |> as.Date(),
end_date = max(dtime_utc, na.rm = TRUE) |> as.Date(),
.groups = "drop"
) |>
collect() |>
filter(!is.na(beg_date), !is.na(end_date))
# Split multi-year cruises into separate rows
cruise_spans <- cruise_spans |>
rowwise() |>
mutate(
years_spanned = list(year(beg_date):year(end_date))
) |>
unnest(years_spanned) |>
mutate(
# Adjust dates for each year segment
year = years_spanned,
adj_beg_date = if_else(
years_spanned == year(beg_date),
beg_date,
as.Date(paste0(years_spanned, "-01-01"))
),
adj_end_date = if_else(
years_spanned == year(end_date),
end_date,
as.Date(paste0(years_spanned, "-12-31"))
),
begin_jday = yday(adj_beg_date),
end_jday = yday(adj_end_date),
stage_label = data_stage,
hover_text = glue(
"Cruise: {cruise_id}<br>",
"Stage: {data_stage}<br>",
"Begin: {format(beg_date, '%Y-%m-%d')}<br>",
"End: {format(end_date, '%Y-%m-%d')}<br>",
"Duration: {as.numeric(difftime(end_date, beg_date, units = 'days'))} days"
)
) |>
ungroup() |>
arrange(adj_beg_date)
# create color mapping
colors <- c("final" = "#23d355ff", "preliminary" = "#A23B72")
# create interactive gantt chart
p <- plot_ly()
# add traces for each cruise segment
for (i in 1:nrow(cruise_spans)) {
row <- cruise_spans[i, ]
p <- p |>
add_trace(
type = "scatter",
mode = "lines",
x = c(row$begin_jday, row$end_jday),
y = c(row$year, row$year),
line = list(
color = colors[row$stage_label],
width = 8
),
text = row$hover_text,
hoverinfo = "text",
showlegend = FALSE,
legendgroup = row$stage_label
)
}
# add legend traces
p <- p |>
add_trace(
type = "scatter",
mode = "lines",
x = c(NA, NA),
y = c(NA, NA),
line = list(color = colors["final"], width = 8),
name = "Final",
showlegend = TRUE
) |>
add_trace(
type = "scatter",
mode = "lines",
x = c(NA, NA),
y = c(NA, NA),
line = list(color = colors["preliminary"], width = 8),
name = "Preliminary",
showlegend = TRUE
)
# customize layout
p <- p |>
layout(
title = "CalCOFI CTD Cruise Timeline",
xaxis = list(
title = "Day of Year",
range = c(1, 365),
dtick = 30,
tickangle = 0
),
yaxis = list(
title = "Year",
autorange = "reversed",
dtick = 1
),
hovermode = "closest",
plot_bgcolor = "#f8f9fa",
paper_bgcolor = "white",
legend = list(
x = 1.02,
y = 1,
xanchor = "left",
yanchor = "top"
)
)
# display chart
p
# disconnect
dbDisconnect(con, shutdown = TRUE)
```
## Notes (old)
### Process raw data into usable form
- [Fathom software \| Sea-Bird Scientific](https://software.seabird.com/)
References:
- [CTD Data \| CalCOFI archives](https://newdata.calcofi.com/index.php/ctd-data)
> **Cast Files**. Files are created by SeaBird's Seasave program during each CalCOFI station. These data are unaltered and may contain errors (e.g., incorrect header information) or be incomplete due to system failure during collection. Two or more files may be available for a single station in the latter case. Each cruise will have a single cast data file. This file is a compressed .zip containing .zip files for each cast of the cruise, yymm_CTDCast.zip and yymm_llllssss\_###.zip respectively. Each station .zip should contain the following: .bl, .con, .hdr, .hex, .mrk, and .prn files.
::: {.callout-note collapse="true"}
#### CalCOFI CTD System
- A Sea-Bird Electronics 911plus V2 CTD collects vertical profile data at every CalCOFI station.
- In addition to being a dual TCO (temperature, conductivity, and oxygen) system, the CTD also interfaces with a transmissometer, fluorometer, PAR/SPAR meters, altimeter, and nitrate and pH sensors.
- The CTD is connected to a shipboard data-acquisition computer through an electrically conductive winch wire; sensor data are collected and displayed in real time using Seasave V7.
- The CTD is normally lowered to a terminal depth of 515 m, bottom-depth permitting, but is routinely deployed within meters of the seafloor at nearshore SCCOOS and basin stations.
- To ensure high-resolution sampling in areas with significant hydrological and biological gradients, a speed of \~30 m/min is used for the first 100 m, then \~60 m/min to depth without stopping.
- During retrieval, the CTD is paused for at least 20 seconds at target bottle depths to adequately flush each 10 liter sample bottle prior to closure. Seawater samples are analyzed onboard (e.g., salinity, oxygen, nutrients, chl) and are used to correct measured CTD values.
:::
- Mahmoud & Mas (2022) [Standard Operating Procedure: CTD data pre-processing](https://repository.oceanbestpractices.org/handle/11329/2081). *Ocean Best Practices*
```
@article{mohamed2022standard,
title={Standard Operating Procedure: CTD data pre-processing.},
author={Mohamed Mahmoud, Mohamed and Munoz Mas, Cristian},
year={2022},
publisher={Institut Mauritanien De Recherches Oc{\'e}anographiques Et De P{\^e}ches (IMROP)}
}
```
### Evaluate different zip files
- 2025-Jul:
- Preliminary CTD 1m-Binned\
[20-2507SR_CTD**Prelim**.zip](https://calcofi.org/downloads/2025/20-2507SR_CTDPrelim.zip)\
**Preliminary CTD 1m-Binned**: Raw CTD sensor data that have gone through SBE Data Processing, and obvious bad sensor readings have been flagged during initial QAQC. Bottle data has not yet been merged with the CTD data. CTD Salinity, Oxygen, Fluorescence, and Nitrate profiles have not been corrected to the bottle data. Cruise corrected Nitrate sensor data (EstNO3_CruiseCorr) has been converted from raw voltage using a sensor calibration done before each cruise, but not yet corrected using CTD vs. Bottle regressions.
- 2021-Jul:
- ~~Prodo Cast~~ Raw CTD Data\
[20-2107SR_CTD**Cast**.zip](https://calcofi.org/downloads/2021/20-2107SR_CTDCast.zip)
- Preliminary CTD & Bottle 1m-Binned\
[20-2107SR_CTD**Prelim**.zip](https://calcofi.org/downloads/2021/20-2107SR_CTDPrelim.zip)\
**Preliminary CTD & Bottle 1m-Binned**: Preliminary CTD sensor data that have been merged with QAQC’ed bottle data. CTD Salinity, Oxygen, Fluorescence, and Nitrate profiles have been corrected using CTD vs. bottle regressions.
- 2021-May:
- Raw CTD Data\
[20-2105SH_CTD**Cast**.zip](https://calcofi.org/downloads/2021/20-2105SH_CTDCast.zip)\
**Raw CTD Data**: Data files (.xmlcon, .hex, .bl, .nav, .prn, .mrk, .hdr) directly from SBE Software ([Sea-Bird Scientific](https://www.seabird.com/)) with no processing or corrections applied.
- Final 1m-Binned\
[20-2105SH_CTD**FinalQC**.zip](https://calcofi.org/downloads/2021/20-2105SH_CTDFinalQC.zip)\
**Final 1m-Binned**: CTD data that have passed through the CalCOFI quality control process. After standard Seasoft-processing (Sea-Bird Scientific’s data-processing suite) tuned for our SBE 911plus CTD, the .asc files are processed by SIO-CalCOFI in-house software (BtlVsCTD). Averaged (4-sec or 1-meter bin-averaged) CTD sensor data are matched with corresponding bottle data, then **cruise-corrected** and **station-corrected**. Zipped file includes: metadata and .asc, .hdr, .btl, and upcast & downcast .csvs (“u” and “d”; by station and concatenated).
- **Cruise-corrected CTD data**: CTD sensor data that are corrected using regression coefficients derived from 4-sec average sensor data vs bottle data comparisons for the entire cruise, meaning all casts with bottle samples (n=\~1400; fliers omitted; column labels ‘\_CruiseCorr’).
- **Station-corrected CTD data**: CTD sensor data that are corrected using regression coefficients generated dynamically for each cast, 1m bin-averaged sensor data vs bottle data (n=\~20; fliers omitted; column labels ‘\_StaCorr’).
- Since sensor behavior may vary from station to station, station-corrected CTD sensor data for salinity, oxygen, estimated nitrate, and estimated chlorophyll-a are considered the best, particularly for estimated nitrate.
- 1992-Feb:
- Prodo Cast\
[19-9202JD_CTD**CastProdoStas**.zip](https://calcofi.org/downloads/1992/19-9202JD_CTDCastProdoStas.zip)\
**Test Cast/Prodo Cast**: The Seabird 911plus CTD was first tested in Mar 1990 (9003JD) but is still under development. CTD data from Mar 1990 – 1991 were occasional system & feasibility tests (“**test casts**“). As confidence in data quality improved, the CTD-Rosette began being used for daily primary productivity sample collection casts (“**prodo casts**“) in 1992 (6 depths from one station per day).
- 1990-Mar:
- Test Cast\
[19-9003JD_CTD**Test**.zip](https://calcofi.org/downloads/1990/19-9003JD_CTDTest.zip)