Publish Larvae to OBIS

Published

2026-01-09

This workflow converts CalCOFI ichthyoplankton (fish eggs and larvae) data to an OBIS-compliant DarwinCore Archive for publication. The archive follows the Event Core structure with Occurrence and ExtendedMeasurementOrFact (eMoF) extensions.

References: CalCOFI Fish Eggs & Larvae | erdCalCOFIlrvcnt | CoastWatch ERDDAP | OBIS Manual | DarwinCore Terms

Code
librarian::shelf(
  CalCOFI / calcofi4db,
  DBI,
  dm,
  dplyr,
  DT,
  EML,
  glue,
  here,
  lubridate,
  purrr,
  readr,
  tibble,
  tidyr,
  uuid,
  xml2,
  quiet = T
)
options(readr.show_col_types = F)

# Dataset metadata
dataset_title <- "CalCOFI Fish Larvae Tows"
dataset_abstract <- "Fish larvae counts and standardized counts for eggs captured in CalCOFI ichthyoplankton nets (primarily vertical [Calvet or Pairovet], oblique [bongo or ring nets], and surface tows [Manta nets]). Surface tows are normally standardized to count per 1,000 m^3 strained. Oblique tows are normally standardized to count per 10 m^2 of surface sampled."
dataset_keywords <- c(
  "atmosphere",
  "biology",
  "biosphere",
  "calcofi",
  "earth science",
  "environment",
  "latitude",
  "longitude",
  "ocean",
  "time"
) # GCMD Science Keywords (source: https://catalog.data.gov/dataset/calcofi-larvae-counts-positive-tows)
# dataset_uuid <- UUIDgenerate() # Or use existing dataset UUID
country_code <- "US"
water_body <- "California Current" # LME: http://marineregions.org/mrgid/8549

dataset_id <- glue(
  "calcofi.io_workflow_larvae_to_obis_{format(Sys.Date(), '%Y-%m-%d')}"
)

sampling_methods_description <- "Standard CalCOFI ichthyoplankton tows. The standard oblique tow uses a bongo net (71 cm diameter, 0.505 mm mesh) or 1-m ring net (prior to 1978) retrieved at a constant wire angle (45 degrees) from 210 m depth to surface. Surface tows use a Manta net; vertical tows use CalVET/Pairovet nets. Flowmeters measure volume filtered. Samples are preserved in 5% formalin. In the lab, fish eggs and larvae are sorted, identified to lowest taxon possible, enumerated, and measured."

study_extent_description <- "The study covers the California Current ecosystem, primarily off Southern California (standard lines 77-93) but historically extending from the border of Canada to the tip of Baja California. The time series for this dataset generally begins in 1951 and continues to the present, with quarterly cruises."

sampling_description <- "Samples are collected at fixed stations along the CalCOFI grid. Oblique tows are standardized to counts per 10 m^2 of sea surface. Surface tows are standardized to counts per 1,000 m^3. Data includes raw counts (tallies) and standardized abundances."

funding_information <- "CalCOFI is a partnership between the NOAA National Marine Fisheries Service (NMFS), Scripps Institution of Oceanography (SIO), and California Department of Fish and Wildlife (CDFW)."

rights <- "The data may be used and redistributed for free but is not intended for legal use, since it may contain inaccuracies. Neither the data Contributor, ERD, NOAA, nor the United States Government, nor any of their employees or contractors, makes any warranty, express or implied, including warranties of merchantability and fitness for a particular purpose, or assumes any legal liability for the accuracy, completeness, or usefulness, of this information."
# source: [erdCalCOFIlrvcnt | CoastWatch ERDDAP](https://coastwatch.pfeg.noaa.gov/erddap/info/erdCalCOFIlrvcnt/index.html)

license <- "To the extent possible under law, the publisher has waived all rights to these data and has dedicated them to the Public Domain (CC0 1.0). Users may copy, modify, distribute and use the work, including for commercial purposes, without restriction."
# source: [erdCalCOFIlrvcnt | CoastWatch ERDDAP](https://coastwatch.pfeg.noaa.gov/erddap/info/erdCalCOFIlrvcnt/index.html)
license_xml <- c(
  "Public Domain (CC0 1.0)" = '<ulink url="http://creativecommons.org/publicdomain/zero/1.0/legalcode"><citetitle>Public Domain (CC0 1.0)</citetitle></ulink>'
)

intellectual_rights <- list(
  para = paste(rights, license, collapse = " ")
)

dir_out <- here("data/darwincore/larvae")

# get database connection
schema <- "dev"
con <- get_db_con(schema)

1 Database Schema

The CalCOFI database organizes ichthyoplankton data hierarchically: cruises contain sites, sites contain tows, and tows contain net samples. Biological observations (eggs, larvae) link to net samples via net_uuid.

Code
# exclude experimental tables
tbls <- dbListTables(con) |>
  sort() |>
  setdiff(
    c("schema_version", "grid", "site_seg", "bottle", "cast")
  )

# learn relations from database and draw
dm <- dm_from_con(con, schema = schema, table_names = tbls, learn_keys = T)
dm_draw(dm, view_type = "all")
Figure 1: Entity relationship diagram (ERD) of the CalCOFI database.
erDiagram
    net {
        string net_uuid PK
        string tow_uuid FK
        string side
        float std_haul_factor
        float vol_sampled_m3
        float prop_sorted
        float smallplankton
        float totalplankton
    }
    
    tow {
        string tow_uuid PK
        string site_uuid FK
        string tow_type_key FK
        int tow_number
        datetime time_start
    }
    
    site {
        string site_uuid PK
        string cruise_uuid FK
        string orderocc
        float longitude
        float latitude
        string line
        string station
    }
    
    cruise {
        string cruise_uuid PK
        string ship_key FK
        string date_ym
    }
    
    ship {
        string ship_key PK
        string ship_name
        string ship_nodc
    }
    
    tow_type {
        string tow_type_key PK
        string description
    }
    
    species {
        string species_id PK
        string scientific_name
        string itis_id
        string worms_id
        string common_name
    }
    
    larvastage {
        string net_uuid FK
        string species_id FK
        string stage
        int tally
    }
    
    egg {
        string net_uuid FK
        string species_id FK
        int tally
    }
    
    eggstage {
        string net_uuid FK
        string species_id FK
        string stage
        int tally
    }
    
    larva {
        string net_uuid FK
        string species_id FK
        int tally
    }
    
    larvasize {
        string net_uuid FK
        string species_id FK
        float length_mm
        int tally
    }

    %% Relationships
    net ||--o{ larvastage : "has"
    net ||--o{ egg : "has"
    net ||--o{ eggstage : "has"
    net ||--o{ larva : "has"
    net ||--o{ larvasize : "has"
    
    species ||--o{ larvastage : "identified_as"
    species ||--o{ egg : "identified_as"
    species ||--o{ eggstage : "identified_as"
    species ||--o{ larva : "identified_as"
    species ||--o{ larvasize : "identified_as"
    
    tow ||--o{ net : "uses"
    site ||--o{ tow : "conducted_at"
    cruise ||--o{ site : "visits"
    ship ||--o{ cruise : "operates"
    tow_type ||--o{ tow : "defines"
Figure 2: Entity relationship diagram (ERD) of the CalCOFI database. Created from image above and prompt to Claude: ‘Generate an ERD mermaid diagram from the image.’

2 Life Stage Reference

Egg development stages follow Moser & Ahlstrom (1985). Larva stages include: yolk sac, preflexion, flexion, postflexion, and transformation.

2.0.1 Stage I

Cell division has not yet begun. In intact eggs the cytoplasm of the single cell appears as a clear hemisphere at one pole, easily dif- ferentiated from the yolk mass which is divided into granules. The cytoplasm may be displaced to other locations around the periphery of the yolk mass, but there is usually some accumulation at one pole, which allows the stage to be identified.

2.0.2 Stage II

This begins with the division of the single cell into two cells or blastomeres. The division is first noticeable when a furrow develops in the middle of the cytoplasmic cap. Small bubble-like structures (probably artifacts) are often visible along the furrow and help iden- tify it. The next cleavage plane is at right angles to the first, and subsequent synchronous divisions in both meridional and latitudinal planes produce a hemispherical mound of cells, termed the blasto- disc. After the 5th or 6th division, the blastodisc has a berry-like appearance, the so-called “mulberry stage,” and with subsequent divisions the blastomeres become increasingly smaller and more difficult to distinguish individually. During a certain phase of the early divisions the blastomeres are about the same size as the yolk granules. Ifthe blastodisc and yolk mass become disrupted during collection or preservation, the blastomeres may become distributed among the yolk granules. They may be distinguished from one another since they have different refractive indices and the blastomeres appear darker when viewed with transmitted light.

2.0.3 Stage III

Ahlstrom (1943) defined this stage in sardine eggs as beginning with the appearance of the segmentation cavity. The segmentation cavity of teleost eggs is the space formed between the blastodisc and the yolk mass during late cleavage. In most anchovy eggs in our collec- tions the blastodisc is shrunken and somewhat cup-shaped and con- sequently the segmentation cavity, which is a delicate structure, is obliterated. We have found it preferable to define the beginning of Stage ill on the basis of the appearance of the blastoderm, i.e., when it has the appearance of tissue rather than of a collection of individual cells. This stage marks the beginning of gastrulation. The margin of the blastodisc becomes slightly thickened and is termed the germ ring. At one region of the germ ring the thickening extends inward to form the embryonic shield, which defines the future axis of the embryo. Gastrulation proceeds by further proliferation and downward movement of cells in the region of the germ ring by a process known as epiboly. Simultaneously, proliferation and inward migration (em- boly) of cells from the margin of the embryonic shield produce the organ-forming cell layers of the primordial embryo. At the end of Stage III the germ ring is one-third down the yolk mass and the bilateral nature of the primordial embryo is apparent.

2.0.4 Stage IV

At the beginning of this stage the germ ring has enclosed one-third of the yolk mass and the embryo is beginning to form along the median region of the embryonic shield. At the end of this stage, defined by the germ ring enveloping two-thirds of the yolk, the head region of the embryo is becoming apparent.

2.0.5 Stage V

This stage begins with the germ ring two-thirds down the yolk and ends with the closure of the blastopore and the complete enclosure of the yolk by the cellular sheath of the embryo. The stage is charac- terized by rapid differentiation resulting in the formation of several somites in the midregion of the embryonic axis, development of the notochord which can be seen from a dorsal viewpoint, and differ- entiation of the optic vesicles from the brain.

2.0.6 Stage VI

This stage begins with the closure of the blastopore and ends when the tail starts to separate from the yolk mass. The embryonic sheath of cells is extremely thin and, in some samples, it may be difficult to determine the point of blastopore closure. In these cases the event can be estimated from the position of the caudal terminus of the embryonic axis, since it grows toward the pole with the edge of the cellular sheath. Initially the caudal region lies flat against the polar region of the yolk, then gradually thickens and becomes more round- ed at the tip until it is clearly separate from the yolk. During this stage the somites are apparent along the entire body axis (except at the caudal portion), the lens primordium appears in the eye, and the regions of the brain begin to differentiate.

2.0.7 Stage VII

At the beginning of this stage the tip of the tail free from the yolk is broadly rounded, then begins to narrow as it elongates. The noto- chord extends almost to the tip and the finfold is just becoming visi- ble. At the end of this stage the length of the free tail is one-half the length of the head. For this purpose, head length is considered the distance from the tip of the snout to the back of the cerebellum (see Fig. 2H). Relative tail length is the criterion for each remain- ing stage.

2.0.8 Stage VIII

This stage begins when the free tail length is l:,reater than one-half the head length and ends when tail length equals head length. The tail becomes pointed during this stage and begins to bend away from the axis of the body, to the right or left side. The curvature of the tail generally increases with development, but is subject to individual variability (Fig. 2). Judgement is required in compensating for curva- ture in estimating relative tail length; however, accuracy and preci- sion increase rapidly with practice.

2.0.9 Stage IX

This stage begins with the tail extending one-quarter the length of the yolk sac and ends when it reaches one-half the yolk sac length. The gut is now apparent along the ventral surface of the tail, and its terminal section passes through the fin fold which is now con- siderably wider than in the previous stage. The pectoral fin buds appear as lateral thickenings as do the otic vesicles.

2.0.10 Stage X

This stage starts when the tail is one-half the length of the yolk sac and ends when it reaches three-quarters of the yolk sac length.

2.0.11 Stage XI

This is the final stage before hatching and is defined by a tail length greater than three-quarters of the length of the yolk sac.

Developmental stages of fish larvae captured in CalCOFI ichthyoplankton nets:

  • YOLK: Yolk-sac larva - newly hatched, still absorbing yolk
  • PREF: Preflexion larva - notochord straight, before flexion
  • FLEX: Flexion larva - notochord flexing upward
  • POST: Postflexion larva - notochord flexion complete
  • TRNS: Transformation stage - transitioning to juvenile

3 Conversion Design Decisions

Key design decisions for converting CalCOFI ichthyoplankton data to DarwinCore:

  1. Event hierarchy: cruise → site → tow → net (following OBIS sampling-event model)
  2. Occurrence consolidation: One occurrence per life stage (egg or larva) per species per net sample, rather than separate occurrences for each developmental stage
  3. ExtendedMeasurementOrFact (eMoF): Stage-specific counts, body lengths, and sample-level measurements linked via occurrenceID or eventID
  4. Vocabulary mappings: S11 for life stages, P06 for units, custom descriptive terms for CalCOFI-specific developmental stages
  5. Sample size: vol_sampled_m3 stored as sampleSizeValue in Event core (not in eMoF)

4 Data Extraction from Database

First, let’s extract all necessary tables from the database:

Code
# Get simple lazy table references
tbl_ship <- tbl(con, "ship")
tbl_cruise <- tbl(con, "cruise")
tbl_site <- tbl(con, "site")
tbl_tow <- tbl(con, "tow")
tbl_tow_type <- tbl(con, "tow_type")
tbl_net <- tbl(con, "net")
tbl_species <- tbl(con, "species")
# biological observation tables
tbl_egg <- tbl(con, "egg")
tbl_egg_stage <- tbl(con, "egg_stage")
tbl_larva <- tbl(con, "larva")
tbl_larva_stage <- tbl(con, "larva_stage")
tbl_larva_size <- tbl(con, "larva_size")

5 Create Life Stage Vocabulary Lookup Tables

Define mappings for egg and larva developmental stages to descriptive terms:

Code
# Egg stage vocabulary mapping (Moser & Ahlstrom, 1985)
egg_stage_vocab <- tibble(
  stage = 1:15,
  stage_description = c(
    "egg, stage 1 of 11 (Moser & Ahlstrom, 1985)",
    "egg, stage 2 of 11 (Moser & Ahlstrom, 1985)",
    "egg, stage 3 of 11 (Moser & Ahlstrom, 1985)",
    "egg, stage 4 of 11 (Moser & Ahlstrom, 1985)",
    "egg, stage 5 of 11 (Moser & Ahlstrom, 1985)",
    "egg, stage 6 of 11 (Moser & Ahlstrom, 1985)",
    "egg, stage 7 of 11 (Moser & Ahlstrom, 1985)",
    "egg, stage 8 of 11 (Moser & Ahlstrom, 1985)",
    "egg, stage 9 of 11 (Moser & Ahlstrom, 1985)",
    "egg, stage 10 of 11 (Moser & Ahlstrom, 1985)",
    "egg, stage 11 of 11 (Moser & Ahlstrom, 1985)",
    "egg, stage 12 of 11 (Moser & Ahlstrom, 1985)", # TODO: stage 12 to 15 of 11 remove?
    "egg, stage 13 of 11 (Moser & Ahlstrom, 1985)",
    "egg, stage 14 of 11 (Moser & Ahlstrom, 1985)",
    "egg, stage 15 of 11 (Moser & Ahlstrom, 1985)"
  )
)

# Larva stage vocabulary mapping
larva_stage_vocab <- tibble(
  stage = c("YOLK", "PREF", "FLEX", "POST", "TRNS"),
  stage_description = c(
    "larva, yolk sac",
    "larva, preflexion",
    "larva, flexion",
    "larva, postflexion",
    "larva, transformation"
  )
)

# P06 Measurement Unit vocabulary (NERC)
measurement_unit_vocab <- tibble(
  measurementUnit = c(
    "individuals",
    "millimeters",
    "dimensionless",
    "cubic meters",
    "grams",
    "proportion",
    "m^3"
  ),
  measurementUnitID = c(
    "http://vocab.nerc.ac.uk/collection/P06/current/UUUU/",
    "http://vocab.nerc.ac.uk/collection/P06/current/UXMM/",
    "http://vocab.nerc.ac.uk/collection/P06/current/UUUU/",
    "http://vocab.nerc.ac.uk/collection/P06/current/MCUB/",
    "http://vocab.nerc.ac.uk/collection/P06/current/UGRM/",
    "http://vocab.nerc.ac.uk/collection/P06/current/UUUU/", # or UPCT?
    "http://vocab.nerc.ac.uk/collection/P06/current/MCUB/"
  )
)

6 Create Event Hierarchy

DarwinCore uses an event-based model. We’ll create a hierarchical event structure:

  • cruise
    • site
      • tow
        • net
graph TD
    A[Cruise Event] -->|parentEventID| B[Site Event]
    B -->|parentEventID| C[Tow Event]
    C -->|parentEventID| D[Net Sample Event]
    D -->|eventID| E[Occurrence Records]
    D -->|eventID| F[MeasurementOrFact - Environmental]
    E -->|occurrenceID| G[MeasurementOrFact - Size]
    
    A:::cruise
    B:::site
    C:::tow
    D:::net
    E:::occurrence
    F:::measurement
    G:::measurement
    
    classDef cruise fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    classDef site fill:#fff4e1,stroke:#ff9900,stroke-width:2px
    classDef tow fill:#ffe1f5,stroke:#cc0066,stroke-width:2px
    classDef net fill:#e1ffe1,stroke:#00cc66,stroke-width:2px
    classDef occurrence fill:#f0e1ff,stroke:#6600cc,stroke-width:2px
    classDef measurement fill:#ffe1e1,stroke:#cc0000,stroke-width:2px
    
    subgraph Event Core
        A
        B
        C
        D
    end
    
    subgraph Extensions
        E
        F
        G
    end
Figure 3: Event hierarchy for CalCOFI larvae data in DarwinCore Archive format with Extended Measurement or Fact.

Per 3.3.2. Simple nested datasets with Project-level information – Guide for publishing biological survey and monitoring data to GBIF

Code
# Build complete event hierarchy
q_events <- tbl_net |>
  left_join(tbl_tow, by = "tow_uuid") |>
  left_join(tbl_tow_type, by = "tow_type_key") |>
  left_join(tbl_site, by = "site_uuid") |>
  left_join(tbl_cruise, by = "cruise_uuid") |>
  left_join(tbl_ship, by = "ship_key")

# Create event core at different levels ----

flds_to_iso8601 <- function(flds, output = "datetime") {
  # helper function to convert date/time fields to ISO 8601 format
  # per https://dwc.tdwg.org/terms/ eventDate: add 2007-03-01T13:00:00Z/2008-05-11T15:30:00Z (some time within the interval beginning 1 March 2007 at 1pm UTC and before 11 May 2008 at 3:30pm UTC)

  stopifnot(length(flds) %in% c(1, 2))
  stopifnot(output %in% c("datetime", "time"))

  if (length(flds) == 1 & output == "datetime") {
    s <- glue::glue(
      "to_char({flds}, 'YYYY-MM-DD\"T\"HH24:MI:SS\"Z\"')"
    )
  }
  if (length(flds) == 2 & output == "datetime") {
    s <- glue::glue(
      "to_char({flds[1]}, 'YYYY-MM-DD\"T\"HH24:MI:SS\"Z\"') || '/' || 
         to_char({flds[2]}, 'YYYY-MM-DD\"T\"HH24:MI:SS\"Z\"')"
    )
  }

  if (length(flds) == 1 & output == "time") {
    s <- glue::glue(
      "to_char({flds}, 'HH24:MI:SS\"Z\"')"
    )
  }
  if (length(flds) == 2 & output == "time") {
    s <- glue::glue(
      "to_char({flds[1]}, 'HH24:MI:SS\"Z\"') || '/' || 
       to_char({flds[2]}, 'HH24:MI:SS\"Z\"')"
    )
  }

  as.character(s)
}

# * Cruise level events ----

# get min/max datetimes for cruises
q_cruise_event_date <- q_events |>
  group_by(cruise_uuid) |>
  summarize(
    dtime_min = min(time_start, na.rm = T),
    dtime_max = max(time_start, na.rm = T),
    .groups = "drop"
  ) |>
  mutate(
    eventDate = sql(flds_to_iso8601(c("dtime_min", "dtime_max")))
  ) |>
  select(cruise_uuid, eventDate)

q_cruise_events <- q_events |>
  distinct(cruise_uuid, ship_name, ship_nodc, date_ym) |>
  mutate(
    eventID = cruise_uuid,
    # parentEventID  = NA, # no parent for cruise level; NA generated with union_all
    eventRemarks = paste("Cruise on", ship_name),
    samplingProtocol = "Marine plankton survey cruise",
    sampleSizeValue = NA_real_,
    sampleSizeUnit = NA_character_,
    habitat = "marine",
    eventType = "cruise"
  ) |>
  left_join(
    q_cruise_event_date,
    by = "cruise_uuid"
  ) |>
  select(
    eventID, # parentEventID,
    eventType,
    eventDate,
    samplingProtocol,
    eventRemarks,
    habitat
  )

# Site level events
q_site_events <- q_events |>
  distinct(
    site_uuid,
    cruise_uuid,
    longitude,
    latitude,
    line,
    station,
    orderocc
  ) |>
  mutate(
    eventID = site_uuid,
    parentEventID = cruise_uuid,
    decimalLatitude = latitude,
    decimalLongitude = longitude,
    eventRemarks = paste("CalCOFI station", station, "on line", line),
    # per https://proj.org/en/stable/operations/projections/calcofi.html
    locationID = paste0(line, "_", station),
    eventType = "site",
    samplingProtocol = "Marine plankton survey station"
  ) |>
  select(
    eventID,
    parentEventID,
    eventType,
    samplingProtocol,
    eventRemarks,
    decimalLatitude,
    decimalLongitude,
    locationID
  )

# Tow level events
q_tow_events <- q_events |>
  distinct(
    tow_uuid,
    site_uuid,
    tow_number,
    tow_type_key,
    tow_type_description = description,
    time_start
  ) |>
  mutate(
    eventID = tow_uuid,
    parentEventID = site_uuid,
    eventDate = sql(flds_to_iso8601("time_start")),
    eventRemarks = paste("Tow", tow_number, "-", tow_type_description),
    eventType = "tow",
    samplingProtocol = paste(tow_type_key, ":", tow_type_description)
  ) |>
  select(
    eventID,
    parentEventID,
    eventType,
    eventDate,
    samplingProtocol,
    eventRemarks
  )

# Net level events (sampling events)
q_net_events_allflds <- q_events |>
  distinct(
    net_uuid,
    tow_uuid,
    side,
    std_haul_factor,
    vol_sampled_m3,
    prop_sorted,
    smallplankton,
    totalplankton
  ) |>
  mutate(
    eventID = net_uuid,
    parentEventID = tow_uuid,
    eventRemarks = paste(
      "Net sample from side",
      side,
      "; only",
      prop_sorted,
      "proportion sorted of",
      vol_sampled_m3,
      "cubic meters sampled"
    ),
    eventType = "net_sample",
    samplingProtocol = paste("Plankton net tow -", side, "side"),
    sampleSizeValue = vol_sampled_m3,
    sampleSizeUnit = "m^3"
  )
q_net_events <- q_net_events_allflds |>
  select(
    eventID,
    parentEventID,
    eventType,
    samplingProtocol,
    eventRemarks,
    sampleSizeValue,
    sampleSizeUnit
  )

# Combine all event levels
db_version <- tbl(con, "schema_version") |> pull(version)
# dataset_id <- paste0(
#   # "https://github.com/CalCOFI/calcofi4db/releases/tag/v",
#   "calcofi_db.",
#   db_version,
#   "_larvae.",
#   format(Sys.Date(), "%Y-%m-%d")
# )
# "calcofi_db.1.1.0_larvae.2025-10-09"

q_event_core <- list(
  q_cruise_events |>
    mutate(
      parentEventID = sql("NULL::uuid"),
      sampleSizeValue = sql("NULL::numeric"),
      sampleSizeUnit = NA_character_
    ),
  q_site_events |>
    mutate(
      sampleSizeValue = sql("NULL::numeric"),
      sampleSizeUnit = NA_character_
    ),
  q_tow_events |>
    mutate(
      sampleSizeValue = sql("NULL::numeric"),
      sampleSizeUnit = NA_character_
    ),
  q_net_events
) |>
  reduce(union_all) |>
  mutate(
    # basisOfRecord                 = "HumanObservation",
    geodeticDatum = "WGS84",
    coordinateUncertaintyInMeters = 1000, # TODO: adjust based on accuracy of GPS/Loran/dead-reckoning over time
    countryCode = country_code, # country_code <- "US"
    waterBody = water_body,
    datasetID = dataset_id
  )
# TODO: datasetID: permalink to this qmd in calcofi4db release?

7 Create Occurrence Records

Now let’s combine all biological observations into occurrence records:

Code
# Create consolidated egg occurrences (one per species per net)
q_egg_occurrences <- tbl_egg |>
  left_join(tbl_species, by = "species_id") |>
  mutate(
    occurrenceID = sql("gen_random_uuid()"),
    eventID = net_uuid,
    scientificName = scientific_name,
    scientificNameID = paste0("urn:lsid:marinespecies.org:taxname:", worms_id),
    kingdom = "Animalia",
    occurrenceStatus = ifelse(tally > 0, "present", "absent"),
    organismQuantity = tally, # TOTAL count from egg table
    organismQuantityType = "individuals",
    lifeStage = "egg",
    #lifeStage            = "https://vocab.nerc.ac.uk/collection/S11/current/S1122/",
    # lifeStageID          = "http://vocab.nerc.ac.uk/collection/S11/current/S1110/",
    preparations = "egg sample",
    basisOfRecord = "HumanObservation"
  ) |>
  select(
    occurrenceID,
    eventID,
    species_id,
    scientificName,
    scientificNameID,
    kingdom,
    occurrenceStatus,
    organismQuantity,
    organismQuantityType,
    lifeStage, # lifeStageID,
    preparations,
    basisOfRecord
  )

# Create consolidated larva occurrences (one per species per net)
q_larva_occurrences <- tbl_larva |>
  left_join(tbl_species, by = "species_id") |>
  mutate(
    occurrenceID = sql("gen_random_uuid()"),
    eventID = net_uuid,
    scientificName = scientific_name,
    scientificNameID = paste0("urn:lsid:marinespecies.org:taxname:", worms_id),
    kingdom = "Animalia",
    occurrenceStatus = ifelse(tally > 0, "present", "absent"),
    organismQuantity = tally, # TOTAL count from larva table
    organismQuantityType = "individuals",
    lifeStage = "larva",
    # lifeStageID          = "http://vocab.nerc.ac.uk/collection/S11/current/S1133/",
    preparations = "larva sample",
    basisOfRecord = "HumanObservation"
  ) |>
  select(
    occurrenceID,
    eventID,
    species_id,
    scientificName,
    scientificNameID,
    kingdom,
    occurrenceStatus,
    organismQuantity,
    organismQuantityType,
    lifeStage, # lifeStageID,
    preparations,
    basisOfRecord
  )

# Combine egg and larva occurrences
q_occurrence_extension <- list(
  q_egg_occurrences,
  q_larva_occurrences
) |>
  reduce(union_all) |>
  # Add additional required/recommended fields
  mutate(
    # TODO: identifiedBy, dateIdentified,identificationReferences
    # identifiedBy = "[NEED IDENTIFIER NAME]", # Prompt: Add taxonomist name
    # dateIdentified = "[NEED IDENTIFICATION DATE]", # Prompt: Add identification date
    # identificationReferences = "[NEED REFERENCES]", # Prompt: Add identification guides used
    modified = sql("CURRENT_DATE")
  ) |>
  # Remove species_id (internal field used only for joins)
  select(
    occurrenceID,
    eventID,
    scientificName,
    scientificNameID,
    kingdom,
    occurrenceStatus,
    organismQuantity,
    organismQuantityType,
    lifeStage, # lifeStageID,
    preparations,
    basisOfRecord,
    modified
  )

8 Create ExtendedMeasurementOrFact Extension

Create eMoF records for three types of measurements:

  1. Sample-level measurements (linked to eventID)
  2. Stage-specific abundance (linked to occurrenceID)
  3. Body length measurements (linked to occurrence ID)
Code
# 1. Sample-level measurements (eventID, occurrenceID = NA) ----
d_sample_measurements <- q_net_events_allflds |>
  select(
    eventID,
    std_haul_factor,
    prop_sorted,
    smallplankton,
    totalplankton
  ) |> # vol_sampled_m3 moved to Event core
  collect() |>
  pivot_longer(
    cols = c(std_haul_factor, prop_sorted, smallplankton, totalplankton),
    names_to = "meas_type",
    values_to = "measurementValue"
  ) |>
  filter(!is.na(measurementValue)) |>
  mutate(
    measurementID = UUIDgenerate(n = n()),
    occurrenceID = NA_character_,
    measurementType = case_when(
      meas_type == "std_haul_factor" ~ "standardized haul factor",
      meas_type == "prop_sorted" ~ "proportion of sample sorted",
      meas_type == "smallplankton" ~ "small plankton biomass",
      meas_type == "totalplankton" ~ "total plankton biomass"
    ),
    measurementTypeID = case_when(
      meas_type == "std_haul_factor" ~ NA_character_, # TODO: search P01 for standardization factor
      meas_type == "prop_sorted" ~ NA_character_, # TODO: search P01 for subsample fraction
      meas_type == "smallplankton" ~ NA_character_, # TODO: search P01 for zooplankton biomass
      meas_type == "totalplankton" ~ NA_character_
    ), # TODO: search P01 for zooplankton biomass
    measurementUnit = case_when(
      meas_type == "std_haul_factor" ~ "dimensionless",
      meas_type == "prop_sorted" ~ "dimensionless",
      meas_type %in% c("smallplankton", "totalplankton") ~ "grams"
    ),
    measurementMethod = "https://oceanview.pfeg.noaa.gov/CalCOFI/calcofi_info.html",
    measurementRemarks = case_when(
      meas_type ==
        "std_haul_factor" ~ "Standardization factor accounting for water filtered; abundances expressed as per 10 m² (Smith 1977)",
      meas_type == "prop_sorted" ~ "Fraction of total sample examined",
      TRUE ~ NA_character_
    )
  ) |>
  left_join(measurement_unit_vocab, by = "measurementUnit") |>
  select(
    eventID,
    occurrenceID,
    measurementID,
    measurementType,
    measurementTypeID,
    measurementValue,
    measurementUnit,
    measurementUnitID,
    measurementMethod,
    measurementRemarks
  )

# 2. Stage-specific abundance (eventID = NA, occurrenceID) ----

# Egg stage abundances - need to join with egg occurrences to get occurrenceID
d_egg_stage_abundance <- tbl_egg_stage |>
  left_join(
    egg_stage_vocab |>
      copy_to(
        con,
        df = _,
        name = "egg_stage_vocab_tmp",
        temporary = TRUE,
        overwrite = T
      ),
    by = "stage"
  ) |>
  collect() |>
  left_join(
    q_egg_occurrences |>
      collect() |>
      select(net_uuid = eventID, species_id, occurrenceID),
    by = c("net_uuid", "species_id")
  ) |>
  mutate(
    eventID = NA_character_,
    measurementID = UUIDgenerate(n = n()),
    measurementType = "abundance by life stage",
    measurementTypeID = NA_character_, # TODO: search P01 for "abundance" + "development stage"
    measurementValue = tally,
    measurementValueID = NA_character_, # Could reference custom vocabulary for egg stages
    measurementUnit = "individuals",
    measurementMethod = "https://oceanview.pfeg.noaa.gov/CalCOFI/calcofi_info.html",
    measurementRemarks = stage_description
  ) |>
  left_join(measurement_unit_vocab, by = "measurementUnit") |>
  select(
    eventID,
    occurrenceID,
    measurementID,
    measurementType,
    measurementTypeID,
    measurementValue,
    measurementValueID,
    #measurementUnit, measurementUnitID = measurementUnitID.y, measurementMethod,
    measurementUnit,
    measurementUnitID,
    measurementMethod,
    measurementRemarks
  )

# Larva stage abundances - need to join with larva occurrences to get occurrenceID
d_larva_stage_abundance <- tbl_larva_stage |>
  left_join(
    larva_stage_vocab |>
      copy_to(
        con,
        df = _,
        name = "larva_stage_vocab_tmp",
        temporary = TRUE,
        overwrite = T
      ),
    by = "stage"
  ) |>
  collect() |>
  left_join(
    q_larva_occurrences |>
      collect() |>
      select(net_uuid = eventID, species_id, occurrenceID),
    by = c("net_uuid", "species_id")
  ) |>
  mutate(
    eventID = NA_character_,
    measurementID = UUIDgenerate(n = n()),
    measurementType = "abundance by life stage",
    measurementTypeID = NA_character_, # TODO: search P01 for "abundance" + "development stage"
    measurementValue = tally,
    measurementValueID = NA_character_, # Could reference custom vocabulary for larva stages
    measurementUnit = "individuals",
    measurementMethod = "https://oceanview.pfeg.noaa.gov/CalCOFI/calcofi_info.html",
    measurementRemarks = stage_description
  ) |>
  left_join(measurement_unit_vocab, by = "measurementUnit") |>
  select(
    eventID,
    occurrenceID,
    measurementID,
    measurementType,
    measurementTypeID,
    measurementValue,
    measurementValueID,
    measurementUnit,
    measurementUnitID,
    measurementMethod,
    measurementRemarks
  )

# 3. Body length measurements (eventID = NA, occurrenceID) ----
d_body_length <- tbl_larva_size |>
  collect() |>
  left_join(
    q_larva_occurrences |>
      collect() |>
      select(net_uuid = eventID, species_id, occurrenceID),
    by = c("net_uuid", "species_id")
  ) |>
  filter(!is.na(length_mm)) |>
  mutate(
    eventID = NA_character_,
    measurementID = UUIDgenerate(n = n()),
    measurementType = "body length",
    measurementTypeID = NA_character_, # TODO: search P01 for fish larvae body length (e.g., OBSINDLX)
    measurementValue = length_mm,
    measurementValueID = NA_character_,
    measurementUnit = "millimeters",
    measurementMethod = "https://oceanview.pfeg.noaa.gov/CalCOFI/calcofi_info.html",
    measurementRemarks = paste0("Count: ", tally, "; Total length measurement")
  ) |>
  left_join(measurement_unit_vocab, by = "measurementUnit") |>
  select(
    eventID,
    occurrenceID,
    measurementID,
    measurementType,
    measurementTypeID,
    measurementValue,
    measurementValueID,
    measurementUnit,
    measurementUnitID,
    measurementMethod,
    measurementRemarks
  )

# Combine all eMoF records ----
d_extendedmeasurementorfact_extension <- bind_rows(
  d_sample_measurements,
  d_egg_stage_abundance,
  d_larva_stage_abundance,
  d_body_length
)

# table(is.na(d_extendedmeasurementorfact_extension$occurrenceID))
#   FALSE    TRUE
# 367,218 243,598
#
# table(is.na(d_extendedmeasurementorfact_extension$eventID))
#   FALSE    TRUE
# 243,598 367,218

9 Write DarwinCore Archive Files

Export the event core and extension files (occurrence, eMoF) as CSV files:

Code
# Create output directory
dir.create(dir_out, showWarnings = F)

# Write core and extension files
write_csv(q_event_core |> collect(), file.path(dir_out, "event.csv"))
write_csv(
  q_occurrence_extension |> collect(),
  file.path(dir_out, "occurrence.csv")
)
write_csv(
  d_extendedmeasurementorfact_extension,
  file.path(dir_out, "extendedMeasurementOrFact.csv")
)

10 Create meta.xml programmatically by cross-walking CSV fields to DwC terms

Generate meta.xml by mapping CSV column names to DarwinCore term URIs:

Code
# Define DarwinCore term mappings for each file type
dwc_terms <- list(
  event = list(
    rowType = "http://rs.tdwg.org/dwc/terms/Event",
    idField = "eventID",
    terms = c(
      eventID = "http://rs.tdwg.org/dwc/terms/eventID",
      parentEventID = "http://rs.tdwg.org/dwc/terms/parentEventID",
      eventType = "http://rs.tdwg.org/dwc/terms/eventType",
      eventDate = "http://rs.tdwg.org/dwc/terms/eventDate",
      eventTime = "http://rs.tdwg.org/dwc/terms/eventTime",
      samplingProtocol = "http://rs.tdwg.org/dwc/terms/samplingProtocol",
      sampleSizeValue = "http://rs.tdwg.org/dwc/terms/sampleSizeValue",
      sampleSizeUnit = "http://rs.tdwg.org/dwc/terms/sampleSizeUnit",
      eventRemarks = "http://rs.tdwg.org/dwc/terms/eventRemarks",
      habitat = "http://rs.tdwg.org/dwc/terms/habitat",
      decimalLatitude = "http://rs.tdwg.org/dwc/terms/decimalLatitude",
      decimalLongitude = "http://rs.tdwg.org/dwc/terms/decimalLongitude",
      geodeticDatum = "http://rs.tdwg.org/dwc/terms/geodeticDatum",
      coordinateUncertaintyInMeters = "http://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters",
      locationID = "http://rs.tdwg.org/dwc/terms/locationID",
      countryCode = "http://rs.tdwg.org/dwc/terms/countryCode",
      waterBody = "http://rs.tdwg.org/dwc/terms/waterBody",
      datasetID = "http://rs.tdwg.org/dwc/terms/datasetID"
      # basisOfRecord                 = "http://rs.tdwg.org/dwc/terms/basisOfRecord"
    )
  ),
  occurrence = list(
    rowType = "http://rs.tdwg.org/dwc/terms/Occurrence",
    idField = "occurrenceID",
    coreIdField = "eventID",
    terms = c(
      occurrenceID = "http://rs.tdwg.org/dwc/terms/occurrenceID",
      eventID = "http://rs.tdwg.org/dwc/terms/eventID",
      scientificName = "http://rs.tdwg.org/dwc/terms/scientificName",
      scientificNameID = "http://rs.tdwg.org/dwc/terms/scientificNameID",
      kingdom = "http://rs.tdwg.org/dwc/terms/kingdom",
      occurrenceStatus = "http://rs.tdwg.org/dwc/terms/occurrenceStatus",
      organismQuantity = "http://rs.tdwg.org/dwc/terms/organismQuantity",
      organismQuantityType = "http://rs.tdwg.org/dwc/terms/organismQuantityType",
      lifeStage = "http://rs.tdwg.org/dwc/terms/lifeStage",
      # lifeStageID          = "http://rs.tdwg.org/dwc/terms/lifeStageID",
      preparations = "http://rs.tdwg.org/dwc/terms/preparations",
      basisOfRecord = "http://rs.tdwg.org/dwc/terms/basisOfRecord",
      modified = "http://purl.org/dc/terms/modified"
    )
  ),
  extendedmeasurementorfact = list(
    rowType = "http://rs.iobis.org/obis/terms/ExtendedMeasurementOrFact",
    idField = "measurementID",
    coreIdField = "eventID", # Primary linking field
    terms = c(
      eventID = "http://rs.tdwg.org/dwc/terms/eventID",
      occurrenceID = "http://rs.tdwg.org/dwc/terms/occurrenceID",
      measurementID = "http://rs.tdwg.org/dwc/terms/measurementID",
      measurementType = "http://rs.tdwg.org/dwc/terms/measurementType",
      measurementTypeID = "http://rs.iobis.org/obis/terms/measurementTypeID",
      measurementValue = "http://rs.tdwg.org/dwc/terms/measurementValue",
      measurementValueID = "http://rs.iobis.org/obis/terms/measurementValueID",
      measurementUnit = "http://rs.tdwg.org/dwc/terms/measurementUnit",
      measurementUnitID = "http://rs.iobis.org/obis/terms/measurementUnitID",
      measurementMethod = "http://rs.tdwg.org/dwc/terms/measurementMethod",
      measurementRemarks = "http://rs.tdwg.org/dwc/terms/measurementRemarks"
    )
  )
)

# Function to create field elements for meta.xml
create_field_elements <- function(csv_file, term_map, coreid_field = NULL) {
  # Read column names from CSV
  col_names <- names(read_csv(csv_file, n_max = 0, show_col_types = FALSE))

  # Map columns to DwC terms
  field_elements <- map(seq_along(col_names), function(i) {
    col <- col_names[i]

    # Skip coreid column if specified (it's defined separately in <coreid> element)
    if (!is.null(coreid_field) && col == coreid_field) {
      return(NULL)
    }

    term <- term_map$terms[[col]]

    if (!is.null(term)) {
      glue::glue('    <field index="{i-1}" term="{term}"/>')
    } else {
      # Field exists in CSV but not in DwC mapping - skip or warn
      message(
        "Warning: Column '",
        col,
        "' in ",
        csv_file,
        " has no DwC mapping and will be skipped in meta.xml."
      )
      NULL
    }
  })

  # Remove NULL entries and combine
  field_elements <- compact(field_elements)
  paste(field_elements, collapse = "\n")
}

# Function to determine coreID index
get_coreid_index <- function(csv_file, coreid_field) {
  col_names <- names(read_csv(csv_file, n_max = 0, show_col_types = FALSE))

  # Handle multiple possible coreID fields (eventID or occurrenceID)
  if (length(coreid_field) > 1) {
    # Find which one exists in the CSV
    coreid_field <- intersect(coreid_field, col_names)[1]
  }

  which(col_names == coreid_field) - 1 # 0-indexed
}

# Generate meta.xml content
generate_meta_xml <- function(dir_out) {
  # Core (Event)
  event_file <- file.path(dir_out, "event.csv")
  event_fields <- create_field_elements(event_file, dwc_terms$event)
  event_id_idx <- which(
    names(read_csv(event_file, n_max = 0, show_col_types = FALSE)) ==
      dwc_terms$event$idField
  ) -
    1

  # Extension: Occurrence
  occ_file <- file.path(dir_out, "occurrence.csv")
  occ_fields <- create_field_elements(
    occ_file,
    dwc_terms$occurrence,
    coreid_field = dwc_terms$occurrence$coreIdField
  )
  occ_coreid_idx <- get_coreid_index(occ_file, dwc_terms$occurrence$coreIdField)

  # Extension: ExtendedMeasurementOrFact
  emof_file <- file.path(dir_out, "extendedMeasurementOrFact.csv")
  emof_fields <- create_field_elements(
    emof_file,
    dwc_terms$extendedmeasurementorfact,
    coreid_field = dwc_terms$extendedmeasurementorfact$coreIdField
  )
  emof_coreid_idx <- get_coreid_index(
    emof_file,
    dwc_terms$extendedmeasurementorfact$coreIdField
  )

  # Build complete meta.xml
  meta_xml <- glue::glue(
    '<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://rs.tdwg.org/dwc/text/ http://rs.tdwg.org/dwc/text/tdwg_dwc_text.xsd">
  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"
        fieldsEnclosedBy=\'"\' ignoreHeaderLines="1" rowType="{dwc_terms$event$rowType}">
    <files>
      <location>event.csv</location>
    </files>
    <id index="{event_id_idx}" />
{event_fields}
  </core>
  <extension encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"
             fieldsEnclosedBy=\'"\' ignoreHeaderLines="1" rowType="{dwc_terms$occurrence$rowType}">
    <files>
      <location>occurrence.csv</location>
    </files>
    <coreid index="{occ_coreid_idx}" />
{occ_fields}
  </extension>
  <extension encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"
             fieldsEnclosedBy=\'"\' ignoreHeaderLines="1" rowType="{dwc_terms$extendedmeasurementorfact$rowType}">
    <files>
      <location>extendedMeasurementOrFact.csv</location>
    </files>
    <coreid index="{emof_coreid_idx}" />
{emof_fields}
  </extension>
</archive>'
  )

  return(meta_xml)
}

# Generate and write meta.xml
meta_xml <- generate_meta_xml(dir_out)
writeLines(meta_xml, file.path(dir_out, "meta.xml"))

cat("Generated meta.xml with field mappings:\n")
Generated meta.xml with field mappings:
Code
cat(meta_xml)
<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://rs.tdwg.org/dwc/text/ http://rs.tdwg.org/dwc/text/tdwg_dwc_text.xsd">
  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\n"
        fieldsEnclosedBy='"' ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Event">
    <files>
      <location>event.csv</location>
    </files>
    <id index="0" />
    <field index="0" term="http://rs.tdwg.org/dwc/terms/eventID"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/eventType"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/samplingProtocol"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/eventRemarks"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/habitat"/>
    <field index="6" term="http://rs.tdwg.org/dwc/terms/parentEventID"/>
    <field index="7" term="http://rs.tdwg.org/dwc/terms/sampleSizeValue"/>
    <field index="8" term="http://rs.tdwg.org/dwc/terms/sampleSizeUnit"/>
    <field index="9" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
    <field index="10" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
    <field index="11" term="http://rs.tdwg.org/dwc/terms/locationID"/>
    <field index="12" term="http://rs.tdwg.org/dwc/terms/geodeticDatum"/>
    <field index="13" term="http://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters"/>
    <field index="14" term="http://rs.tdwg.org/dwc/terms/countryCode"/>
    <field index="15" term="http://rs.tdwg.org/dwc/terms/waterBody"/>
    <field index="16" term="http://rs.tdwg.org/dwc/terms/datasetID"/>
  </core>
  <extension encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\n"
             fieldsEnclosedBy='"' ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files>
      <location>occurrence.csv</location>
    </files>
    <coreid index="1" />
    <field index="0" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/scientificNameID"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/kingdom"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/occurrenceStatus"/>
    <field index="6" term="http://rs.tdwg.org/dwc/terms/organismQuantity"/>
    <field index="7" term="http://rs.tdwg.org/dwc/terms/organismQuantityType"/>
    <field index="8" term="http://rs.tdwg.org/dwc/terms/lifeStage"/>
    <field index="9" term="http://rs.tdwg.org/dwc/terms/preparations"/>
    <field index="10" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/>
    <field index="11" term="http://purl.org/dc/terms/modified"/>
  </extension>
  <extension encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\n"
             fieldsEnclosedBy='"' ignoreHeaderLines="1" rowType="http://rs.iobis.org/obis/terms/ExtendedMeasurementOrFact">
    <files>
      <location>extendedMeasurementOrFact.csv</location>
    </files>
    <coreid index="0" />
    <field index="1" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/measurementID"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/measurementType"/>
    <field index="4" term="http://rs.iobis.org/obis/terms/measurementTypeID"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/measurementValue"/>
    <field index="6" term="http://rs.tdwg.org/dwc/terms/measurementUnit"/>
    <field index="7" term="http://rs.iobis.org/obis/terms/measurementUnitID"/>
    <field index="8" term="http://rs.tdwg.org/dwc/terms/measurementMethod"/>
    <field index="9" term="http://rs.tdwg.org/dwc/terms/measurementRemarks"/>
    <field index="10" term="http://rs.iobis.org/obis/terms/measurementValueID"/>
  </extension>
</archive>

11 Create EML Metadata

Build EML (Ecological Metadata Language) document with geographic, temporal, and taxonomic coverage:

Code
# Geographic coverage
geo_coverage <- tbl_site |>
  summarise(
    westBoundingCoordinate = min(longitude, na.rm = T),
    eastBoundingCoordinate = max(longitude, na.rm = T),
    northBoundingCoordinate = max(latitude, na.rm = T),
    southBoundingCoordinate = min(latitude, na.rm = T)
  ) |>
  collect()

# Temporal coverage
temp_coverage <- tbl_tow |>
  summarise(
    beginDate = min(time_start, na.rm = TRUE),
    endDate = max(time_start, na.rm = TRUE)
  ) |>
  collect() |>
  pivot_longer(
    cols = everything(),
    names_to = "name",
    values_to = "value"
  ) |>
  mutate(
    value = as.Date(value)
  ) |>
  deframe() |>
  as.list()

# Taxonomic coverage

librarian::shelf(duckdb)
db <- "https://file.calcofi.io/data/calcofi.duckdb"
tmp_db <- file.path(tempdir(), "calcofi.duckdb")
con_dk <- dbConnect(duckdb::duckdb(), dbdir = tmp_db, read_only = F)
# attach remote duckdb database, but first detect if already attached
if (!"taxonomy" %in% dbListTables(con_dk)) {
  dbSendQuery(con_dk, glue::glue("ATTACH DATABASE '{db}' AS cc;"))
}
# dbListTables(con)
#  [1] "bottle"         "cast"           "cruise"         "egg"            "egg_stage"      "grid"
#  [7] "larva"          "larva_size"     "larva_stage"    "net"            "schema_version" "ship"
# [13] "site"           "site_seg"       "species"        "taxa_rank"      "taxonomy"       "tow"
# [19] "tow_type"
d_taxa <- dbReadTable(con_dk, Id(schema = "cc", table = "taxonomy"))

# WoRMS superseded identifiers for some CalCOFI taxa, otherwise missing in d_taxa so no scientificName found (EML fails validation)
d_worms_superseded = tribble(
  ~sci_old             , ~worms_old , ~sci_new                , ~worms_new ,
  "Serraninae"         ,     151756 , "Serranidae"            ,     125561 ,
  "Anthiadinae"        ,    1061345 , "Anthiadidae"           ,    1758484 ,
  "Epinephelinae"      ,     151758 , "Epinephelidae"         ,    1776907 ,
  "Paranthias colonus" ,     282083 , "Cephalopholis colonus" ,    1750630
)

taxa_coverage <- tbl_species |>
  mutate(
    worms_id = case_when(
      is.na(worms_id) | worms_id == 0 ~ NA_integer_,
      TRUE ~ worms_id
    )
  ) |>
  collect() |>
  # replace worms_id with updated WoRMS IDs where applicable
  left_join(
    d_worms_superseded |> select(worms_old, worms_new),
    by = join_by(worms_id == worms_old)
  ) |>
  mutate(
    worms_id = case_when(
      !is.na(worms_new) ~ worms_new,
      TRUE ~ worms_id
    )
  ) |>
  select(-worms_new) |>
  filter(!is.na(worms_id)) |>
  distinct(species_id, worms_id, common_name) |>
  left_join(
    d_taxa |>
      distinct(
        taxonID,
        taxonRank,
        scientificName
      ),
    by = join_by(worms_id == taxonID)
  ) |>
  # De-duplicate in case multiple local species map to same WoRMS ID
  group_by(worms_id, taxonRank, scientificName) |>
  summarize(
    species_id = first(species_id),
    common_name = paste(unique(na.omit(common_name)), collapse = ", "),
    .groups = "drop"
  ) |>
  arrange(species_id) |>
  pmap(function(
    species_id,
    taxonRank,
    scientificName,
    common_name,
    worms_id,
    ...
  ) {
    item <- list(
      id = glue("calcofi_species_{species_id}"),
      taxonRankName = taxonRank,
      taxonRankValue = scientificName,
      taxonId = list(
        provider = "https://www.marinespecies.org",
        as.character(worms_id)
      )
    )

    if (!is.na(common_name) && common_name != "") {
      item$commonName <- common_name
    }

    return(item)
  })

# Metadata validation result: Invalid
#  Basic Metadata  Edit
# eml.intellectualRights.license: Data Licence is required.
#  Taxonomic Coverage  Edit
# eml.taxonomicCoverages[0].taxonKeywords[340].scientificName: Scientific Name is required.
# eml.taxonomicCoverages[0].taxonKeywords[345].scientificName: Scientific Name is required.
# eml.taxonomicCoverages[0].taxonKeywords[348].scientificName: Scientific Name is required.
# eml.taxonomicCoverages[0].taxonKeywords[363].scientificName: Scientific Name is required.

# find taxa_coverage items with missing scientificName
missing_taxa <- keep(taxa_coverage, ~ is.na(.x$taxonRankValue))
missing_sp_ids <- map_chr(missing_taxa, "id") |>
  stringr::str_replace("calcofi_species_", "") |>
  as.integer()
missing_sp <- tbl_species |>
  filter(species_id %in% missing_sp_ids)
#   species_id scientific_name    itis_id worms_id common_name
#        <int> <chr>                <int>    <int> <chr>
# 1        430 Serraninae          643089   151756 NA
# 2        435 Anthiinae           167675  1061345 NA
# 3        438 Grammistini         643093   151758 Soapfishes
# 4        453 Paranthias colonus  551113   282083 Pacific creolefish
if (length(missing_taxa) > 0) {
  cat("WARNING: The following taxa are missing scientific names:\n")
  print(missing_taxa)
  cat(
    "\nPlease add scientific names for these taxa before finalizing the dataset.\n"
  )
}

# Create EML document
my_eml <- list(
  packageId = dataset_id,
  system = "calcofi.io",
  dataset = list(
    title = dataset_title,
    intellectualRights = intellectual_rights,
    creator = list(
      individualName = list(
        givenName = "Ed",
        surName = "Weber"
      ),
      organizationName = "NOAA SWFSC",
      electronicMailAddress = "ed.weber@noaa.gov",
      userId = list(
        directory = "https://orcid.org/",
        userId = "0000-0002-0942-434X"
      )
    ),
    abstract = dataset_abstract,
    keywordSet = list(
      keyword = dataset_keywords,
      keywordThesaurus = "GCMD Science Keywords"
    ),
    coverage = list(
      geographicCoverage = list(
        geographicDescription = "California Current Large Marine Ecoregion",
        boundingCoordinates = geo_coverage
      ),
      temporalCoverage = list(
        rangeOfDates = list(
          beginDate = list(
            calendarDate = as.character(temp_coverage$beginDate)
          ),
          endDate = list(calendarDate = as.character(temp_coverage$endDate))
        )
      ),
      taxonomicCoverage = list(
        generalTaxonomicCoverage = "Marine ichthyoplankton and fish eggs",
        taxonomicClassification = taxa_coverage
      )
    ),
    contact = list(
      individualName = list(
        givenName = "Ed",
        surName = "Weber"
      ),
      electronicMailAddress = "ed.weber@noaa.gov"
    ),
    methods = list(
      methodStep = list(
        description = list(
          para = sampling_methods_description
        )
      ),
      sampling = list(
        studyExtent = list(
          description = list(
            para = study_extent_description
          )
        ),
        samplingDescription = list(
          para = sampling_description
        )
      )
    ),
    project = list(
      title = dataset_title,
      personnel = list(
        individualName = list(
          givenName = "Ed",
          surName = "Weber"
        ),
        role = "Data Manager"
      ),
      funding = list(
        para = funding_information
      )
    )
  )
)

eml_xml <- glue("{dir_out}/eml.xml")

# my_eml <- EML::read_eml(eml_xml)

# Write EML
write_eml(my_eml, eml_xml)

# replace license placeholder with actual license xml
librarian::shelf(stringr)
readLines(eml_xml) |>
  stringr::str_replace(fixed(names(license_xml)), license_xml) |>
  writeLines(eml_xml)

validation_result <- eml_validate(eml_xml)
if (!validation_result) {
  cat("EML validation errors:\n")
  print(attr(validation_result, "errors"))
  stop()
} else {
  cat("EML is valid!\n")
}
EML is valid!

12 Data Quality Checks

Validate data quality: check for missing WoRMS IDs, orphan records, and output summary statistics:

Code
# Check for missing WoRMS IDs
d_missing_worms <- tbl_species |>
  filter(is.na(worms_id)) |>
  select(species_id, scientific_name) |>
  collect()

if (nrow(d_missing_worms) > 0) {
  cat("WARNING: The following species are missing WoRMS IDs:\n")
  print(d_missing_worms)
  cat(
    "\nPlease add WoRMS AphiaIDs for these species before finalizing the dataset.\n"
  )
}

# Check for orphan records
d_orphan_occurrences <- q_occurrence_extension |>
  anti_join(q_event_core, by = "eventID") |>
  collect()

if (nrow(d_orphan_occurrences) > 0) {
  cat(
    "\nWARNING: Found",
    nrow(d_orphan_occurrences),
    "occurrence records without matching events.\n"
  )
}

# Summary statistics
n_events <- q_event_core |> summarize(n = n()) |> pull(n)
n_events <- q_event_core |>
  filter(eventType == "net_sample") |>
  summarize(n = n()) |>
  pull(n)
n_cruises <- q_event_core |>
  filter(eventType == "cruise") |>
  summarize(n = n()) |>
  pull(n)
n_sites <- q_event_core |>
  filter(eventType == "site") |>
  summarize(n = n()) |>
  pull(n)
n_tows <- q_event_core |>
  filter(eventType == "tow") |>
  summarize(n = n()) |>
  pull(n)
n_nets <- q_event_core |>
  filter(eventType == "net_sample") |>
  summarize(n = n()) |>
  pull(n)
n_occs <- q_occurrence_extension |> summarize(n = n()) |> pull(n)
n_species <- q_occurrence_extension |>
  distinct(scientificName) |>
  summarize(n = n()) |>
  pull(n)
n_measures <- d_extendedmeasurementorfact_extension |> nrow()

glue(
  "
  === Dataset Summary ===
  Total events:        {prettyNum(n_events, big.mark = ',')}
  - Cruises:           {prettyNum(n_cruises, big.mark = ',')}
  - Sites:             {prettyNum(n_sites, big.mark = ',')}
  - Tows:              {prettyNum(n_tows, big.mark = ',')}
  - Net samples:       {prettyNum(n_nets, big.mark = ',')}
  Total occurrences:   {prettyNum(n_occs, big.mark = ',')}
  Total species:       {prettyNum(n_species, big.mark = ',')}
  Total measurements:  {prettyNum(n_measures, big.mark = ',')}"
)
=== Dataset Summary ===
Total events:        76,512
- Cruises:           676
- Sites:             57,914
- Tows:              75,434
- Net samples:       76,512
Total occurrences:   463,655
Total species:       899
Total measurements:  610,816

13 Package Creation

Create the final DarwinCore Archive as a zip file containing all CSV files, meta.xml, and eml.xml:

Code
# Create DwC-A zip file
zip_file <- file.path(dir_out, paste0("../larvae_", Sys.Date(), ".zip"))
zip(
  zip_file,
  files = file.path(
    dir_out,
    c(
      "event.csv",
      "occurrence.csv",
      "extendedMeasurementOrFact.csv",
      "meta.xml",
      "eml.xml"
    )
  ),
  flags = "-j"
) # exclude parent folders

cat("\nDarwin Core Archive created:", zip_file, "\n")

Darwin Core Archive created: /Users/bbest/Github/CalCOFI/workflows/data/darwincore/larvae/../larvae_2026-01-09.zip 

14 Check CSV

Verify the exported CSV files and inspect measurement types and units:

Code
d_event <- read_csv(glue("{dir_out}/event.csv"))
d_occurrence <- read_csv(glue("{dir_out}/occurrence.csv"))
d_emof <- read_csv(glue("{dir_out}/extendedMeasurementOrFact.csv"))

# d_event
# d_occurrence
# d_emof

# d_emof |> select(measurementType, measurementUnit) |> table(useNA = "ifany")
#                              measurementUnit
# measurementType               dimensionless  grams individuals millimeters
#   abundance by life stage                 0      0      125347           0
#   body length                             0      0           0      241871
#   proportion of sample sorted         76512      0           0           0
#   small plankton biomass                  0  45287           0           0
#   standardized haul factor            76512      0           0           0
#   total plankton biomass                  0  45287           0           0

15 Dataset metadata

  • Dataset ID
    calcofi.io_workflow_larvae_to_obis_2026-01-09
    (This points to this workflow. See other https://manual.obis.org/examples.html, e.g. datasetID: https://marineinfo.org/id/dataset/6403)

  • Dataset title
    CalCOFI Fish Larvae Tows

  • Dataset abstract
    (200-500 words describing the dataset)
    Fish larvae counts and standardized counts for eggs captured in CalCOFI ichthyoplankton nets (primarily vertical [Calvet or Pairovet], oblique [bongo or ring nets], and surface tows [Manta nets]). Surface tows are normally standardized to count per 1,000 m^3 strained. Oblique tows are normally standardized to count per 10 m^2 of surface sampled.
    (Source: erdCalCOFIlrvcnt | CoastWatch ERDDAP)

  • Keywords
    (3-5 relevant keywords)
    atmosphere, biology, biosphere, calcofi, earth science, environment, latitude, longitude, ocean, time

  • Country code
    (ISO 3166-1 alpha-2)
    US

  • Water body name
    California Current

  • Sampling Methods Description
    Standard CalCOFI ichthyoplankton tows. The standard oblique tow uses a bongo net (71 cm diameter, 0.505 mm mesh) or 1-m ring net (prior to 1978) retrieved at a constant wire angle (45 degrees) from 210 m depth to surface. Surface tows use a Manta net; vertical tows use CalVET/Pairovet nets. Flowmeters measure volume filtered. Samples are preserved in 5% formalin. In the lab, fish eggs and larvae are sorted, identified to lowest taxon possible, enumerated, and measured.

  • Study Extent Description
    The study covers the California Current ecosystem, primarily off Southern California (standard lines 77-93) but historically extending from the border of Canada to the tip of Baja California. The time series for this dataset generally begins in 1951 and continues to the present, with quarterly cruises.

  • Sampling Description
    Samples are collected at fixed stations along the CalCOFI grid. Oblique tows are standardized to counts per 10 m^2 of sea surface. Surface tows are standardized to counts per 1,000 m^3. Data includes raw counts (tallies) and standardized abundances.

  • Funding Information
    CalCOFI is a partnership between the NOAA National Marine Fisheries Service (NMFS), Scripps Institution of Oceanography (SIO), and California Department of Fish and Wildlife (CDFW).

  • Intellectual Rights
    The data may be used and redistributed for free but is not intended for legal use, since it may contain inaccuracies. Neither the data Contributor, ERD, NOAA, nor the United States Government, nor any of their employees or contractors, makes any warranty, express or implied, including warranties of merchantability and fitness for a particular purpose, or assumes any legal liability for the accuracy, completeness, or usefulness, of this information. To the extent possible under law, the publisher has waived all rights to these data and has dedicated them to the Public Domain (CC0 1.0). Users may copy, modify, distribute and use the work, including for commercial purposes, without restriction.
    (Sources: erdCalCOFIlrvcnt | CoastWatch ERDDAP; Applying a License to a Dataset | GBIF IPT User Manual)