8  Data sharing

Our data sharing policies promote FAIR principles so our data are “as open as possible, as closed as necessary(European Commission 2016). Data should be open enough to facilitate efficient re-use; avoid duplicating data collection efforts; enhance scholarly rigor; and promote engagement across the research and public communities (Whyte and Pryor 2011). However, data must also be closed enough to protect grower privacy and to honor agreements with growers and other researchers (Czarnecki and Jones 2022; Korzekwa 2023).

The SOS relies on growers’ willingness to volunteer their fields for sampling and participate in the required management survey. Their willingness depends on their trust in WaSHI to protect their privacy. Only aggregated and anonymized results will be publicly available or shared. The below data privacy statement may be shared with potential participants.

Data privacy statement

All survey responses will be combined across all respondents. Results will not be reported in a way that makes individuals identifiable. Information collected in this survey are subject to release in accordance with RCW 42.56 (Public Records Act).

Procedures for anonymizing data are detailed in Section 8.2.

8.1 WaTech data categorization

Under Washington State Policy 141.10 (Securing Information Technology Assets), state agencies must classify data into categories based on the sensitivity of the data. WaTech provides guidance on the four categories of data.

Category 4: “Confidential information requiring special handling”

WaTech: Data requires strict handling requirements applied by statues or regulations.

SOS: Not applicable.

Category 3: “Confidential information”

WaTech: Data includes “personal information” as defined in RCW 42.56.590 (Security Breaches) and RCW 19.255.010 (Personal Information Disclosure). An individual’s first name or first initial and last name in combination with at least one of the following elements: social security number, driver’s license or Washington identification card number, or any account numbers that permit access to their financial account.

SOS: We do not collect the above elements in combination with grower names. However, we treat individual names, farm names, and latitude and longitude coordinates as confidential information.

Category 2: “Sensitive information”

WaTech: Data are intended for official use only and withheld unless specifically requested.

SOS: This category includes lab results and management surveys. Access to this data requires a data share agreement.

Category 1: “Public information”

WaTech: Data is not covered in any of the above categories or is already released to the public.

SOS: De-identified and aggregated data, such as the number of soil samples and from which counties and crops they were collected, fall under this category. For example, the SOS dashboard is publicly available and the map zoom is disabled at the 1:1,600,000 scale (counties level).

8.2 Maintain confidentiality

Only under special circumstances and with proper justification in the data share agreement will the following Category 3 data be released to external collaborators. Under no circumstances will these data be made publicly available.

  • Farm name
  • Grower first and last name
  • Field names that contain street names or other identifying information
  • Latitude and longitude coordinates or other geospatial identifiers
  • Any information identifying the individual farm or grower

Anonymize and aggregate Category 2 data to honor our data privacy statement by either removing or replacing Category 3 confidential information with dummy data. The {randomNames} R package can be used to replace real names with fake names. Round latitude and longitude to a precision that does not identify the farm or fields sampled.

See the example R script below, copied from the private washi-sos GitHub repository.

R code to anonymize data
# Attach packages ==============================================================

library(dplyr)
library(readxl)
library(janitor)
library(randomNames)

# Load data ====================================================================

farm <- readRDS(
  "data/sos-sample-metadata.RDS"
) |>
  select(
    c(sample_id, farm_name, producer_name, producer_id, field_name, field_id)
  )

results_wide <- readRDS(
  "data/sos-lab-results-wide.RDS"
) |>
  select(-c(
    project_id,
    lab_id:lab_analyzed_date,
    crop_type,
    crop_variety,
    texture_class
  ))

points <- readRDS(
  "data/sos-sample-locations.RDS"
) |>
  select(sample_id, longitude, latitude) |>
  mutate(across(where(is.numeric), \(x) round(x, 0)))

# Join farm info with lab data
results_wide <- left_join(farm, results_wide) |>
  relocate(year, .before = sample_id) |>
  rename(crop = crop_group)

# Join lab data with gis data
data <- left_join(results_wide, points) |>
  remove_empty("cols") |>
  relocate(latitude, longitude, .after = county)

# Remove WSU SCBG samples and 0-6/6-12 in WSDA samples
data <- data |>
  subset(!year %in% c(2020, 2021) &
    !grepl("[A-Z]", field_id)) |>
  remove_empty("cols") |>
  select(-c(pmn_nitrate_n_mg_kg:min_c_24hr_mg_c_kg_day))

# Anonymize
set.seed(123)
anon_data <- data |>
  # Farm name, producer name, sample ID
  mutate(
    farm_name = paste0("Farm ", sprintf("%03d", cur_group_id())),
    producer_name = randomNames(
      which.names = "first",
      sample.with.replacement = FALSE
    ),
    producer_id = paste0(
      paste0(sample(LETTERS, 3, replace = TRUE), collapse = ""),
      "0",
      paste0(sample(1:9, 1, replace = TRUE), collapse = "")
    ),
    field_name = paste0("Field ", field_id),
    sample_id = paste0(
      substr(year, 3, 4),
      "-",
      producer_id,
      "-",
      field_id
    ),
    .by = c(county, producer_name, producer_id)
  ) |>
  # county
  mutate(
    county = paste0("County ", cur_group_id()),
    .by = c(county)
  )

# Grab 100 random samples
anon_subset <- anon_data |>
  slice_sample(n = 100) |>
  arrange(sample_id)

write.csv(
  anon_subset,
  file = "data/example-data.csv",
  na = "",
  row.names = FALSE
)

8.3 Data share agreement

SOS CoPIs created a Data Sharing and Scope of Work Agreement detailing the type of data to be shared, the scope of work in which the data may be used, and terms for using SOS data.

Once both CoPIs and the “Partnering Scientists” sign the agreement, save it in its own subfolder within Y:/NRAS/soil-health-initiative/state-of-the-soils/data-sharing. If this agreement is part of a grant, place a copy in its corresponding grant subfolder within Y:/NRAS/soil-health-initiative/contracts-grants/grants. Include relevant correspondence, code to subset the requested data, and final dataset in the agreement’s subfolder in the data-sharing folder. This documentation allows us to track publications, attributions, and the broader impact of the SOS dataset.

8.4 Public access

Currently, the only publicly available SOS data are the counts of samples across the project, counties, and crop types displayed in the ArcGIS Online Dashboard.

A small, anonymized subset is included as example data in the {washi} and {soils} R packages for demonstration purposes.

When the data are more mature and hosted in a proper database, we may publish an anonymized subset in a public repository such as:

  • GitHub via an R package or Shiny app
  • Zenodo: integrates with GitHub and is citable with a DOI
  • Data.WA.gov: open data portal for the State of Washington

More inspiration for enhancing data discoverability and sharing across the agricultural and soil health communities include:

Chapter 16 of Data Management in Large-Scale Education Research discusses more considerations for data sharing and choosing public repositories (Lewis 2023).

8.5 Acknowledgments

All research and data partially or completely funded by WaSHI must include acknowledgements to the State of Washington. The following text should be included in all publications resulting from this funding:

Data were in part provided by the Washington Soil Initiative, which is supported by the State of Washington and administered by the Washington State Department of Agriculture, Washington State Conservation Commission, and Washington State University.

If WaSHI staff make substantial scientific contributions to the manuscript, discuss the possibility of co-authorship credit.