8 Data sharing
Our data sharing policies promote FAIR principles so our data are “as open as possible, as closed as necessary” (European Commission 2016). Data should be open enough to facilitate efficient re-use; avoid duplicating data collection efforts; enhance scholarly rigor; and promote engagement across the research and public communities (Whyte and Pryor 2011). However, data must also be closed enough to protect grower privacy and to honor agreements with growers and other researchers (Czarnecki and Jones 2022; Korzekwa 2023).
The SOS relies on growers’ willingness to volunteer their fields for sampling and participate in the required management survey. Their willingness depends on their trust in WaSHI to protect their privacy. Only aggregated and anonymized results will be publicly available or shared. The below data privacy statement may be shared with potential participants.
Data privacy statement
All survey responses will be combined across all respondents. Results will not be reported in a way that makes individuals identifiable. Information collected in this survey are subject to release in accordance with RCW 42.56 (Public Records Act).
Procedures for anonymizing data are detailed in Section 8.2.
8.1 WaTech data categorization
Under Washington State Policy 141.10 (Securing Information Technology Assets), state agencies must classify data into categories based on the sensitivity of the data. WaTech provides guidance on the four categories of data.
Category 4: “Confidential information requiring special handling”
WaTech: Data requires strict handling requirements applied by statues or regulations.
SOS: Not applicable.
Category 3: “Confidential information”
WaTech: Data includes “personal information” as defined in RCW 42.56.590 (Security Breaches) and RCW 19.255.010 (Personal Information Disclosure). An individual’s first name or first initial and last name in combination with at least one of the following elements: social security number, driver’s license or Washington identification card number, or any account numbers that permit access to their financial account.
SOS: We do not collect the above elements in combination with grower names. However, we treat individual names, farm names, and latitude and longitude coordinates as confidential information.
Category 2: “Sensitive information”
WaTech: Data are intended for official use only and withheld unless specifically requested.
SOS: This category includes lab results and management surveys. Access to this data requires a data share agreement.
Category 1: “Public information”
WaTech: Data is not covered in any of the above categories or is already released to the public.
SOS: De-identified and aggregated data, such as the number of soil samples and from which counties and crops they were collected, fall under this category. For example, the SOS dashboard is publicly available and the map zoom is disabled at the 1:1,600,000 scale (counties level).
8.2 Maintain confidentiality
Only under special circumstances and with proper justification in the data share agreement will the following Category 3 data be released to external collaborators. Under no circumstances will these data be made publicly available.
- Farm name
- Grower first and last name
- Field names that contain street names or other identifying information
- Latitude and longitude coordinates or other geospatial identifiers
- Any information identifying the individual farm or grower
Anonymize and aggregate Category 2 data to honor our data privacy statement by either removing or replacing Category 3 confidential information with dummy data. The {randomNames} R package can be used to replace real names with fake names. Round latitude and longitude to a precision that does not identify the farm or fields sampled.
See the example R script below, copied from the private washi-sos
GitHub repository.
R code to anonymize data
# Attach packages ==============================================================
library(dplyr)
library(readxl)
library(janitor)
library(randomNames)
# Load data ====================================================================
<- readRDS(
farm "data/sos-sample-metadata.RDS"
|>
) select(
c(sample_id, farm_name, producer_name, producer_id, field_name, field_id)
)
<- readRDS(
results_wide "data/sos-lab-results-wide.RDS"
|>
) select(-c(
project_id,:lab_analyzed_date,
lab_id
crop_type,
crop_variety,
texture_class
))
<- readRDS(
points "data/sos-sample-locations.RDS"
|>
) select(sample_id, longitude, latitude) |>
mutate(across(where(is.numeric), \(x) round(x, 0)))
# Join farm info with lab data
<- left_join(farm, results_wide) |>
results_wide relocate(year, .before = sample_id) |>
rename(crop = crop_group)
# Join lab data with gis data
<- left_join(results_wide, points) |>
data remove_empty("cols") |>
relocate(latitude, longitude, .after = county)
# Remove WSU SCBG samples and 0-6/6-12 in WSDA samples
<- data |>
data subset(!year %in% c(2020, 2021) &
!grepl("[A-Z]", field_id)) |>
remove_empty("cols") |>
select(-c(pmn_nitrate_n_mg_kg:min_c_24hr_mg_c_kg_day))
# Anonymize
set.seed(123)
<- data |>
anon_data # Farm name, producer name, sample ID
mutate(
farm_name = paste0("Farm ", sprintf("%03d", cur_group_id())),
producer_name = randomNames(
which.names = "first",
sample.with.replacement = FALSE
),producer_id = paste0(
paste0(sample(LETTERS, 3, replace = TRUE), collapse = ""),
"0",
paste0(sample(1:9, 1, replace = TRUE), collapse = "")
),field_name = paste0("Field ", field_id),
sample_id = paste0(
substr(year, 3, 4),
"-",
producer_id,"-",
field_id
),.by = c(county, producer_name, producer_id)
|>
) # county
mutate(
county = paste0("County ", cur_group_id()),
.by = c(county)
)
# Grab 100 random samples
<- anon_data |>
anon_subset slice_sample(n = 100) |>
arrange(sample_id)
write.csv(
anon_subset,file = "data/example-data.csv",
na = "",
row.names = FALSE
)
8.4 Public access
Currently, the only publicly available SOS data are the counts of samples across the project, counties, and crop types displayed in the ArcGIS Online Dashboard.
A small, anonymized subset is included as example data in the {washi} and {soils} R packages for demonstration purposes.
When the data are more mature and hosted in a proper database, we may publish an anonymized subset in a public repository such as:
- GitHub via an R package or Shiny app
- Zenodo: integrates with GitHub and is citable with a DOI
- Data.WA.gov: open data portal for the State of Washington
More inspiration for enhancing data discoverability and sharing across the agricultural and soil health communities include:
Chapter 16 of Data Management in Large-Scale Education Research discusses more considerations for data sharing and choosing public repositories (Lewis 2023).
8.5 Acknowledgments
All research and data partially or completely funded by WaSHI must include acknowledgements to the State of Washington. The following text should be included in all publications resulting from this funding:
Data were in part provided by the Washington Soil Initiative, which is supported by the State of Washington and administered by the Washington State Department of Agriculture, Washington State Conservation Commission, and Washington State University.
If WaSHI staff make substantial scientific contributions to the manuscript, discuss the possibility of co-authorship credit.