5  Storage & version control

Non-digital data, such as paper forms, must be transcribed or converted to digital file formats and then stored in the WaSHI filing cabinet in the Natural Resources Building in Olympia.

All digital data are stored on the WSDA shared drives and in the other locations listed below.

WSDA shared drives:

Esri products and services:

Database for lab results and management data:

GitHub organizations for code-based projects:

Microsoft Teams for data sharing between WSDA and WSU:

Box.com for external file sharing:

Individual devices (laptop, tablet, phone):

5.1 Backup

Data must be stored in multiple locations. At minimum, data on an individual computer must also be saved on the WSDA shared drive. Backing up data using version control (GitHub) or a cloud service (Microsoft OneDrive or Box.com) is strongly recommended.
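The minimum requirement above (a second copy on the shared drive) can be scripted. The following is a minimal sketch in Python, not an official WaSHI tool: the function names back_up and sha256_of and the paths in the usage comment are hypothetical. It copies a file to a backup folder and compares SHA-256 checksums to confirm the copy matches the original.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and verify that the copy is identical."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also keeps timestamps where possible
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"Backup of {source} does not match the original")
    return target


# Hypothetical usage; both paths are placeholders, not real WaSHI locations:
# back_up(Path("C:/Users/me/Documents/results.csv"), Path("Y:/my-project/backup"))
```

The checksum comparison is an optional safeguard; it catches an interrupted or corrupted copy that would otherwise go unnoticed.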

5.2 Read-only raw data

Always set raw data files, such as lab results or ArcGIS Online exports, as Read-Only to avoid accidental corruption or overwriting. For example, in the lab-data folder, all original data files are set to Read-Only and saved in the raw folder.

Copy the raw data file to the working folder for processing and analyses. Then save the final dataset in the separate clean folder with a descriptive title. Keeping a readme.txt to document processing steps is good practice, as discussed in Section 6.2.1.

Y:/NRAS/soil-health-initiative/state-of-the-soils/2023_sampling/lab-data
├── 2023_data-template-soiltest.xlsx
├── clean
├── qc
├── raw
└── working
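The copy step described above can be sketched in Python. This is an illustration under stated assumptions, not a prescribed workflow: the function name copy_raw_to_working and the file name in the usage comment are hypothetical. The sketch uses shutil.copyfile, which copies file contents only, so the working copy does not inherit the Read-Only setting of the raw file.

```python
import shutil
from pathlib import Path


def copy_raw_to_working(raw_file: Path, working_dir: Path) -> Path:
    """Copy a raw data file into the working folder without touching the original."""
    working_dir.mkdir(parents=True, exist_ok=True)
    target = working_dir / raw_file.name
    if target.exists():
        # Refuse to overwrite work in progress.
        raise FileExistsError(f"{target} already exists; not overwriting")
    # copyfile copies contents only (no permission bits), so the copy stays editable.
    shutil.copyfile(raw_file, target)
    return target


# Hypothetical usage; the file name is a placeholder:
# lab_data = Path("Y:/NRAS/soil-health-initiative/state-of-the-soils/2023_sampling/lab-data")
# copy_raw_to_working(lab_data / "raw" / "example-lab-results.xlsx", lab_data / "working")
```

Raising an error when the target already exists is a deliberate choice here: it prevents a second copy from silently replacing a partially processed working file.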

To set a file as Read-Only: right-click the file > Properties > check the Read-only attribute box > OK.

Screenshot of the above directions to set files to Read-only on a Windows computer.
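When a raw folder holds many files, the same Read-Only attribute can be set from a script instead of through the Properties dialog. The following is a minimal sketch in Python (3.9 or later); the function names set_read_only, is_read_only, and lock_folder are hypothetical. It clears the write-permission bits, which on Windows has the effect of setting the Read-only attribute.

```python
import stat
from pathlib import Path

# All write-permission bits (owner, group, others).
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def set_read_only(path: Path) -> None:
    """Remove write permission from a file, leaving other permission bits unchanged."""
    path.chmod(path.stat().st_mode & ~WRITE_BITS)


def is_read_only(path: Path) -> bool:
    """Return True if the owner write bit is cleared."""
    return not (path.stat().st_mode & stat.S_IWUSR)


def lock_folder(folder: Path) -> list[Path]:
    """Set every file under folder (including subfolders) to read-only."""
    locked = []
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            set_read_only(path)
            locked.append(path)
    return locked


# Hypothetical usage on the raw folder shown above:
# lock_folder(Path("Y:/NRAS/soil-health-initiative/state-of-the-soils/2023_sampling/lab-data/raw"))
```

Read-Only protects against accidental edits, not deliberate ones: anyone with access to the folder can clear the attribute again, so it complements backups rather than replacing them.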

5.3 Version control with Git and GitHub

A version control system records changes to files over time. Git is a free and open-source distributed version control system. GitHub is the hosting site we use to interface with Git. Git and GitHub are fundamental to reproducible statistical and data scientific workflows (Bryan 2018).

Version control ensures changes are documented and previous versions are accessible if changes must be recalled. Additionally, version control enables robust collaboration across projects.

Version control is useful not only for code projects but also for documents, presentations, and books (like this DMP!). Git and GitHub keep the revision history of each committed file, so there is only a single name for each file (e.g., report.docx) instead of report_v01.docx and report_v02.docx. For a reminder on version naming, see Section 3.2.7.

The screenshot below shows who made commits (i.e., named version histories) and when they were made. From this screen, a user can click on the commit message to view all files that were changed.

Screenshot of GitHub commits for the WaSHI DMP.

After clicking the first commit message, a diff (i.e., a visual of what changed) displays the additions to documentation.qmd highlighted in green and deletions highlighted in red.

Screenshot of GitHub commit message titled 'First complete draft of documentation chapter' which shows the changed file with additions highlighted in green and deletions highlighted in red.

Privacy considerations

Review Chapter 8 to categorize the data included in the repository to protect grower privacy. If the data are not anonymized and aggregated, either 1) the repository must be set to private or 2) data files and any scripts containing Category 3 data as described in Section 8.1.0.2 must be added to the .gitignore file.
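Option 2 above can be sketched as a small Python helper that adds patterns to .gitignore without duplicating entries that are already there. This is an illustration only: the function name add_to_gitignore and the example patterns are hypothetical, and which files hold Category 3 data must be decided per project.

```python
from pathlib import Path


def add_to_gitignore(repo_root: Path, patterns: list[str]) -> list[str]:
    """Append patterns missing from repo_root/.gitignore; return the ones added."""
    gitignore = repo_root / ".gitignore"
    text = gitignore.read_text(encoding="utf-8") if gitignore.exists() else ""
    present = {line.strip() for line in text.splitlines()}

    new = []
    for pattern in patterns:
        if pattern not in present:
            new.append(pattern)
            present.add(pattern)

    if new:
        # Start on a fresh line if the existing file lacks a trailing newline.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        with gitignore.open("a", encoding="utf-8") as f:
            f.write(prefix + "\n".join(new) + "\n")
    return new


# Hypothetical usage; the patterns are placeholders for files with Category 3 data:
# add_to_gitignore(Path("."), ["data/raw/", "scripts/grower-contacts.R"])
```

Note that .gitignore only affects files Git is not yet tracking. A file that has already been committed stays in the repository history even after it is added to .gitignore, so sensitive files should be listed before the first commit.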

Git and GitHub resources

Read Jenny Bryan’s article Excuse Me, Do You Have a Moment to Talk About Version Control? (open-access pre-print; full article on WSDA shared drive) for background on Git and GitHub, why we should use them, and how to get started. For detailed instructions, follow along with her free online book Happy Git and GitHub for the useR.

GitHub: A Beginner’s Guide is a helpful resource created by Birds Canada for less advanced programmers. If you prefer to look through slides, see Byron C. Jaeger’s presentation Happier version control with Git and GitHub (and RStudio).