Using spreadsheet programs for scientific data

Data Carpentry

This lesson was recently cloned and expanded from Data Carpentry's "Better Spreadsheet Skills for Ecologists". Data Carpentry's aim is to teach researchers basic concepts, skills, and tools for working with data so that they can get more done in less time, and with less pain. The lessons below were designed for those interested in working with data in .

Good data organization is the foundation of your research project. Most researchers have data or do data entry in spreadsheets. Spreadsheet programs are very useful graphical interfaces for designing data tables and handling very basic data quality control functions.

Much of your time as a manager or researcher will be spent in this 'data wrangling' stage.

It could be manipulating data to get it in a database, or modifying output from a database to do something. Examples: Data summaries for a project or institutional report, database migration, preparing for analysis in another software, share a temporary list of taxon names for a student working in a collection, or a data list for sharing and visualizing content in your collection.

Regarding data migration: Spreadsheets are very useful to visualize snapshots of your data. Your data workflow for cleaning data will be different based on your data transformation issues, but there are a few first steps to help you organize your process:

  • Understand the data you are merging. Carefully define the fields from the export database and map to the import database.

  • Make a list of the data transformations/data cleaning you need to do. This document is very useful, as most data cleaning efforts occur more than once.

Data cleaning is iterative. Not between datasets, but on the same dataset. This is the reason we suggest repeatable methods for data cleaning.

  • Once you understand the data transformations, than you can discover the best way to make those happen.

If you have a lot to do and its very challenging, it might be time to ask for help or learn another method. If you know what you need to do very specifically, it becomes clearer and easier (cheaper) when you do reach out for help.

  • Delimited text files are used for moving data.

Spreadsheet outline

In this lesson, we’re going to talk about:

  • Good data entry practices - formatting data tables in spreadsheets
  • How to avoid common formatting mistakes
  • Dates as data - beware!
  • Basic quality control and data manipulation in spreadsheets
  • Exporting data from spreadsheets
  • Simple functions
  • Taming the iDigBio Portal

Spreadsheet programs

  • LibreOffice - used in these lessons because it's a free, open source spreadsheet program
  • Microsoft Excel
  • Gnumeric

Commands may differ a bit between programs, but general ideas for thinking about spreadsheets is the same. Group yourselves.

Spreadsheets encompass a lot of the things we need to be able to do as researchers. We can use them for:

  • Data entry
  • Organizing data
  • Subsetting and sorting data
  • Statistics
  • Plotting

In this lesson, we’re going to talk about:

  1. Formatting data in spreadsheets
  2. Formatting problems
  3. Dates as data
  4. Basic quality control and data manipulation in spreadsheets.
  5. Exporting data from spreadsheets.
  6. Data export formats caveats
  7. Functions
  8. Regular Expressions
  9. iDigbio Portal Exercise

Next: Formatting data tables in spreadsheets.