5 things I wish I knew in my first semester of grad school - Data organization
Hi there!
I was recently invited to have a chat with Agronomy graduate students and share some thoughts on data analysis.
I decided to write this post to go along with that conversation, so people have a written resource after the chat, and also so I have a space to flesh out some ideas. This is not intended to be a thorough list of tips and advice, but to reflect some of the most important points I came across in my own grad school experience.
This post will focus on a topic that comes before data analysis, but that can cause you a lot of headaches if it goes without proper planning: data organization.
tl;dr
- Think of and define your main folder structure (site, year, study)
- Avoid file clutter, have a sub-folder organization in place
- Design a file naming scheme that makes sense for your case and makes your life easier in the long-run
- Name your column names to be short, intuitive, and complete
- Use a cloud service to store all your project files
1 Folder organization
The number of data files you collect and produce will only increase over time. Take a minute to think of a structure that works best for you. There is really no right or wrong answer; each case is specific, and you should choose whichever strategy best fits your needs.
Questions to ask yourself:
Should I separate folders by response variable (e.g., yield, pest density, soil nutrients), or by space-time (e.g., name of site, year the studies were conducted)?
If separating by space-time, should I have a separate folder for each site, and have the data for all years of that site in the same folder? (e.g., main folder named Prairie, with subfolders named 2019, 2020, 2021).
Or maybe should I have a folder for each year, and have the data for all sites of that year in the same folder? (e.g., main folder named 2019, with subfolders named Prairie, Farm, Creek).
Does it make sense to have a folder for each separate site-year combination? (e.g., main folder named Prairie-2019).
Or do I need one overall folder with the data from all site-years together? (e.g., main folder named allprojects).
The way you will analyze the data has a big impact on how you set up your folder structure.
If you think you will analyze all data together, like running a multi-site-year ANOVA model, then it makes sense to keep everything in the same folder.
In contrast, if you are only interested in modeling each site-year separately, then having a main folder for each site-year makes more sense.
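To make the site/year layout concrete, here is a minimal sketch in the shell. The site and year names are just the hypothetical examples from the questions above; substitute your own.

```shell
# One main folder per site, with one sub-folder per year
# (Prairie, Farm, Creek and the years are the hypothetical examples from the text)
for site in Prairie Farm Creek; do
  for year in 2019 2020 2021; do
    mkdir -p "$site/$year"
  done
done

# Inspect the structure for one site
ls Prairie
```

Swapping the two loops (years outside, sites inside) gives you the year-first layout instead; the point is to decide on one scheme before the files pile up.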
2 Sub-folder organization
Now that you have defined your folder structure, I urge you to think about how files will live in those folders.
Normally we just drop all files related to a project into that main folder.
We end up with preliminary data, processed data, figures, tables, code, word files, all mixed together. We can do better!
The structure I use that serves me well is to have three main sub-folders:
data: this is where all original and processed data files (e.g., excel, csv, tiff) will live.
code: this is where all code files (e.g., .R, .md, .Rmd) will live.
output: this is where the outputs from your data (e.g., summary tables, figures) will live.
This structure keeps my projects organized and easy to browse.
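Setting this up takes one line per project. In the sketch below, Prairie-2019 stands in for whatever project folder you chose in the previous section:

```shell
# Create the three sub-folders inside a project folder
# (Prairie-2019 is a hypothetical site-year folder name)
mkdir -p Prairie-2019/data Prairie-2019/code Prairie-2019/output

# Verify the layout
ls Prairie-2019
```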
3 File naming
File naming is a big component of research organization that we normally only realize we need once we are already deep into the problem.
Thus, the earlier you can think of it and plan ahead, the more your mid-way-through-grad-school self will thank your just-starting-grad-school self.
Things to consider in your naming scheme include date, location, response variable, and other specific differentiators.
- date: if you are collecting data on multiple dates and each has their own data file. If your collection dates span more than one year, you can start your file name with the format YYYY-MM-DD, e.g., 2021-10-23.
This is the best way to organize multi-year dates, as other formats (e.g., MM-DD, MM-DD-YYYY) will not alphabetically sort the files in the correct chronological order.
- location: if you have different files for different locations, then including the location name is important.
- response variable: if you have a file for each response variable and they are all located in the same sub-folder, then including the response variable in the file name will help you.
- other specific differentiators: as the name suggests, these would be other differentiating variables that are important in your case. Think of any variable that would help you further differentiate two datasets that have the same name up to this point but contain different data. Examples could be soil depth, block number, etc.
An example of file naming could be:
2021-10-23_Prairie_Imagery_morning.tiff
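You can see the chronological-sorting benefit of the YYYY-MM-DD format directly in the shell. The file names below follow the example above, with a couple of extra hypothetical dates; the files themselves are empty placeholders:

```shell
# Create a few empty files named with the YYYY-MM-DD scheme
# (names follow the example above; extra dates are hypothetical)
mkdir -p imagery
touch imagery/2021-10-23_Prairie_Imagery_morning.tiff \
      imagery/2021-03-14_Prairie_Imagery_morning.tiff \
      imagery/2020-11-05_Prairie_Imagery_morning.tiff

# Plain alphabetical listing (what your file browser does by default)
# is now also chronological: the oldest file appears first
ls imagery
```

Try the same exercise with MM-DD-YYYY names and the 2020 file will no longer sort first.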
4 Column naming
Column names in datasets should be the shortest possible yet intuitive and complete (including units!).
Did that sound confusing? Let’s look at an example.
Say I collect soil samples and analyze them for phosphorus (abbreviated as P), and get results back in ppm.
I could name this column in multiple ways, let’s explore a bad and a better example:
soilPhosphorus vs. P_ppm
Notice how the second version is shorter, intuitive, and complete (includes units).
The inclusion of units is VERY important! Especially in the US, data may have been recorded in imperial units, but there is also a chance they were recorded in metric units. Without the unit written explicitly, you are left guessing. Avoid guessing: write your units!
Short, intuitive, and complete column names will be a HUGE benefit when you start analyzing and interpreting your data. Writing out
lm(P_ppm ~ Prate_kgha)
is a lot more efficient and less error-prone than writing out
lm(soilPhosphorus ~ Phosphorusrate)
5 Data storage
Now that you have decided how your folders, sub-folders, and file and column names are organized, let’s talk about where all of that will live.
It is not uncommon to hear about people who had their data files saved only on their laptop, and then the laptop was stolen, broke beyond repair, etc. What a nightmare!
To avoid this happening to you, plan on storing your research files (like, all of them!) in some sort of cloud service.
Universities normally have a contract with a cloud storage service that provides LARGE amounts of storage space. If you don’t want to be “attached” to a service provided by your current university for the long term, there are multiple free options out there (free plans have limited storage, but still a lot, with the option to pay for more capacity), including:
- Google’s Backup and Sync (a.k.a. Google Drive)
- Microsoft’s OneDrive
- Dropbox
- GitHub
GitHub has the added benefit of version control (like track changes, but for your data, code, and other files) and collaborative development (of analysis, writing, etc.). GitHub is not as straightforward as the other options above and has a learning curve, but it is REALLY worth it for your career.
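To demystify the version-control part a bit, here is a minimal sketch using git, the tool underlying GitHub. The project folder, file name, and commit author below are all hypothetical:

```shell
# Put a project folder under version control with git
# (Prairie-2019, soil.csv, and the author details are hypothetical examples)
mkdir -p Prairie-2019/data
cd Prairie-2019
git init -q

# A tiny placeholder data file, using the column names from section 4
echo "P_ppm,Prate_kgha" > data/soil.csv

# Record a snapshot of the file in the project history
git add data/soil.csv
git -c user.name="Student" -c user.email="student@example.com" \
    commit -q -m "Add first soil data file"

# Show the commit history
git log --oneline
```

Every commit is a snapshot you can go back to, which is what makes this “track changes for your whole project.”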
All of these services have a desktop application that you can download that allows you to use, open, edit, and share files as you would normally, and yet always be synced with the cloud.
Anyways, the bottom line is choose one cloud storage service and USE IT, NOW! Don’t wait until you run into a big issue that could compromise your graduate program!
Hope this was helpful!
Let me know if I forgot something important, or if you use any other strategy to keep your data organized.
Cheers!
Leo.