Research workflow - some guidelines from designing a study to publication
Hi there!
I was recently invited to have a chat with Agronomy graduate students and share some thoughts on data analysis.
I decided to write this post to go along with that conversation, so people have a written resource after the chat, and also so I have a space to flesh out some ideas. This is not intended to be a thorough list of tips and advice, but to reflect some of the most important points I came across in my own grad school experience.
The original discussion agenda included the following main topics:
Designing/planning an experiment
Data Management
Data Analysis
Publication
This post will address topics #1, 3, and 4, as topic #2 received its own post here.
Let’s get into it.
1 Designing/planning an experiment
We have two possible scenarios here:
a. Your project was already designed and planned before you joined the lab
In this case, all that is left for you is to learn about the project. If one is available, ask for a copy of the project proposal, as it should include most of the details you need to know.
If a formal proposal is not available/accessible, then ask your advisor to provide you with some information like:
What is the knowledge gap that your research is trying to fill?
What are the hypotheses behind your project?
What are the objectives of the project?
What is the treatment design (the part of the design that addresses the study hypotheses) of your study?
What is the timeline of your project?
What are the expected deliverables of your project?
b. You have a clean slate to design your own project
In this case, to a certain extent, you have the freedom to propose a study and plan it from scratch. How nice!
The timeline here should be something like the following:
Identify the knowledge gap based on the literature.
Define clear hypotheses about your topic.
Define objectives based on your hypotheses.
Think about what treatments you should select to properly test your hypotheses. The answer to this question will make up the treatment design of your study.
Think about what limitations on experimental material you may have. The answer to this question will determine the experimental design of your study.
For a more in-depth discussion on the differences between treatment and experimental design, check this post.
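To make the distinction a bit more concrete, here is a minimal sketch in Python. It assumes a hypothetical treatment design of four nitrogen rates and an experimental design where the limitation on experimental material is handled with three blocks (a randomized complete block design). The treatment names, the number of blocks, and the function name `randomize_rcbd` are all made up for illustration.

```python
import random


def randomize_rcbd(treatments, n_blocks, seed=None):
    """Return one random treatment order per block.

    Every treatment appears exactly once in every block, and the
    randomization is done independently within each block.
    """
    rng = random.Random(seed)  # seeded so the layout can be reproduced
    layout = {}
    for block in range(1, n_blocks + 1):
        order = list(treatments)  # copy so the input list is not modified
        rng.shuffle(order)        # randomize plot order within this block
        layout[block] = order
    return layout


# Hypothetical treatment design: four nitrogen rates
treatments = ["N0", "N60", "N120", "N180"]

# Hypothetical experimental design: three blocks
layout = randomize_rcbd(treatments, n_blocks=3, seed=42)

for block, order in layout.items():
    print(f"Block {block}: {order}")
```

The treatment list comes from your hypotheses, while the blocking (and the restriction that randomization happens within each block) comes from the experimental material. Saving the seed along with the layout lets you recreate the field plan later.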
3 Data Analysis
a. How do we use statistical design and analysis to further understand the story of our data?
Proper statistical design and analysis are critical to understanding your data in the most unbiased way possible.
That is because your treatment design will help you to directly answer your objectives and hypotheses (assuming they were selected properly), while your experimental design will control for experimental material heterogeneity previously identified and incorporated in it (that otherwise would be noise in your data).
Therefore, it is critical that you understand how to translate what happened in the field (related to design) into statistical programming language.
Always remember that the statistical software, whichever you may be using, does not know what design you used, and will ALWAYS give an answer, regardless of how different the model is from what you designed in the field.
Thus, it is up to you (the researcher) to properly translate the field design into statistical programming language.
Wrongly specified models can give you erroneous results, especially standard errors, ANOVA p-values, and pairwise comparison p-values!
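As a small illustration of this point, the sketch below computes the treatment F statistic by hand for the same data under two models: one that ignores blocks (as if the study were a completely randomized design) and one that includes them (a randomized complete block design). The yield values are invented for this example, and the hand calculation only applies to a balanced data set with one observation per treatment per block.

```python
from statistics import mean

# Hypothetical yields: rows are blocks, columns are treatments A, B, C
yields = [
    [5.0, 6.0, 7.0],   # block 1
    [6.0, 7.0, 8.0],   # block 2
    [7.0, 8.0, 9.5],   # block 3
    [8.0, 9.0, 10.0],  # block 4
]

n_blocks = len(yields)
n_trt = len(yields[0])

grand = mean(y for row in yields for y in row)
trt_means = [mean(row[j] for row in yields) for j in range(n_trt)]
block_means = [mean(row) for row in yields]

# Sums of squares for a balanced design
ss_total = sum((y - grand) ** 2 for row in yields for y in row)
ss_trt = n_blocks * sum((m - grand) ** 2 for m in trt_means)
ss_block = n_trt * sum((m - grand) ** 2 for m in block_means)

ms_trt = ss_trt / (n_trt - 1)

# Model 1: blocks ignored, so block variation stays in the error term
ss_err_crd = ss_total - ss_trt
df_err_crd = n_trt * n_blocks - n_trt
f_crd = ms_trt / (ss_err_crd / df_err_crd)

# Model 2: blocks included, matching how the study was laid out
ss_err_rcbd = ss_total - ss_trt - ss_block
df_err_rcbd = (n_trt - 1) * (n_blocks - 1)
f_rcbd = ms_trt / (ss_err_rcbd / df_err_rcbd)

print(f"F for treatment, blocks ignored:  {f_crd:.2f} (error df = {df_err_crd})")
print(f"F for treatment, blocks included: {f_rcbd:.2f} (error df = {df_err_rcbd})")
```

Both models run without complaint, and the software would happily report either one. In this made-up data set most of the variation is between blocks, so leaving blocks out inflates the error term and shrinks the F statistic for treatment.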
b. Can you go ‘too far’ in statistical analysis when trying to understand data/story from the data?
Yes. That has been referred to in the literature as “p-hacking” or “researcher degrees of freedom”. For a thorough read on this, I refer you to
Gelman, A. & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time.
which can be read here.
In a nutshell, we should strive to define protocols for data cleaning/processing, types of statistical analysis, model specifications (based on your study design), alpha values, adjustment choices, and anything else data-analysis-related AHEAD OF TIME.
If you end up changing plans as you progress in your analysis, ask yourself “Is this new choice simply to make my results significant?” Hopefully you will have a good reason that is not simply to make your results “better”, whatever that may mean.
Also, defining clear hypotheses and objectives early in the process will help you start off in the right direction, instead of “shooting” in multiple directions that could end up answering a different objective.
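One way to see why trying many analysis options is risky is to look at how fast the chance of at least one false positive grows with the number of tests. The sketch below uses the standard formula 1 - (1 - alpha)^k, which assumes the k tests are independent (real analysis choices on the same data usually are not, so treat the numbers as a rough guide). The function names are my own, and the Bonferroni adjustment is shown only as one simple example of an adjustment choice.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test alpha under a Bonferroni adjustment."""
    return alpha / n_tests


alpha = 0.05
for k in (1, 5, 10, 20):
    fwer = familywise_error_rate(alpha, k)
    adjusted = bonferroni_alpha(alpha, k)
    print(f"{k:>2} tests: familywise error = {fwer:.3f}, "
          f"Bonferroni per-test alpha = {adjusted:.4f}")
```

Deciding ahead of time how many comparisons you will make, and how you will adjust for them, keeps this number under your control instead of letting it grow with every new thing you try.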
c. How to interpret your analysis?
Analysis results should always be interpreted based on a statistical test, and better yet if accompanied by some measure of variability.
Often, model means (in the case of ANOVAs) and coefficients (in the case of regressions) are presented alone, followed only by a significance test result (like a letter or asterisk indicating significance at a certain alpha value).
While this is ok, your results will be much better presented if you include standard errors, confidence intervals, etc.
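As a rough sketch of what those extra numbers look like, the snippet below computes a mean, its standard error, and a 95% confidence interval for one treatment from its raw replicate values. The yield values are invented, the function name `summarize` is my own, and the small table of t critical values only covers 3 to 6 replicates. In a real analysis these quantities would normally come from your fitted model (which pools the error across treatments) rather than from each treatment separately.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% t critical values, keyed by degrees of freedom (n - 1)
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}


def summarize(values):
    """Return the mean, standard error, and 95% CI of a list of replicates."""
    n = len(values)
    m = mean(values)
    se = stdev(values) / sqrt(n)  # sample standard deviation / sqrt(n)
    t = T_CRIT_95[n - 1]          # raises KeyError if n is not in the table
    return m, se, (m - t * se, m + t * se)


# Hypothetical yields for one treatment, four replicates
rep_yields = [8.1, 7.6, 8.4, 7.9]

m, se, (lower, upper) = summarize(rep_yields)
print(f"mean = {m:.2f}, SE = {se:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Reporting the interval next to the mean tells the reader not only whether treatments differ, but also how precisely each mean was estimated.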
Also, plots that include both analysis results (e.g., mean, standard error, letter) AND raw data are the best option for communicating your data clearly. Plot types that can be used here include boxplots, violin plots, density plots, bee-swarm plots, and others.
For some more resources on this, just google “ggplot density plot” (as an example) and you will have a plethora of cases to use.
4 Publication
For me, the best publications (as related to data/analysis) are the ones that
- Have clear objectives
- Present figures/tables that answer the objectives
- Order figures/tables in a flow that makes sense to the reader, especially if they are connected or interdependent
I often see/review papers that show a figure in the results that could be removed entirely without losing anything needed to address the objectives.
We can ask ourselves “Does this figure/table directly or indirectly move the paper narrative closer to answering an objective?”. If the answer is “No”, just remove it altogether, or place it in Supplementary Materials.
Sometimes the analyses and their related figures in a paper follow a flow. For example, you first perform an ANOVA to understand which treatments were significantly greater (in yield, for example), and then run a regression that only includes those high-yielding treatments. In such cases, where the results of one analysis feed into the next, presenting the figures in the order the analyses were run, using interconnected color schemes, and keeping treatment names consistent make a big impact on the flow of the paper, and thus on how easily a reader can interpret it.
Those were my thoughts!
Hope you enjoyed the read, and let me know if you have any other strategies or thoughts related to this topic!
Thanks, see you next time!
Leo.