Using the apastats package to write reproducible reports
Overview
Writing a research report in RMarkdown allows you to start your analysis even before you have collected the data. This is especially helpful for reports and theses: while you, a web crawler, or a panel are still collecting data, you can already start writing. Writing results before having the data is somewhat challenging, though, and writing a discussion at that stage is a lot harder.
We are going to need two things:
- a package that formats output the way we need it, in our case apastats
- string manipulation: stringr and glue
Installing apastats
You can install the apastats package using the following command:
devtools::install_github('achetverikov/apastats',subdir='apastats')
Now, we can start. Let’s assume we have already collected the following data about iris flowers 😉.
library(dplyr)  # provides %>%, sample_n(), filter(), pull()
library(knitr)  # provides kable()
df <- iris %>% sample_n(5)
df %>% kable()
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.8 | 2.7 | 4.1 | 1.0 | versicolor |
6.4 | 2.8 | 5.6 | 2.1 | virginica |
4.4 | 3.2 | 1.3 | 0.2 | setosa |
4.3 | 3.0 | 1.1 | 0.1 | setosa |
7.0 | 3.2 | 4.7 | 1.4 | versicolor |
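Because sample_n() draws rows at random, the five rows shown above change on every knit. For a reproducible report it can help to fix the random seed first. The following is a minimal sketch in base R (so it runs without dplyr); the seed value 42 and the names df_a and df_b are arbitrary choices for illustration, not part of the original analysis.

```r
# Fix the random seed so the same rows are drawn on every run.
set.seed(42)                              # arbitrary seed, chosen for illustration
df_a <- iris[sample(nrow(iris), 5), ]     # base-R analogue of sample_n(iris, 5)

# Resetting the seed and sampling again reproduces exactly the same rows.
set.seed(42)
df_b <- iris[sample(nrow(iris), 5), ]

identical(df_a, df_b)
```

With dplyr, calling set.seed() immediately before sample_n() should have the same effect.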
The analysis
Deciding on our hypothesis
We have found in the literature that the setosa species has a rather small petal length of only about 3-4 cm. Since we are sampling random flowers, we want to test whether the mean petal length (PL) of the setosa flowers in our sample is also smaller than, let us say, 5 cm. For this directional question the null hypothesis would be \(H_0: PL \geq 5\), with the alternative \(H_1: PL < 5\).
Selecting a test
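We would test this using a one-sample t-test on the setosa subset of the data. Note that t.test() defaults to a two-sided alternative, so the call below actually tests whether the mean differs from 5 in either direction; a directional test would require the argument alternative = "less".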
res <- df %>%
  filter(Species == "setosa") %>%
  pull(Petal.Length) %>%
  t.test(mu = 5)
res
##
## One Sample t-test
##
## data: .
## t = -38, df = 1, p-value = 0.01675
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
## -0.07062047 2.47062047
## sample estimates:
## mean of x
## 1.2
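To see where these numbers come from, the statistic can be recomputed by hand from the two setosa petal lengths in the sample table (1.3 and 1.1). This sketch uses only base R; the variable names are chosen for illustration.

```r
# Petal lengths of the two setosa flowers in the five-row sample
x   <- c(1.3, 1.1)
mu0 <- 5                                  # value we test against

n      <- length(x)
se     <- sd(x) / sqrt(n)                 # standard error of the mean
t_stat <- (mean(x) - mu0) / se            # one-sample t statistic
df_t   <- n - 1                           # degrees of freedom

# Two-sided p-value and 95% confidence interval for the mean
p_two <- 2 * pt(-abs(t_stat), df = df_t)
ci    <- mean(x) + c(-1, 1) * qt(0.975, df = df_t) * se

round(c(t = t_stat, df = df_t, p = p_two), 5)
round(ci, 4)
```

These values agree with the t.test output above: t = -38 with df = 1 and a p-value of about 0.0168. With only two observations the confidence interval is very wide, which is one more reason not to read too much into this preliminary result.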
Well… This is not our final data yet, so we should not be too happy about this result already. It also looks nothing like a report. How can we write up the test and its interpretation so that the text incorporates the results?
Using apastats to describe
The describe.ttest function returns correctly formatted markdown for reporting a t-test, so the following always generates output in the right format.
tresult <- apastats::describe.ttest(res)
tresult
## [1] "_t_(1.0) = -38.00, _p_ = .017"
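To make the format of this string concrete, here is a rough base-R sketch that produces the same kind of text from a t.test result. The helper format_ttest is hypothetical and is not how apastats implements describe.ttest; it only mimics the output format shown above.

```r
# Hypothetical helper that mimics the output format shown above.
# It is NOT the apastats implementation, only a base-R illustration.
format_ttest <- function(res) {
  p <- res$p.value
  # APA style drops the leading zero and reports very small p as "< .001"
  p_txt <- if (p < .001) {
    "< .001"
  } else {
    paste("=", sub("^0", "", sprintf("%.3f", p)))
  }
  sprintf("_t_(%.1f) = %.2f, _p_ %s",
          unname(res$parameter), unname(res$statistic), p_txt)
}

res_demo <- t.test(c(1.3, 1.1), mu = 5)   # the two setosa values from the sample
format_ttest(res_demo)
```

The underscores are markdown emphasis markers, so t and p are rendered in italics once the string is knitted into the report.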
Meh. This is not really helpful either, right? We must include the text as an inline result using `r tresult`. Then we can say:
Setosa iris have petals whose length is statistically significantly different from 5 (t(1.0) = -38.00, p = .017).
Preparing for both outcomes
This is nice, but we don’t know whether our sample of \(n=2\) actually represents the population. Therefore, we must prepare for both outcomes. Here the glue package helps by filling in the blanks in pre-written text.
library(glue)
# significance level used to test the p-value
alpha_level <- .05
# default is not significant
have_txt <- "do not have"
if (res$p.value < alpha_level) {
  have_txt <- "have"
}
# the leading "> " renders the phrase as a markdown block quote
phrase <- glue("> Setosa iris {have_txt} petals whose length is ",
               "statistically significantly different from 5 ({tresult}).")
This we can now output using `r phrase`.
Setosa iris have petals whose length is statistically significantly different from 5 (t(1.0) = -38.00, p = .017).
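The if-block above can be wrapped in a small function so that the same logic is reused for every test in the report. This is a sketch in base R (sprintf instead of glue, so it runs without extra packages); the function name significance_phrase and its arguments are hypothetical, and "RESULT" is a placeholder for the string returned by describe.ttest.

```r
# Hypothetical helper: pick the wording from the p-value and fill in the result text.
significance_phrase <- function(p_value, result_txt, alpha_level = .05) {
  have_txt <- if (p_value < alpha_level) "have" else "do not have"
  sprintf(
    "> Setosa iris %s petals whose length is statistically significantly different from 5 (%s).",
    have_txt, result_txt
  )
}

# Both outcomes can be previewed before the final data are in.
# "RESULT" is a placeholder, not a real test result.
significance_phrase(0.017, "RESULT")
significance_phrase(0.250, "RESULT")
```

Previewing both branches this way makes it easier to check that each version of the sentence reads well before the real p-value decides which one appears.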
Collecting the final data and running our analysis
Now we can run the full script again, this time with the full data set. I have omitted unnecessary output in the following code chunk and construct the output phrase directly.
# run the test
res <- iris %>%
  filter(Species == "setosa") %>%
  pull(Petal.Length) %>%
  t.test(mu = 5)
# describe results
tresult <- apastats::describe.ttest(res)
# generate phrase
have_txt <- "do not have"
if (res$p.value < alpha_level) {
  have_txt <- "have"
}
# you can use markdown here as well
phrase <- glue("> Setosa iris {have_txt} petals whose length is ",
               "statistically significantly different from 5 ({tresult}).")
The final outcome of our analysis now is:
Setosa iris have petals whose length is statistically significantly different from 5 (t(49.0) = -144.06, p < .001).
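The hypothesis stated earlier was directional (petal length smaller than 5), whereas t.test() run with its defaults is two-sided. As a sketch, assuming a one-sided test is what is wanted, the alternative argument of t.test() can be set to "less". The variable names here are illustrative.

```r
# Setosa petal lengths from the full iris data set (base R, no extra packages)
x <- iris$Petal.Length[iris$Species == "setosa"]

res_two  <- t.test(x, mu = 5)                         # default: two-sided
res_less <- t.test(x, mu = 5, alternative = "less")   # H1: true mean < 5

res_less$alternative
res_less$p.value < .001
```

The t statistic and the degrees of freedom are the same in both calls; only the p-value and the confidence interval differ, because the one-sided test puts the whole rejection region in the lower tail.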