---
title: "desctable usage vignette"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{desctable usage}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = F, message = F, warning = F}
library(pander)
library(DT)
library(desctable)
options(DT.options = list(#scrollX = T,
                          info = F,
                          search = F,
                          dom = "Brtip",
                          fixedColumns = T))
knitr::opts_chunk$set(message = F, warning = F, screenshot.force = F)
```

Desctable is a comprehensive descriptive and comparative tables generator for R.

Every person doing data analysis has to create tables for descriptive summaries of data (a.k.a. Table 1), or comparative tables.

Many packages, such as the aptly named **tableone**, address this issue. However, they often include hard-coded behaviors, produce outputs that are not easily manipulated with standard R tools, or use a dated syntax (e.g. an argument order that makes them difficult to use with the pipe (`%>%`)).

Enter **desctable**, a package built with the following objectives in mind:

* generate descriptive and comparative statistics tables with nesting
* keep the syntax as simple as possible
* have reasonable defaults
* be entirely customizable, using standard R tools and functions
* produce the simplest (as a data structure) output possible
* provide helpers for different outputs
* integrate with "modern" R usage, and the **tidyverse** set of tools
* apply functional paradigms

----

# Descriptive tables

## Simple usage

**desctable** uses and exports the pipe (`%>%`) operator (of **magrittr** and **dplyr** fame), though it is not mandatory to use it.

The single interface to the package is its eponymous `desctable` function.

When used on a data.frame, it returns a descriptive table:

```{r}
iris %>%
  desctable

desctable(mtcars)
```

As you can see with these two examples, `desctable` describes every variable, with individual levels for factors. It picks statistical functions depending on the type and distribution of the variables in the data, and applies those statistical functions only to the relevant variables.


## Output

The object produced by `desctable` is in fact a list of data.frames, with a "desctable" class.

Methods for reduction to a simple dataframe (`as.data.frame`, automatically used for printing), conversion to markdown (`pander`), and interactive html output with **DT** (`datatable`) are provided:

```{r}
iris %>%
  desctable %>%
  pander

mtcars %>%
  desctable %>%
  datatable
```
<br>
You need to load these two packages first (and prior to **desctable** for **DT**) if you want to use them.

Calls to `pander` and `datatable` with "regular" dataframes will not be affected by the defaults used in the package, and you can modify these defaults for **desctable** objects.

Subsequent outputs in this vignette will use **DT**. The `datatable` wrapper function for desctable objects comes with some default options and formatting, such as freezing the row names and table header, export buttons, and rounding of values. Both the `pander` and `datatable` wrappers take a *digits* argument to set the number of decimals to show. (`pander` uses the *digits*, *justify* and *missing* arguments of `pandoc.table`, whereas `datatable` calls `prettyNum` with the `digits` parameter, and removes `NA` values. You can set `digits = NULL` if you want the full table and format it yourself.)


## Advanced usage

`desctable` chooses statistical functions for you using this algorithm:

* always show N
* if there are factors, show %
* if there are normally distributed variables, show Mean and SD
* if there are non-normally distributed variables, show Median and IQR

For each variable in the table, it computes the relevant statistical functions in that list (non-applicable functions will safely return `NA`).

How does it work, and how can you adapt this behavior to your needs?

`desctable` takes an optional *stats* argument. This argument can either be:

* an automatic function to select appropriate statistical functions
* or a named list of
    * statistical functions
    * formulas describing conditions to use a statistical function.


### Automatic function

This is the default, using the `stats_auto` function provided in the package.

Several "automatic statistical functions" are defined in this package: `stats_auto`, `stats_default`, `stats_normal`, `stats_nonnormal`.

You can also provide your own automatic function, which needs to

* accept a dataframe as its argument (whether to use this dataframe or not in the function is your choice), and
* return a named list of statistical functions to use, as defined in the subsequent paragraphs.

```{r}
# Strictly equivalent to iris %>% desctable %>% datatable
iris %>%
  desctable(stats = stats_auto) %>%
  datatable
```
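
As a minimal sketch of a custom automatic function, the following builds the named list from the column types of the data. The name `my_stats_auto` is hypothetical and the selection rules are only an illustration, not what `stats_auto` does. `percent` comes from **desctable** and is merely referenced inside a formula here, so it is not evaluated when the list is built.

```{r}
# Hypothetical automatic function: choose statistics from the column types.
# Takes a dataframe, returns a named list of functions and conditional formulas.
my_stats_auto <- function(data) {
  has_factor  <- any(vapply(data, is.factor, logical(1)))
  has_numeric <- any(vapply(data, is.numeric, logical(1)))

  stats <- list("N" = length)
  # `percent` is provided by desctable; it is not evaluated at this point
  if (has_factor) stats[["%"]] <- is.factor ~ percent
  if (has_numeric) {
    stats[["Mean"]] <- is.numeric ~ mean
    stats[["SD"]]   <- is.numeric ~ sd
  }
  stats
}

names(my_stats_auto(iris))    # "N" "%" "Mean" "SD"
names(my_stats_auto(mtcars))  # "N" "Mean" "SD"
```

Under these assumptions, such a function could then be passed as `desctable(stats = my_stats_auto)`.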

### Statistical functions

Statistical functions can be any function defined in R that you want to use, such as `length` or `mean`.

The only condition is that they return a single numerical value. One exception is when they return a vector of length `1 + nlevels(x)` when applied to factors, as is needed for the `percent` function.

As mentioned above, they need to be used inside a named list, such as

```{r}
mtcars %>%
  desctable(stats = list("N" = length, "Mean" = mean, "SD" = sd)) %>%
  datatable
```
<br>
The names will be used as column headers in the resulting table, and the functions will be applied safely on the variables (errors return `NA`, and for factors the function will be used on individual levels).

Several convenience functions are included in this package. For statistical functions, we have `percent`, which prints percentages of levels in a factor, and `IQR`, which re-implements `stats::IQR` but works better with `NA` values.

Be aware that **all functions will be used on variables stripped of their `NA` values!**
This is necessary for most statistical functions to be useful, and makes **N** (`length`) show only the number of non-missing observations for each variable.
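
To illustrate both points, here is a small sketch of a custom statistical function returning a single numerical value, applied by hand to a vector stripped of its `NA` values. The function name `cv` (coefficient of variation) is hypothetical, and the manual `NA` removal only mimics the behavior described above, not the package internals.

```{r}
# Hypothetical statistic: coefficient of variation, a single numerical value
cv <- function(x) sd(x) / mean(x)

x <- c(2, 4, NA, 6)
x_clean <- x[!is.na(x)]   # what a statistical function would receive

length(x_clean)           # 3: "N" counts non-missing values only
cv(x_clean)               # 0.5
```

Such a function could then be used like any other, for instance as `desctable(stats = list("N" = length, "CV" = cv))`.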

### Conditional formulas

The general form of these formulas is

```{r, eval = F}
predicate_function ~ stat_function_if_TRUE | stat_function_if_FALSE
```

A predicate function is any function returning either `TRUE` or `FALSE` when applied on a vector, such as `is.factor`, `is.numeric`, and `is.logical`.
**desctable** provides the `is.normal` function to test for normality (it is equivalent to `length(na.omit(x)) > 30 & shapiro.test(x)$p.value > .1`).

The *FALSE* option can be omitted, in which case `NA` is produced if the condition in the predicate is not met.

These statements can be nested using parentheses.
For example:

`is.factor ~ percent | (is.normal ~ mean)`

will use `percent` if the variable is a factor; otherwise it will use `mean` if the variable is normally distributed, and produce `NA` if it is not.
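
One way to read a nested formula is as a chain of `if`/`else` statements. The sketch below spells out the analogous formula `is.numeric ~ mean | (is.factor ~ nlevels)` by hand, using only base R functions. `pick_stat` is a hypothetical name, and this illustrates the semantics described above rather than how **desctable** evaluates formulas.

```{r}
# Hand-written equivalent of: is.numeric ~ mean | (is.factor ~ nlevels)
pick_stat <- function(x) {
  if (is.numeric(x)) {
    mean(x)        # predicate TRUE: left-hand statistic
  } else if (is.factor(x)) {
    nlevels(x)     # nested formula, predicate TRUE
  } else {
    NA             # nested formula without a FALSE option
  }
}

pick_stat(c(1, 2, 3))     # 2
pick_stat(iris$Species)   # 3
pick_stat(letters)        # NA
```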

You can mix "bare" statistical functions and formulas in the list defining the statistics you want to use in your table.

```{r}
iris %>%
  desctable(stats = list("N"      = length,
                         "%/Mean" = is.factor ~ percent | (is.normal ~ mean),
                         "Median" = is.normal ~ NA | median)) %>%
  datatable
```
<br>
For reference, here is the body of the `stats_auto` function in the package:

```{r, echo = F}
print(stats_auto)
```


### Labels

It is often the case that variable names are not "pretty" enough to be used as-is in a table.
Although you could still edit the variable labels in the table afterwards using subsetting or string replacement functions, you can also supply a **labels** argument.

The **labels** argument is a named character vector associating variable names and labels.
You don't need to provide labels for all the variables, and extra labels will be silently discarded. This allows you to define a "global" labels vector and use it for every table, even after variable selections.

```{r}
mtlabels <- c(mpg  = "Miles/(US) gallon",
              cyl  = "Number of cylinders",
              disp = "Displacement (cu.in.)",
              hp   = "Gross horsepower",
              drat = "Rear axle ratio",
              wt   = "Weight (1000 lbs)",
              qsec = "¼ mile time",
              vs   = "V/S",
              am   = "Transmission",
              gear = "Number of forward gears",
              carb = "Number of carburetors")

mtcars %>%
  dplyr::mutate(am = factor(am, labels = c("Automatic", "Manual"))) %>%
  desctable(labels = mtlabels) %>%
  datatable
```
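
The lookup behavior of the **labels** argument (variables without a label keep their name, extra labels are ignored) can be pictured with plain named-vector indexing. This is only a sketch of the idea, with a hypothetical helper `apply_labels` and made-up example labels; it is not the code used by `desctable`.

```{r}
# Hypothetical helper: replace names that have a label, keep the others as-is
apply_labels <- function(vars, labels) {
  ifelse(vars %in% names(labels), labels[vars], vars)
}

some_labels <- c(mpg = "Miles/(US) gallon",
                 hp  = "Gross horsepower",
                 foo = "Not a column")   # extra label, never used

apply_labels(c("mpg", "cyl", "hp"), some_labels)
# "Miles/(US) gallon" "cyl" "Gross horsepower"
```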
<br>

----

# Comparative tables

## Simple usage

Creating a comparative table (between groups defined by a factor) using `desctable` is as easy as creating a descriptive table.

It uses the well-known `group_by` function from **dplyr**:

```{r}
iris %>%
  group_by(Species) %>%
  desctable -> iris_by_Species

iris_by_Species
```

The result is a table containing a descriptive subtable for each level of the grouping factor (the statistical function rules are applied to each subtable independently), with the statistical tests performed, and their p values.

When displayed as a flat dataframe, the grouping header appears in the name of each variable.

You can also see the grouping headers by inspecting the resulting object, which is a deep list of dataframes, each dataframe named after the grouping factor and its levels (with sample size for each).

```{r}
str(iris_by_Species)
```

You can specify groups based on any variable, not only factors:

```{r}
# With pander output
mtcars %>%
  group_by(cyl) %>%
  desctable %>%
  pander
```

Groups can also be defined by conditions:

```{r}
# With datatable output
iris %>%
  group_by(Petal.Length > 5) %>%
  desctable %>%
  datatable
```
<br>
And even on multiple nested groups:

```{r, message = F, warning = F}
mtcars %>%
  dplyr::mutate(am = factor(am, labels = c("Automatic", "Manual"))) %>%
  group_by(vs, am, cyl) %>%
  desctable %>%
  datatable
```
<br>
In the case of nested groups (a.k.a. sub-group analysis), statistical tests are performed only between the groups of the deepest grouping level.

Statistical tests are automatically selected depending on the data and the grouping factor.


## Advanced usage

`desctable` chooses the statistical tests using the following algorithm:

* if the variable is a factor, use `fisher.test`
* if the grouping factor has only one level, use the provided `no.test` (which does nothing)
* if the grouping factor has two levels
    * and the variable presents homoskedasticity (p value for `var.test` > .1) and normality of distribution in both groups, use `t.test(var.equal = T)`
    * and the variable does not present homoskedasticity (p value for `var.test` < .1) but normality of distribution in both groups, use `t.test(var.equal = F)`
    * else use `wilcox.test`
* if the grouping factor has more than two levels
    * and the variable presents homoskedasticity (p value for `bartlett.test` > .1) and normality of distribution in all groups, use `oneway.test(var.equal = T)`
    * and the variable does not present homoskedasticity (p value for `bartlett.test` < .1) but normality of distribution in all groups, use `oneway.test(var.equal = F)`
    * else use `kruskal.test`
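
The decision rules above can be restated as a short base R function that returns the name of the selected test. This is an illustrative sketch only: `is_normal_sketch` and `choose_test_sketch` are hypothetical names, the normality rule is the one quoted earlier for `is.normal`, and none of this is the package's own `tests_auto` code.

```{r}
# Normality rule quoted earlier in this vignette for `is.normal`
is_normal_sketch <- function(x) {
  length(na.omit(x)) > 30 & shapiro.test(x)$p.value > .1
}

# Hypothetical helper: name of the test the rules above would select
choose_test_sketch <- function(x, g) {
  g <- factor(g)
  if (is.factor(x)) return("fisher.test")
  if (nlevels(g) == 1) return("no.test")

  two_groups <- nlevels(g) == 2
  # Without normality in every group, fall back to a non-parametric test
  if (!all(tapply(x, g, is_normal_sketch))) {
    return(if (two_groups) "wilcox.test" else "kruskal.test")
  }

  # Homoskedasticity: var.test for two groups, bartlett.test otherwise
  p_var <- if (two_groups) var.test(x ~ g)$p.value else bartlett.test(x ~ g)$p.value
  paste0(if (two_groups) "t.test" else "oneway.test",
         if (p_var > .1) "(var.equal = T)" else "(var.equal = F)")
}

choose_test_sketch(iris$Species, iris$Petal.Length > 5)  # "fisher.test"
choose_test_sketch(mtcars$mpg, mtcars$am)                # "wilcox.test" (groups of 19 and 13)
choose_test_sketch(iris$Sepal.Length, iris$Species)
```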

But what if you want to pick a specific test for a specific variable, or change all the tests altogether?

`desctable` takes an optional *tests* argument. This argument can either be

* an automatic function to select appropriate statistical test functions
* or a named list of statistical test functions

### Automatic function

This is the default, using the `tests_auto` function provided in the package.

You can also provide your own automatic function, which needs to

* accept a variable and a grouping factor as its arguments, and
* return a single-term formula containing a statistical test function.

This function will be used on every variable and every grouping factor to determine the appropriate test.

```{r}
# Strictly equivalent to iris %>% group_by(Species) %>% desctable %>% datatable
iris %>%
  group_by(Species) %>%
  desctable(tests = tests_auto) %>%
  datatable
```
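
A custom automatic function meeting these two requirements might look like the sketch below, which only ever selects non-parametric tests. The name `my_tests_auto` and its rules are hypothetical and are not taken from the package.

```{r}
# Hypothetical automatic function: non-parametric tests only.
# Takes a variable and a grouping factor, returns a single-term formula.
my_tests_auto <- function(var, grp) {
  if (is.factor(var)) {
    ~fisher.test
  } else if (nlevels(factor(grp)) > 2) {
    ~kruskal.test
  } else {
    ~wilcox.test
  }
}

my_tests_auto(iris$Sepal.Length, iris$Species)  # three groups: ~kruskal.test
my_tests_auto(mtcars$mpg, mtcars$am)            # two groups: ~wilcox.test
```

Under these assumptions it could be passed as `desctable(tests = my_tests_auto)` on grouped data.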
<br>

### List of statistical test functions

You can provide a named list of statistical functions, but here the mechanism is a bit different from the *stats* argument.

The list must contain either `.auto` or `.default`.

* `.auto` needs to be an automatic function, such as `tests_auto`. It will be used by default on all variables to select a test
* `.default` needs to be a single-term formula containing a statistical test function that will be used on all variables

You can also provide overrides to use specific tests for specific variables.
This is done using list items named after the variable and containing a single-term formula.

```{r}
iris %>%
  group_by(Petal.Length > 5) %>%
  desctable(tests = list(.auto   = tests_auto,
                         Species = ~chisq.test)) %>%
  datatable
```
<br>
```{r}
mtcars %>%
  dplyr::mutate(am = factor(am, labels = c("Automatic", "Manual"))) %>%
  group_by(am) %>%
  desctable(tests = list(.default = ~wilcox.test,
                         mpg      = ~t.test)) %>%
  datatable
```
<br>
You might wonder why a formula is required. It is needed to capture the test name, so that it can be shown in the resulting table.

As with statistical functions, any statistical test function defined in R can be used.

The conditions are that the function

* accepts a formula (`variable ~ grouping_variable`) as a first positional argument (as is the case with most tests, like `t.test`), and
* returns an object with a `p.value` element.
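
As a sketch of a function meeting these two conditions, the hypothetical `median_test_sketch` below takes a formula as its first argument and returns a plain list with a `p.value` element. It assumes that the variables in the formula can be resolved from the formula's environment, which is why the standalone call is wrapped in `with()`.

```{r}
# Hypothetical test: compares the share of values above the overall median
# between groups. Accepts a formula first, returns a list with `p.value`.
median_test_sketch <- function(formula) {
  mf <- model.frame(formula)            # column 1: variable, column 2: grouping
  above <- mf[[1]] > median(mf[[1]])
  list(p.value = fisher.test(table(above, mf[[2]]))$p.value)
}

with(mtcars, median_test_sketch(mpg ~ am))$p.value
```

Such a function could then be referenced in the *tests* list like any other test, for instance as `mpg = ~median_test_sketch`.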

Several convenience functions are provided: formula versions of `chisq.test` and `fisher.test` using generic S3 methods (thus the behavior of standard calls to `chisq.test` and `fisher.test` is not modified), and `ANOVA`, a partial application of `oneway.test` with parameter *var.equal* = T.

# Tips and tricks

In the *stats* argument, you can not only feed function names, but even arbitrary function definitions, functional sequences (a feature provided with the pipe (`%>%`)), or partial applications (with the **purrr** package):

```{r}
mtcars %>%
  desctable(stats = list("N"              = length,
                         "Sum of squares" = function(x) sum(x^2),
                         "Q1"             = . %>% quantile(probs = .25),
                         "Q3"             = purrr::partial(quantile, probs = .75))) %>%
  datatable
```
<br>
In the *tests* argument, you can also provide function definitions, functional sequences, and partial applications in the formulas:

```{r}
iris %>%
  group_by(Species) %>%
  desctable(tests = list(.auto        = tests_auto,
                         Sepal.Width  = ~function(f) oneway.test(f, var.equal = F),
                         Petal.Length = ~. %>% oneway.test(var.equal = T),
                         Sepal.Length = ~purrr::partial(oneway.test, var.equal = T))) %>%
  datatable
```
<br>
This allows you to modulate the behavior of `desctable` in every detail, such as using paired tests, or tests that do not return an *htest* object.

```{r}
# This is a contrived example, which would be better solved with a dedicated function
library(survival)

bladder$surv <- Surv(bladder$stop, bladder$event)

bladder %>%
  group_by(rx) %>%
  desctable(tests = list(.default = ~wilcox.test,
                         surv     = ~. %>% survdiff %>% .$chisq %>% pchisq(1, lower.tail = F) %>% list(p.value = .))) %>%
  datatable
```