# tidyflow: a workflow that fits the tidyverse

Tidyflow is *not* a package, but a project skeleton that you can clone/fork to start your own projects.

It follows the project structure proposed by Hadley Wickham in [R for Data Science](http://r4ds.had.co.nz/).

![](http://r4ds.had.co.nz/diagrams/data-science.png)

*Image under [CC-BY-NC-ND](https://creativecommons.org/licenses/by-nc-nd/3.0/us/)*

## Install

If you are on GitHub, simply fork the repo.

If you don't want to use GitHub as your remote, clone the repo into a new directory:

`git clone https://www.github.com/maximewack/tidyflow new_project`

Then change the git remote origin to your own remote repo:

`git remote set-url origin your_repo_url`

The project already contains a *.gitignore* file for R projects.
Add rules for your data files if you don't want them to be shared.

Also run `install.packages(c("tidyverse", "rmarkdown", "knitr"))` to install the necessary dependencies.

## Directory structure

The project contains three subdirectories: **Data/**, **Docs/**, and **Rmd/**.
**Data/** also contains a **Raw/** subdirectory.

**Data/Raw/** should contain the raw data when they exist as files (csv, xls(x), SQLite databases, SAS files, SPSS files, etc.).

**Docs/** should contain all external documents you have about the project (synopsis, context articles/presentations, etc.).

**Rmd/** will contain the files used to communicate the results.
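
Laid out as a tree, the skeleton looks like this:

```
.
├─ Data/
│  └─ Raw/
├─ Docs/
└─ Rmd/
```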

## Scripts

Four scripts are already present, populated with boilerplate code for each of the steps.
Packages `dplyr`, `magrittr`, `tidyr`, and `purrr` can be useful throughout the workflow.

**Every step makes a "savepoint" of your work, allowing you to rapidly iterate on any of the steps without having to re-run the previous ones (unless you've changed something upstream in the chain).**
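
For instance, once a savepoint exists, a later session can pick up straight from it instead of re-running the whole chain (a minimal sketch, assuming the *.Rdata* files are written to **Data/** as in the sketches below):

```r
# Resume from the tidy savepoint without re-running 01_Import.R and 02_Tidy.R
load("Data/tidy.Rdata")
```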

### 01_Import.R

The first script is used to import raw data (whatever the source) and save a local csv copy in **Data/**.
Useful packages from the tidyverse here are `readr`, `readxl`, `rvest`, `haven`, and `jsonlite`.

Having the data available as plain csv files means you can always start over from the beginning, even if the original source becomes unavailable.
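
A minimal sketch of what this script could contain, assuming the raw data is an Excel file called *Data/Raw/patients.xlsx* (a hypothetical name):

```r
library(readxl)
library(readr)

# Import the raw file (hypothetical name) and keep a plain csv copy in Data/
raw <- read_excel("Data/Raw/patients.xlsx")
write_csv(raw, "Data/patients.csv")
```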

### 02_Tidy.R

This step consists mostly of "non-destructive" data management: assigning types to columns (factors with correct/human-readable levels, dates, etc.), correcting/censoring obviously abnormal values and errors, transforming between *long* and *wide* formats, etc.
Useful packages here are `lubridate`, `stringr`, and `forcats`.

The results are saved in a **tidy.Rdata** file.

After this second step, you will have your full data ready to use in R and shouldn't have to run the first two steps again (unless you get hold of new data).
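
A sketch of this step, reusing the hypothetical csv from the previous section and assuming hypothetical `sex` and `inclusion_date` columns:

```r
library(readr)
library(dplyr)
library(lubridate)
library(forcats)

tidy <- read_csv("Data/patients.csv") %>%
  mutate(sex = fct_recode(sex, Male = "M", Female = "F"),  # human-readable factor levels
         inclusion_date = dmy(inclusion_date))             # parse dates

# Savepoint for the tidy data (the Data/ path is an assumption)
save(tidy, file = "Data/tidy.Rdata")
```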

### 03_Transform.R

This script is for data transformation: it contains all the transformations needed to make the data ready for analysis.
Some "destructive" data management can occur here, such as dropping variables or observations, or modifying the levels of some factors.
Useful packages here are `forcats`, `lubridate`, and `stringr`.

The results are saved in a **transformed.Rdata** file.
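
A sketch of this step, reusing the hypothetical columns from above:

```r
library(dplyr)
library(forcats)

load("Data/tidy.Rdata")

transformed <- tidy %>%
  filter(!is.na(inclusion_date)) %>%   # drop observations without an inclusion date
  mutate(sex = fct_drop(sex))          # drop unused factor levels

# Savepoint for the transformed data (the Data/ path is an assumption)
save(transformed, file = "Data/transformed.Rdata")
```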

### 04_Analyze.R

This script contains further data transformation, and the analyses themselves, producing the resulting tables and plots.
There is a bit of overlap between **03_Transform.R** and **04_Analyze.R**, as this is often an iterative process. Both files can be merged into one, but it can be useful to keep time-consuming transformations in a separate script and have their results handy.
Useful packages here are `broom`, `ggplot2`, and `modelr`.

In this script, *all* the "interesting results", full tables, and ggplot graphs are included in a single hierarchical list, saved in a **results.Rdata** file.
All the results from the analyses should be saved as-is, without transformation, so that every result can be used in the Rmd.
Having all the results pre-computed for the Rmd means that it will take mere seconds to re-compile, while still having access to all the results if you want/need to use them somewhere in the manuscript/report.

The results object can look like this:

```
results
├─ tables
│  ├─ demographics
│  ├─ ttt_vs_control
│  └─ table3
├─ list_of_interesting_values
├─ interesting_values2
└─ plots
   ├─ figure1
   ├─ figure2
   └─ figure3
```
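
A sketch of how such a list could be built, with hypothetical analyses:

```r
library(dplyr)
library(ggplot2)

load("Data/transformed.Rdata")

# Gather every interesting result in a single hierarchical list
results <- list(
  tables = list(
    demographics = transformed %>% count(sex)             # hypothetical summary table
  ),
  plots = list(
    figure1 = ggplot(transformed, aes(sex)) + geom_bar()  # hypothetical figure
  )
)

# Savepoint for the results (the Data/ path is an assumption)
save(results, file = "Data/results.Rdata")
```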

### Rmd/report.Rmd

The Rmd file should not contain *any* literal values: every number, table, and graph *has* to come from the results object (in its original form).
Only minor cosmetic modifications should be made there (running `prettyNum` on numbers or table columns, `select`/`filter`/`arrange`/`rename` on the full tables, etc.).
Multiple Rmds can be made using the same results: one for a full-blown scientific article, one for a quick report, one for a presentation, etc.

You will never again have to check for discrepancies between tables/figures and text, or even between different output media.
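
As an illustration, a chunk near the top of *Rmd/report.Rmd* can load the savepoint once, and everything else is pulled from `results` (paths and names are assumptions carried over from the sketches above):

````
```{r setup, include = FALSE}
library(knitr)
load("../Data/results.Rdata")  # path relative to Rmd/, an assumption
```

```{r demographics}
kable(results$tables$demographics)
```

```{r figure1}
results$plots$figure1
```
````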