Maxime Wack 65e7b590f8 Added licence for diagram 7 years ago
Data Empty files 7 years ago
Docs Empty files 7 years ago
Rmd Script files 7 years ago
.gitignore Initial commit 7 years ago
01_Import.R Script files 7 years ago
02_Tidy.R Script files 7 years ago
03_Transform.R Script files 7 years ago
04_Analyse.R Script files 7 years ago
LICENSE Initial commit 7 years ago
README.md Added licence for diagram 7 years ago

README.md

tidyflow: a workflow that fits the tidyverse

Tidyflow is not a package, but a project skeleton that you can clone/fork to start your own projects.

It follows the project structure proposed by Hadley Wickham in R for Data Science.

Image under CC-BY-NC-ND

Install

Run

git clone https://www.github.com/maximewack/tidyflow new_project

to clone the project into a new directory (here named new_project).
Don't forget to change the git remote origin to your own remote repo.

The project already contains a .gitignore file for R projects.

Also run install.packages(c("tidyverse", "rmarkdown", "knitr")) to install the necessary dependencies.

Directory structure

The project contains three subdirectories: Data/, Docs/ and Rmd/.
Data/ also contains a Raw/ subdirectory.

Data/Raw/ should contain the raw data when they exist as files (csv, xls(x), SQLite databases, SAS files, SPSS files, etc.).

Docs/ should contain all external documents you have about the project (synopsis, context articles/presentations, etc.).

Rmd/ will contain the files used to communicate the results.

Scripts

Four scripts are already present, populated with boilerplate code for each of the steps.
The dplyr, magrittr, tidyr, and purrr packages are useful at every step.

Every step makes a “savepoint” of your work, so you can iterate rapidly on any one step without re-running the previous ones (unless you have changed something upstream in the chain).

01_Import.R

The first script is used to import raw data (whatever the source) and save a local csv copy in Data/.
Useful packages from the tidyverse here are readr, readxl, rvest, haven, and jsonlite.

Keeping the data as plain csv means you can always start again from the beginning, even if the original source becomes unavailable.
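
The import step can be sketched as follows. This is only an illustration, not the boilerplate shipped in 01_Import.R: the file name patients.csv, its columns, and the temporary project root are made up for the example, and base R is used so that the sketch runs without extra packages (readr::read_csv2 would be a tidyverse equivalent for this semicolon-separated file).

```r
# Hypothetical project root; in a real project this is the cloned directory.
root <- file.path(tempdir(), "new_project")
dir.create(file.path(root, "Data", "Raw"), recursive = TRUE, showWarnings = FALSE)

# Stand-in for a raw source file (contents invented for this example).
raw_path <- file.path(root, "Data", "Raw", "patients.csv")
writeLines(c("id;sex;birth", "1;M;1980-02-01", "2;F;1975-11-23"), raw_path)

# Import: read the raw file as-is, keeping every column as character.
# Typing the columns is left to the tidy step.
patients <- read.csv(raw_path, sep = ";", colClasses = "character")

# Savepoint: a plain csv copy in Data/, independent of the original source.
csv_path <- file.path(root, "Data", "patients.csv")
write.csv(patients, csv_path, row.names = FALSE)
```

Reading everything as character at this stage is a design choice of the sketch: it keeps the import step free of any interpretation of the data.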

02_Tidy.R

This step consists mostly of “non-destructive” data management: assigning types to columns (factors with correct, human-readable levels, dates, etc.), correcting or censoring obviously abnormal values and errors, transforming between long and wide format, etc.
Useful packages here are lubridate, stringr, and forcats.

The results are saved in a tidy.Rdata file.

After this second step, you will have your full data ready to use in R and shouldn't have to run the first two steps anymore (unless you get hold of new data).
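
A minimal sketch of such a tidy step, under stated assumptions: the patients data, its columns, the age cut-off of 120, and the choice of Data/ as the location of tidy.Rdata are all hypothetical. Base R is used so the sketch runs without extra packages; forcats and lubridate offer more convenient equivalents for the factor and date handling.

```r
# Hypothetical Data/ directory inside a temporary project.
data_dir <- file.path(tempdir(), "tidy_example", "Data")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for the csv written by the import step (values invented).
imported <- data.frame(
  id    = c("1", "2", "3"),
  sex   = c("M", "F", "F"),
  birth = c("1980-02-01", "1975-11-23", "2001-07-09"),
  age   = c("44", "49", "230"),
  stringsAsFactors = FALSE
)
write.csv(imported, file.path(data_dir, "patients.csv"), row.names = FALSE)

# Start from the csv savepoint, with every column read as character.
patients <- read.csv(file.path(data_dir, "patients.csv"), colClasses = "character")

# Assign types: integer id, factor with human-readable levels, date, numeric.
patients$id    <- as.integer(patients$id)
patients$sex   <- factor(patients$sex, levels = c("F", "M"), labels = c("Female", "Male"))
patients$birth <- as.Date(patients$birth)
patients$age   <- as.numeric(patients$age)

# Censor an obviously abnormal value (the threshold is an assumption).
patients$age[patients$age > 120] <- NA

# Savepoint for the next step.
tidy_path <- file.path(data_dir, "tidy.Rdata")
save(patients, file = tidy_path)
```

No row or column is dropped here, which is what keeps the step “non-destructive”.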

03_Transform.R

This script is for transforming the data. It contains all the transformations needed to make the data ready for analysis.
Some “destructive” data management can occur here, such as dropping variables or observations, or modifying the levels of some factors.
Useful packages here are forcats, lubridate, and stringr.

The results are saved in a transformed.Rdata file.
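
As an illustration of the “destructive” operations mentioned above, here is a minimal sketch. The data frame, its columns, the factor levels, and the location of the .Rdata files are invented for the example; forcats::fct_collapse and dplyr::select/filter would be tidyverse equivalents of the base R calls used here.

```r
# Hypothetical Data/ directory inside a temporary project.
data_dir <- file.path(tempdir(), "transform_example", "Data")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for the savepoint written by the tidy step (values invented).
patients <- data.frame(
  id  = 1:4,
  arm = factor(c("control", "dose_low", "dose_high", "control")),
  age = c(44, 49, NA, 31)
)
save(patients, file = file.path(data_dir, "tidy.Rdata"))
rm(patients)

# Start from the tidy savepoint; load() restores the `patients` object.
load(file.path(data_dir, "tidy.Rdata"))

# Drop observations: keep only rows with a known age.
transformed <- patients[!is.na(patients$age), ]

# Drop a variable that the analyses will not need.
transformed$id <- NULL

# Modify factor levels: merge both dose groups into a single "treated" level.
levels(transformed$arm) <- list(
  control = "control",
  treated = c("dose_low", "dose_high")
)

# Savepoint for the analysis step.
transformed_path <- file.path(data_dir, "transformed.Rdata")
save(transformed, file = transformed_path)
```

Because tidy.Rdata is left untouched, any of these choices can be revisited later without re-running the first two scripts.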

04_Analyse.R

This script contains further data transformation and the analyses, and produces the resulting tables and plots.
There is some overlap between 03_Transform.R and 04_Analyse.R, as this is often an iterative process. Both files can be merged into one, but it can be useful to keep time-consuming transformations in a separate script and have their results handy.
Useful packages here are broom, ggplot2, and modelr.

In this script, all the “interesting results”, full tables, and ggplot graphs are collected in a single hierarchical list, which is saved in a results.Rdata file.
All the results from the analyses should be saved as-is, without transformation, so that every result can be used in the Rmd. Having all the results pre-computed means that the Rmd takes mere seconds to re-compile, while you still have access to every result if you want or need to use it somewhere in the manuscript/report.

The results object can look like this:

results
├─ tables
│  ├─ demographics
│  ├─ ttt_vs_control
│  └─ table3
├─ list_of_interesting_values
├─ interesting_values2
└─ plots
   ├─ figure1
   ├─ figure2
   └─ figure3
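
A list with this shape can be built with nested list() calls. The sketch below is an assumption about how one might do it, not the content of the analysis script: it uses R's built-in sleep dataset as a stand-in for the transformed data, and the entries are only examples. The plots entry is left empty so that the sketch runs without ggplot2; ggplot objects would be stored there in the same way.

```r
# Stand-in for the transformed data: R's built-in `sleep` dataset
# (20 observations, two groups of 10).
dat <- datasets::sleep

results <- list(
  tables = list(
    # Number of observations per group.
    demographics   = as.data.frame(table(group = dat$group)),
    # Mean outcome per group, kept as a full, unformatted table.
    ttt_vs_control = aggregate(extra ~ group, data = dat, FUN = mean)
  ),
  list_of_interesting_values = list(
    n      = nrow(dat),
    # The whole test object is stored as-is, not just a rounded p-value.
    t_test = t.test(extra ~ group, data = dat)
  ),
  # Plot objects (e.g. ggplot graphs) would be stored here by name.
  plots = list()
)

# Savepoint used by the Rmd files (location is hypothetical).
results_path <- file.path(tempdir(), "results.Rdata")
save(results, file = results_path)
```

Individual results are then reachable by name, for example results$tables$demographics or results$list_of_interesting_values$t_test$p.value.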

Rmd/report.Rmd

The Rmd file should not contain any literal values: every number, table, and graph has to come from the results object (in its original form).
Only minor cosmetic modifications should be made at this stage (running prettyNum on numerics or table columns, select/filter/arrange/rename on the full tables, etc.).
Multiple Rmds can be made from the same results: one for a full-blown scientific article, one for a quick report, one for a presentation, etc.
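
The R code inside such an Rmd could look like the sketch below. It is an illustration only: the results object, its values, and the file location are invented, and the first few lines merely create a stand-in results.Rdata so that the example runs on its own. Base R is used for the cosmetic changes; dplyr::rename and friends would do the same job.

```r
# Stand-in for the results.Rdata written by the analysis step (values invented).
results <- list(
  tables = list(
    demographics = data.frame(
      group    = c("control", "treated"),
      n        = c(1520L, 1498L),
      mean_age = c(54.3218, 55.1042)
    )
  ),
  list_of_interesting_values = list(n_total = 3018L)
)
results_path <- file.path(tempdir(), "results.Rdata")
save(results, file = results_path)
rm(results)

# What a setup chunk of the Rmd would do: load the pre-computed results.
load(results_path)

# Cosmetic change only: format a number for use in the text.
n_total_txt <- prettyNum(results$list_of_interesting_values$n_total, big.mark = ",")

# Cosmetic change only: round and rename columns of a full table for display.
demo <- results$tables$demographics
demo$mean_age <- round(demo$mean_age, 1)
names(demo) <- c("Group", "N", "Mean age (years)")
```

Here n_total_txt is the string "3,018", ready to be inserted in the text, and demo can be passed to knitr::kable for display. Nothing is recomputed in the Rmd itself.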

You will never have to check again for discrepancies between tables/figures and text, or even between different media.