---
title: "Web scraping"
author: "Maxime Wack"
date: "19/11/2019"
output:
  xaringan::moon_reader:
    css: ['default', 'css/my_style.css']
    lib_dir: libs
    seal: false
    self_contained: true
    nature:
      ratio: '4:3'
      countIncrementalSlides: false
      beforeInit: "addons/macros.js"
      highlightLines: true
  pdf_document:
    seal: false
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.asp = .5)

library(tidyverse)
library(DT)
library(knitr)

options(DT.options = list(paging = FALSE,
                          info = FALSE,
                          searching = FALSE))

# Make all datatable() calls hide row names by default
datatable <- partial(datatable, rownames = FALSE)
```
class: center, middle, title

# UE Visualisation

### 2019-2020

## Dr. Maxime Wack

### AHU Informatique médicale

#### Hôpital Européen Georges Pompidou, <br/> Université de Paris

---
# Web scraping

### Using `httr` and `rvest`

## httr

Makes network requests

→ query and download pages directly from R

## rvest

Extracts data from HTML pages

---
# httr

```{r init, echo = F, message = F, error = F}
library(tidyverse)
library(httr)
library(rvest)
```

Download a Wikipedia page

```{r dl_wikipedia}
GET("https://en.wikipedia.org/wiki/Comparison_of_operating_systems") -> wiki
```

```{r dl_wikipedia_do, echo = F}
wiki
```
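A response object carries more than the body. As a sketch of common `httr` accessors applied to the `wiki` response above (the printed values depend on the live request, so none are shown here):

```r
library(httr)

# Reuses the `wiki` response from the previous chunk
status_code(wiki)             # numeric HTTP status, 200 on success
stop_for_status(wiki)         # turn an HTTP error status into an R error
headers(wiki)$`content-type`  # response headers, e.g. the content type
```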
---

# Parsing HTML

```{r html}
wiki %>%
  read_html() -> wiki_html
```

```{r html_do, echo = F}
wiki_html
```

---
# CSS selectors

[W3Schools](https://www.w3schools.com/cssref/css_selectors.asp)

### Selectors identify a specific **node** in the **DOM** (Document Object Model) of an HTML page

### They can select by identifier, class, position in the hierarchy, position among siblings, or relative to other elements

### Use the **inspector** in the browser's developer tools to pinpoint the elements to capture
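To experiment with selectors without touching the network, `rvest` provides `minimal_html()` for building a throwaway document (the HTML snippet below is invented for illustration):

```r
library(rvest)

# A toy document to try selectors on
doc <- minimal_html('
  <div id="main">
    <p class="intro">First</p>
    <p>Second</p>
  </div>')

doc %>% html_nodes("#main p")                 # both <p> under the id "main"
doc %>% html_node("p.intro") %>% html_text()  # "First"
```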
---

# CSS selectors

```{r tables}
wiki_html %>%
  html_nodes(".wikitable")
```

```{r table}
wiki_html %>%
  html_node("div + .wikitable")
```

---
# Extracting a table

```{r scrape}
wiki_html %>%
  html_node("div + .wikitable") %>%
  html_table() -> wikitable
```

```{r scrape_do, echo = F}
datatable(wikitable)
```
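Tables are not the only target: the same selected nodes also expose their text and attributes. A sketch reusing the parsed `wiki_html` (the selector is illustrative; results depend on the live page):

```r
# Link targets from the parsed page
wiki_html %>%
  html_nodes("a") %>%     # every anchor element in the document
  html_attr("href") %>%   # their href attribute, NA when absent
  head()
```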
---

# Exercises

### Reshape this table into normal form

### Extract the table with the technical information

### Identify the free operating systems running on a microkernel