How to download data from dynamic web pages using R. Example.
Here I will describe how I managed to download some nice tables from the soccerway.com site. Mainly, I was interested in tables like the one on this page: http://int.soccerway.com/teams/england/southampton-fc/670/squad/
My first try was not very complicated: I simply used XML::readHTMLTable. The function I created takes the url of the page with the table and returns a tidy data frame. The first line of the function body reads the table from the given url, and all the other lines make the downloaded table tidier. If you are familiar with R, a single look at the data right after they have been read from the site is enough to understand what cleanup code is needed. There are some problems with reading letters with diacritics, but I will not discuss them in this post.
soccerwayTeam <- function(url) {
  ## read the first table found on the page
  data <- XML::readHTMLTable(url, stringsAsFactors = FALSE)[[1]]
  nr <- nrow(data)
  ## number of columns is always 17, I hope
  ## clear garbage from data[nr, 17]: keep only the digits
  ## found among the first two characters of the cell
  buf <- data[[nr, 17]]
  buf <- substring(buf, 1:2, 1:2)
  buf <- buf[grep("[0-9]", buf)]
  buf <- paste(buf, collapse = "")
  data[[nr, 17]] <- buf
  ## col2 - portraits, col4 - nationalities, don't know how to read them
  data <- data[, -c(2, 4)]
  names(data) <- c("SqN", "Name", "Age", "Pos", "Min", "App", "LUp", "SIn", "SOut",
                   "Bench", "Goal", "Assist", "YC", "2YC", "RC")
  data
}
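The cleanup of the last cell can be tried in isolation. The snippet below is only a sketch: `keepDigits` is a hypothetical helper, not part of the function above, and the sample strings are made up, since the post does not show what the garbage actually looks like. It follows the same idea (split into characters, keep the digits, paste them back), but it scans the whole string instead of only the first two characters.

```r
## Hypothetical helper: keep only the digits of a string.
keepDigits <- function(x) {
  chars <- strsplit(x, "")[[1]]          # one element per character
  chars <- chars[grep("[0-9]", chars)]   # drop everything that is not a digit
  paste(chars, collapse = "")
}

## Made-up sample values; the real garbage on the page may look different.
keepDigits("0\n\t ")   # returns "0"
keepDigits("a1b2")     # returns "12"
```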
The real problem is that, when applied to the url above, the function returns only the table for the current season. But the page provides tables for previous seasons too. They can easily be viewed by changing the season in the select box on the page.
So we need to come up with something new if we want to download these tables. One way to solve this problem is to use JavaScript to make this choice. But I decided not to take this path: it seemed there must be a better solution, and besides, I'm not familiar with JavaScript.
So, what have I done? First of all, I decided to find out what happens to the page when I choose a season in the select box. What resources are being loaded? To do this, I used the page inspector built into Firefox. I'm not sure, but such a feature should be present in all modern browsers. By default, it is launched with “Q” + “right-mouse-click”.
When the page inspector is launched, we should do the following:
- Select “console” tab.
- Select “Net” log.
- Choose the season (in my case, 2013) on the page.
A new entry with a link to the loaded resource appears in the log. After clicking on this link, we will see a window containing information about this resource. We only want to know the url.
If we try to open this page in the browser, we will see something like this:
It may not be obvious, but in fact this is the table we want to download. Moreover, the function created above can also be applied to such a page.
soccerwayTeam("http://int.soccerway.com/a/block_team_squad?block_id=page_team_1_block_team_squad_3&callback_params=%7B%22team_id%22%3A670%7D&action=changeSquadSeason&params=%7B%22season_id%22%3A%228318%22%7D")
## SqN Name Age Pos Min App LUp SIn SOut Bench Goal Assist YC 2YC RC
## 1 1 K. Davis 38 G 180 2 2 0 0 19 0 0 0 0 0
## 2 25 P. Gazzaniga 22 G 662 8 7 1 0 13 0 0 0 0 0
## 3 31 A. Boruc 34 G 2578 29 29 0 1 0 0 1 0 0 0
## 4 41 C. Cropper 21 G 0 0 0 0 0 6 0 0 0 0 0
## 5 2 N. Clyne 23 D 1860 25 20 5 3 13 0 4 3 0 0
## 6 3 M. Yoshida 26 D 632 8 7 1 0 14 1 0 0 0 0
## 7 5 D. Lovren 25 D 2788 31 31 0 1 1 2 1 7 0 0
## 8 6 Jos\\u00e9 Fonte 30 D 3181 36 35 1 0 3 3 0 7 0 0
## 9 13 D. Fox 28 D 258 3 3 0 1 1 0 1 0 0 0
## 10 22 C. Chambers 19 D 1641 22 18 4 3 19 0 0 0 0 0
## 11 23 L. Shaw 19 D 2994 35 35 0 5 1 0 1 4 0 0
## 12 26 J. Hooiveld 31 D 270 3 3 0 0 31 0 0 1 0 0
## 13 33 M. Targett 19 D 0 0 0 0 0 1 0 0 0 0 0
## 14 4 M. Schneiderlin 25 M 2766 33 31 2 3 2 2 1 8 0 0
## 15 8 S. Davis 29 M 2499 34 28 6 14 9 2 6 3 0 0
## 16 10 G. Ram\\u00edrez 23 M 510 18 3 15 3 20 1 3 2 0 0
## 17 12 V. Wanyama 23 M 1660 23 19 4 5 7 0 0 7 0 0
## 18 16 J. Ward-Prowse 20 M 1604 34 16 18 7 22 0 2 3 0 0
## 19 18 J. Cork 25 M 1732 28 21 7 11 13 0 0 5 0 0
## 20 20 A. Lallana 26 M 3099 38 37 1 19 1 9 6 3 0 0
## 21 21 Guly 32 M 77 9 0 9 0 19 0 0 0 0 0
## 22 27 L. Isgrove 21 M 0 0 0 0 0 3 0 0 0 0 0
## 23 38 H. Reed 19 M 8 4 0 4 0 11 0 0 0 0 0
## 24 7 R. Lambert 32 A 2815 37 31 6 11 6 13 10 2 0 0
## 25 9 J. Rodriguez 25 A 2571 33 30 3 12 3 15 3 3 0 0
## 26 17 P. Osvaldo 28 A 856 13 9 4 4 4 3 0 3 0 0
## 27 19 T. Lee 28 A 0 0 0 0 0 2 0 0 0 0 0
## 28 24 E. Mayuka 23 A 0 0 0 0 0 1 0 0 0 0 0
## 29 40 S. Gallagher 19 A 379 18 3 15 3 21 1 0 0 0 0
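Decoding the url used above shows that the team and the season are passed as two small JSON fragments, `{"team_id":670}` and `{"season_id":"8318"}`. The sketch below rebuilds the same address from these two values with `utils::URLencode`. The function name `soccerwaySquadUrl` and its arguments are made up for illustration, and the code assumes that the `block_id` and `action` parts stay the same for other teams and seasons, which this post does not confirm.

```r
## Hypothetical helper: rebuild the squad url from a team id and a season id.
## Assumption: block_id and action are constant; only this one url is known.
soccerwaySquadUrl <- function(teamId, seasonId) {
  ## JSON fragments as they appear in the decoded url
  callbackParams <- sprintf('{"team_id":%d}', as.integer(teamId))
  params <- sprintf('{"season_id":"%d"}', as.integer(seasonId))
  paste0(
    "http://int.soccerway.com/a/block_team_squad",
    "?block_id=page_team_1_block_team_squad_3",
    ## reserved = TRUE also escapes characters such as ":" inside the JSON
    "&callback_params=", utils::URLencode(callbackParams, reserved = TRUE),
    "&action=changeSquadSeason",
    "&params=", utils::URLencode(params, reserved = TRUE)
  )
}

url <- soccerwaySquadUrl(670, 8318)
url
```

If the assumption holds, `soccerwayTeam(soccerwaySquadUrl(670, 8318))` should be equivalent to the call shown above, and other season ids found with the page inspector could be plugged in the same way.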
In the next posts I will describe how to automate the downloading of such tables, in particular, how to download all the team tables for a certain league and season. Also, you are welcome to visit my GitHub account.