11.15.2014

How to download data from dynamic web pages using R. Example.

Here I will describe how I managed to download some nice tables from the soccerway.com site. Mainly, I was interested in downloading tables like the one on this page: http://int.soccerway.com/teams/england/southampton-fc/670/squad/

My first try was not very complicated. I simply used XML::readHTMLTable. The function I created takes the url of the page with the table and returns a tidy data frame. The first line of the function body reads the table from the given url, and all the other lines make the downloaded table tidier. If you are familiar with R, a single look at the data right after they have been read from the site is enough to understand what code you should add. There are some problems with reading letters with diacritics, but I will not discuss them in this post.


soccerwayTeam <- function(url) {
    ## read the first table on the page; keep strings as character vectors
    data <- XML::readHTMLTable(url, stringsAsFactors = FALSE)[[1]]
    nr <- nrow(data)
    ## the number of columns is always 17, I hope

    ## clear the garbage from data[nr, 17]: keep only the digits
    ## found among the first two characters of the cell
    buf <- data[[nr, 17]]
    buf <- substring(buf, 1:2, 1:2)
    buf <- buf[grep("[0-9]", buf)]
    buf <- paste(buf, collapse = "")
    data[[nr, 17]] <- buf

    ## col2 - portraits, col4 - nationalities; I don't know how to read them
    data <- data[, -c(2, 4)]

    names(data) <- c("SqN", "Name", "Age", "Pos", "Min", "App", "LUp", "SIn", "SOut", 
        "Bench", "Goal", "Assist", "YC", "2YC", "RC")

    data
}
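The clean-up of the last cell is the least obvious part of the function, so here is a minimal sketch of that step on its own. The helper name cleanCell and the sample strings are made up for illustration; only the three lines of logic come from the function above.

```r
## Hypothetical helper: the same logic as the clean-up step in soccerwayTeam(),
## pulled out so that it can be tried on its own.
cleanCell <- function(buf) {
    chars <- substring(buf, 1:2, 1:2)       # first two characters, one per element
    chars <- chars[grep("[0-9]", chars)]    # keep only the digits among them
    paste(chars, collapse = "")             # glue them back into one string
}

## made-up examples of a cell with trailing garbage
cleanCell("0\"}")    # "0"
cleanCell("12xyz")   # "12"
cleanCell("3")       # "3"
```

Note that only the first two characters are inspected, so a value with three or more digits would be cut short.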
The real problem is that, when applied to the url above, the function returns only the table for the current season. But the page provides the tables for previous seasons too. They can easily be viewed by changing the season in the select-box on the page.
So we need to come up with something new if we want to download these tables. One way to solve this problem is to use JavaScript to make this choice. But I decided not to take this path. It seemed like there must be a better solution; moreover, I'm not familiar with JavaScript.
So, what have I done? First of all, I decided to find out what happens to the page when I choose a season in the select-box. What resources are being loaded? To do this, I used the page inspector built into Firefox. I'm not sure, but such a feature should be present in all modern browsers. By default, it is launched with “Q” + “right-mouse-click”.
When the page inspector is launched, we should do the following:
  1. Select “console” tab.
  2. Select “Net” log.
  3. Choose the season (in my case, 2013) on the page.
After these actions are performed, we can see the list of resources that have been accessed. In our case, we are interested in this one:
After clicking on this link, we will see a window containing information about this resource. We only want to know the url.
If we try to open this url in the browser, we will see something like this:
It may not be obvious, but, in fact, this is the table we want to download. Moreover, the function created above can also be applied to such a page.
soccerwayTeam("http://int.soccerway.com/a/block_team_squad?block_id=page_team_1_block_team_squad_3&callback_params=%7B%22team_id%22%3A670%7D&action=changeSquadSeason&params=%7B%22season_id%22%3A%228318%22%7D")
##    SqN             Name Age Pos  Min App LUp SIn SOut Bench Goal Assist YC 2YC RC
## 1    1         K. Davis  38   G  180   2   2   0    0    19    0      0  0   0  0
## 2   25     P. Gazzaniga  22   G  662   8   7   1    0    13    0      0  0   0  0
## 3   31         A. Boruc  34   G 2578  29  29   0    1     0    0      1  0   0  0
## 4   41       C. Cropper  21   G    0   0   0   0    0     6    0      0  0   0  0
## 5    2         N. Clyne  23   D 1860  25  20   5    3    13    0      4  3   0  0
## 6    3       M. Yoshida  26   D  632   8   7   1    0    14    1      0  0   0  0
## 7    5        D. Lovren  25   D 2788  31  31   0    1     1    2      1  7   0  0
## 8    6 Jos\\u00e9 Fonte  30   D 3181  36  35   1    0     3    3      0  7   0  0
## 9   13           D. Fox  28   D  258   3   3   0    1     1    0      1  0   0  0
## 10  22      C. Chambers  19   D 1641  22  18   4    3    19    0      0  0   0  0
## 11  23          L. Shaw  19   D 2994  35  35   0    5     1    0      1  4   0  0
## 12  26      J. Hooiveld  31   D  270   3   3   0    0    31    0      0  1   0  0
## 13  33       M. Targett  19   D    0   0   0   0    0     1    0      0  0   0  0
## 14   4  M. Schneiderlin  25   M 2766  33  31   2    3     2    2      1  8   0  0
## 15   8         S. Davis  29   M 2499  34  28   6   14     9    2      6  3   0  0
## 16  10 G. Ram\\u00edrez  23   M  510  18   3  15    3    20    1      3  2   0  0
## 17  12       V. Wanyama  23   M 1660  23  19   4    5     7    0      0  7   0  0
## 18  16   J. Ward-Prowse  20   M 1604  34  16  18    7    22    0      2  3   0  0
## 19  18          J. Cork  25   M 1732  28  21   7   11    13    0      0  5   0  0
## 20  20       A. Lallana  26   M 3099  38  37   1   19     1    9      6  3   0  0
## 21  21             Guly  32   M   77   9   0   9    0    19    0      0  0   0  0
## 22  27       L. Isgrove  21   M    0   0   0   0    0     3    0      0  0   0  0
## 23  38          H. Reed  19   M    8   4   0   4    0    11    0      0  0   0  0
## 24   7       R. Lambert  32   A 2815  37  31   6   11     6   13     10  2   0  0
## 25   9     J. Rodriguez  25   A 2571  33  30   3   12     3   15      3  3   0  0
## 26  17       P. Osvaldo  28   A  856  13   9   4    4     4    3      0  3   0  0
## 27  19           T. Lee  28   A    0   0   0   0    0     2    0      0  0   0  0
## 28  24        E. Mayuka  23   A    0   0   0   0    0     1    0      0  0   0  0
## 29  40     S. Gallagher  19   A  379  18   3  15    3    21    1      0  0   0  0
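The url used above is easier to read once decoded: its callback_params and params fields are URL-encoded JSON, {"team_id":670} and {"season_id":"8318"}. Below is a minimal sketch of rebuilding that url from the two ids. The helper name soccerwaySeasonUrl is hypothetical, and the block_id and action values are simply copied from the one url above, so it is only an assumption that they stay the same for other teams and seasons.

```r
## Hypothetical helper: rebuilds the url used above from a team id and a season id.
## Assumption: block_id and action do not change between teams and seasons.
soccerwaySeasonUrl <- function(team_id, season_id) {
    ## the two JSON fragments, as they appear in the decoded url
    callback <- sprintf('{"team_id":%d}', as.integer(team_id))
    params <- sprintf('{"season_id":"%s"}', as.character(season_id))

    ## reserved = TRUE also encodes the braces, quotes and colons
    paste0("http://int.soccerway.com/a/block_team_squad",
           "?block_id=page_team_1_block_team_squad_3",
           "&callback_params=", utils::URLencode(callback, reserved = TRUE),
           "&action=changeSquadSeason",
           "&params=", utils::URLencode(params, reserved = TRUE))
}

url <- soccerwaySeasonUrl(670, 8318)
## decoding the result shows the JSON fragments again
utils::URLdecode(url)
```

With the ids from this post, soccerwayTeam(soccerwaySeasonUrl(670, 8318)) should request the same resource as the call above. The season ids for other seasons still have to be looked up in the browser as described.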
In the next posts I will describe how to automate the downloading of such tables, in particular, how to download all the team tables for a given league and season. Also, you are welcome to visit my GitHub account.
