7.15.2015

Facebook chats analysis with R

Getting the data

You may know that Facebook lets you download everything you have shared with it as an archive of *.htm files. To get it, go to Settings, open the General settings tab and look for “Download a copy of your Facebook data”. I was interested in analysing my chats on FB, so I downloaded this archive and started working with it.

The archive contains many files, organized so that the whole thing can be browsed like a website. The only file needed here is messages.htm. First I parse this file and create a list of XML nodes, each of which represents a single chat.

library(XML)
## 'data' folder contains the unzipped archive
messages.parsed <- htmlParse(readLines("data/html/messages.htm"))
nodeList <- getNodeSet(messages.parsed,"//div[@class = 'thread']")
## name of a chat is a list of chat participants
names(nodeList) <- xpathSApply(messages.parsed,"//div[@class = 'thread']/text()", xmlValue)

Not all chats are interesting, and many may contain only a couple of messages. So I select the one I’m interested in (a node from the list of nodes) and apply my function getThread to it.

data <- getThread(nodeList[[certain_number]])
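
To decide which chat deserves a closer look, it can help to count the messages in each thread first. The snippet below is a minimal sketch on a toy document. It assumes that each message body is a p element directly inside the thread's div, which may not match your archive exactly. The names toyHtml, toyDoc, toyThreads and msgCounts are made up for illustration.

```r
library(XML)

## Toy stand-in for messages.htm with two chats (assumed layout, not real data)
toyHtml <- paste0(
    "<div class='thread'>me, nice girl<p>Thanks!</p><p>Happy bd to you!</p></div>",
    "<div class='thread'>me, Bill G<p>ok</p></div>"
)
toyDoc <- htmlParse(toyHtml, asText = TRUE)
toyThreads <- getNodeSet(toyDoc, "//div[@class = 'thread']")
names(toyThreads) <- xpathSApply(toyDoc, "//div[@class = 'thread']/text()", xmlValue)

## number of <p> children (assumed to be the messages) in each chat
msgCounts <- sapply(toyThreads, function(node) length(getNodeSet(node, "./p")))
sort(msgCounts, decreasing = TRUE)
```

The index of the largest count can then be used in place of certain_number.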

This function converts a node to a data.frame that is easy to work with. The first two columns hold the name of the message’s author and the time of its creation; the third column contains the messages themselves. The data frame looks like this (latest messages appear first):

##        name                time          message
## 1 nice girl 2015-01-30 01:17:00          Thanks!
## 2        me 2015-01-29 22:15:00 Happy bd to you!

All the functions I’ve created for this analysis, including this one, can be found in my GitHub repository.
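
Since getThread itself is not listed in this post, here is a rough sketch of what such a function might look like. It is not the version from the repository. It assumes markup like the toy sample, with a span of class user for the author, a span of class meta for the time and a p for the text, and it assumes a simplified timestamp format; the real export may differ on all of these. The names toyThread, toyNode and getThreadSketch are hypothetical.

```r
library(XML)

## Toy stand-in for one chat; the real messages.htm layout may differ
toyThread <- paste0(
    "<div class='thread'>me, nice girl",
    "<div class='message'><span class='user'>nice girl</span>",
    "<span class='meta'>2015-01-30 01:17</span></div><p>Thanks!</p>",
    "<div class='message'><span class='user'>me</span>",
    "<span class='meta'>2015-01-29 22:15</span></div><p>Happy bd to you!</p>",
    "</div>"
)
toyThreadDoc <- htmlParse(toyThread, asText = TRUE)
toyNode <- getNodeSet(toyThreadDoc, "//div[@class = 'thread']")[[1]]

## Hypothetical re-implementation: one row per message
getThreadSketch <- function(node){
    data.frame(
        name = xpathSApply(node, ".//span[@class = 'user']", xmlValue),
        ## assumed timestamp format; adjust to what your archive contains
        time = as.POSIXct(xpathSApply(node, ".//span[@class = 'meta']", xmlValue),
                          format = "%Y-%m-%d %H:%M", tz = "UTC"),
        message = xpathSApply(node, "./p", xmlValue),
        stringsAsFactors = FALSE
    )
}
sketchData <- getThreadSketch(toyNode)
sketchData
```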

Activity plots

The first thing I wanted to do was make a density plot of chat activity with respect to the time of day. In other words, a plot that shows at which time of day most messages in the chosen chat are written. The function below creates two plots: the first displays the overall chat activity, and the second shows it for each chat member separately. The argument startTime defines the leftmost time point on the plot. People usually don’t chat much at 6 am, so that hour is a sensible default starting point for the plot.

library(lubridate)
library(ggplot2)
dailyActivity <- function(data,startTime = 6){
    time <- hour(data$time) + minute(data$time)/60
    ## shift times so the x axis starts at startTime and wraps around midnight
    time <- (time - startTime)%%24
    df <- data.frame(name = data$name,time)
    togetherPlot <-
        (
            ggplot(df,aes(x=time))
            + geom_density(fill="green",alpha=0.3)
            +scale_x_continuous(limits = c(0,24),breaks = 0:24,
                                labels=(startTime + 0:24)%%24)
        )
    separatePlot <- 
        (
            ggplot(df,aes(x=time,colour=name))
            +geom_density(size=1)
            +scale_x_continuous(limits = c(0,24),breaks = 0:24,
                                labels=(startTime + 0:24)%%24)
        )  
    list(togetherPlot,separatePlot)
}
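
To see what the modulo shift inside dailyActivity does, here is a small base R check with a few hand-picked clock times. The vector clock and the other names are made up for illustration.

```r
startTime <- 6
## clock times as decimal hours: 06:00, 12:30, 23:45, 01:15, 05:30
clock <- c(6, 12.5, 23.75, 1.25, 5.5)

## position on the x axis: hours elapsed since startTime, wrapped at 24
shifted <- (clock - startTime) %% 24
shifted
## [1]  0.00  6.50 17.75 19.25 23.50

## axis labels map the positions 0..24 back to clock hours
axisLabels <- (startTime + 0:24) %% 24
axisLabels[c(1, 19, 25)]
## [1] 6 0 6
```

Times just before startTime, such as 05:30, end up at the right edge of the plot, so the quiet early-morning hours sit at the edges rather than in the middle.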

Here is an example of this function’s output:

plots <- dailyActivity(data)
plots[[1]]

plots[[2]]

We can also create similar plots of global activity (i.e. by day). Again, the function can be found in the repository, in the activity.R file.

plots <- globalActivity(data)
plots[[1]]

plots[[2]]
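
The globalActivity function is not listed in this post. As a rough idea of how such a function could be built along the same lines as dailyActivity, here is a sketch that plots message density over calendar days. It is an assumption about the approach, not the code from activity.R, and the names globalActivitySketch and toyData are hypothetical.

```r
library(ggplot2)

## Hypothetical sketch; the version in activity.R may differ
globalActivitySketch <- function(data){
    ## note: the day boundaries depend on the time zone used by as.Date()
    df <- data.frame(name = data$name, day = as.Date(data$time))
    togetherPlot <- ggplot(df, aes(x = day)) +
        geom_density(fill = "green", alpha = 0.3)
    separatePlot <- ggplot(df, aes(x = day, colour = name)) +
        geom_density()
    list(togetherPlot, separatePlot)
}

## Made-up data in the same shape as the output of getThread
toyData <- data.frame(
    name = rep(c("me", "nice girl"), each = 3),
    time = as.POSIXct(c("2015-01-05 10:00", "2015-01-20 21:30", "2015-02-02 08:15",
                        "2015-01-06 11:00", "2015-01-21 22:00", "2015-02-03 09:45"),
                      tz = "UTC"),
    stringsAsFactors = FALSE
)
toyPlots <- globalActivitySketch(toyData)
```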

Analysis of messages content

Since the messages are organized in a tidy data frame, we can analyse their content. So far I have only written one function, which counts occurrences of a specified pattern in the messages. I decided to use data.table for this, so our data.frame must first be converted to a data.table.

library(data.table)
dataT <- data.table(data)

Here is the function.

library(stringr)
regexCount <- function(dataT, regex, ignore_case = TRUE, type = 'ttl'){
    ## type is 'ttl' (total) or 'avrg' (average per message)
    switch(type,
        'ttl' = {title <-  paste('total n. of',regex)
                   f <- sum
                 },
        'avrg' = {title <- paste('avrg n. of',regex,'per msg')
                   f <- mean
                 }
    )
    ## stringr::regex() takes the place of the deprecated ignore.case()
    pattern <- stringr::regex(regex, ignore_case = ignore_case)
    ## apply f to the per-message counts, separately for each author
    ans <- dataT[,.(f(str_count(message, pattern))),by = name]
    setnames(ans, 'V1', title)
    ans
}

The argument dataT is the data.table, and regex is the regular expression the function will look for. ignore_case determines whether letter case is ignored when searching for regex. The argument type has two possible values: 'ttl' calculates the total number of occurrences of the pattern, and 'avrg' calculates the average number of occurrences per message. Examples of usage:

regexCount(dataT,'youtube')
##       name total n. of youtube
## 1: Steve J                   0
## 2:  Mark Z                  13
## 3:  Bill G                   5
regexCount(dataT,'(^| )I([ ]|$|.|,)',type = 'avrg')
##       name avrg n. of (^| )I([ ]|$|.|,) per msg
## 1: Steve J                           0.15542522
## 2:  Mark Z                           0.07236842
## 3:  Bill G                           0.14012739
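
One thing to keep in mind with the second pattern is that its dot is unescaped, so it matches any character and a word such as “Is” is counted too. The sketch below compares it with a stricter variant on a few made-up messages, and shows what the two type options aggregate. The stricter pattern reflects an assumed intent (“I” followed by a space, the end of the message, a period or a comma), and the names msgs, loose and strict are hypothetical.

```r
library(stringr)

msgs <- c("I think so", "Is it?", "you and I.", "hi")
loose  <- "(^| )I([ ]|$|.|,)"     # pattern from the example; "." matches any character
strict <- "(^| )I( |$|\\.|,)"     # assumed intent: the dot is a literal period

str_count(msgs, loose)
## [1] 1 1 1 0
str_count(msgs, strict)
## [1] 1 0 1 0

sum(str_count(msgs, strict))      # what type = 'ttl' computes for each person
## [1] 2
mean(str_count(msgs, strict))     # what type = 'avrg' computes for each person
## [1] 0.5
```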

Analysis of messages of a certain person

We can also extract the messages of a certain person from all the chats and apply everything described above to the newly created data.frame. The following function creates a data.frame consisting of all the messages of a selected person (of course, only those available to me; I do not hack anyone’s accounts).

getPerson <- function(nodeList,person,use.chatnames = FALSE){
    ## list of data.frames with only one person's messages selected
    lst <- lapply(nodeList,function(x){
      data <- getThread(x)
      data[data$name == person,]
    })
    if(!use.chatnames) names(lst) <- NULL
    ## bind all data.frames
    do.call(rbind,lst)
}

For example, to make my own daily activity plot, I do the following:

data <- getPerson(nodeList,'Iaroslav Domin')
dailyActivity(data)[[1]]
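
A side note on the use.chatnames argument of getPerson: when the list names are kept, rbind derives the row names of the result from them, so each row can be traced back to its chat. The base R sketch below illustrates this on two made-up chats. The names threads, withNames and withoutNames are hypothetical.

```r
## Toy per-chat data frames standing in for the output of getThread
threads <- list(
    chatA = data.frame(name = c("Ann", "Bob"), message = c("hi", "hello"),
                       stringsAsFactors = FALSE),
    chatB = data.frame(name = c("Bob", "Cid"), message = c("yo", "hey"),
                       stringsAsFactors = FALSE)
)
## keep only one person's messages in every chat
lst <- lapply(threads, function(d) d[d$name == "Bob", ])

withNames <- do.call(rbind, lst)              # like use.chatnames = TRUE
withoutNames <- do.call(rbind, unname(lst))   # like use.chatnames = FALSE

rownames(withNames)      # row names built from the chat names
rownames(withoutNames)   # row names without any chat name
```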
