Twitter sentiment analysis with R

Recently I designed a relatively simple R script that analyzes the content of Twitter posts by counting positive, negative and neutral words. The approach to processing tweets is based on the presentation http://www.slideshare.net/ajayohri/twitter-analysis-by-kaify-rais. The words in each tweet are matched against dictionaries that you can find on the internet or create yourself, and these dictionaries can be edited. It is really great work, but I discovered an issue.

The Twitter API has some limitations. Depending on the total number of tweets you access via the API, you can usually get tweets for the last 7-8 days only (sometimes for just 1-2 days). This 7-8 day limit doesn’t allow us to analyze historical trends.

My idea is to create a storage file that accumulates historical data and works around the API’s limitations. If you extract tweets regularly, you can analyze the dynamics of sentiment with a chart like this one:

plot

Furthermore, the algorithm includes a function that lets you extract tweets for as many keywords as you are interested in. The process can be repeated several times a day, and the data set for each keyword is saved separately. This can be helpful, for example, for competitor analysis.
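
The storage-file idea can be tried out without touching the Twitter API. The following is only a minimal sketch under stated assumptions: the two data frames stand in for two daily extractions, and the helper name append_to_stack and the file name keyword_stack.csv are hypothetical, not part of the script below.

```r
# Toy stand-ins for two daily extractions (hypothetical data, no API call).
day1 <- data.frame(text = c("great phone", "bad battery"),
                   created = c("2015-01-20", "2015-01-20"),
                   stringsAsFactors = FALSE)
day2 <- data.frame(text = c("bad battery", "love the upgrade"),
                   created = c("2015-01-20", "2015-01-21"),
                   stringsAsFactors = FALSE)

# Hypothetical storage file in a temporary folder; start clean for the demo.
stack_file <- file.path(tempdir(), "keyword_stack.csv")
if (file.exists(stack_file)) file.remove(stack_file)

# Append one extraction to the storage file, keeping each tweet text once.
append_to_stack <- function(new_tweets, path) {
  if (file.exists(path)) {
    old <- read.csv(path, stringsAsFactors = FALSE)
    new_tweets <- rbind(old, new_tweets)
  }
  new_tweets <- new_tweets[!duplicated(new_tweets$text), ]
  write.csv(new_tweets, file = path, row.names = FALSE)
  invisible(new_tweets)
}

append_to_stack(day1, stack_file)
stack <- append_to_stack(day2, stack_file)
print(stack)
```

The tweet that appears in both extractions is stored once, while the stack now covers two dates even though each single extraction covered at most two.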

Let’s start. We need to create a Twitter Application (https://apps.twitter.com/) in order to get access to Twitter’s API. This gives us a Consumer Key and a Consumer Secret.

#connect all libraries
 library(twitteR)
 library(ROAuth)
 library(plyr)
 library(dplyr)
 library(stringr)
 library(ggplot2)
#connect to API
download.file(url='http://curl.haxx.se/ca/cacert.pem', destfile='cacert.pem')
reqURL <- 'https://api.twitter.com/oauth/request_token'
accessURL <- 'https://api.twitter.com/oauth/access_token'
authURL <- 'https://api.twitter.com/oauth/authorize'
consumerKey <- '____________' #put the Consumer Key from Twitter Application
consumerSecret <- '______________' #put the Consumer Secret from Twitter Application
Cred <- OAuthFactory$new(consumerKey=consumerKey,
                         consumerSecret=consumerSecret,
                         requestURL=reqURL,
                         accessURL=accessURL,
                         authURL=authURL)
Cred$handshake(cainfo = system.file('CurlSSL', 'cacert.pem', package = 'RCurl')) #A URL appears in the Console. Open it, get the code and enter it in the Console
save(Cred, file='twitter authentication.Rdata')
load('twitter authentication.Rdata') #Once you have run the code the first time, you can start from this line in the future (libraries should be connected)
#newer twitteR versions stop here with 'ROAuth is no longer used in favor of httr'; see ?setup_twitter_oauth
registerTwitterOAuth(Cred)
#the function for extracting and analyzing tweets
search <- function(searchterm)
{
  #extract tweets and create storage file
  #(paste() inserts a space, so the file is named '<keyword> _stack.csv')
  list <- searchTwitter(searchterm, cainfo='cacert.pem', n=1500)
  df <- twListToDF(list)
  df <- df[, order(names(df))]
  df$created <- strftime(df$created, '%Y-%m-%d')
  if (!file.exists(paste(searchterm, '_stack.csv'))) write.csv(df, file=paste(searchterm, '_stack.csv'), row.names=FALSE)
  #merge the last extraction with storage file and remove duplicates
  stack <- read.csv(file=paste(searchterm, '_stack.csv'))
  stack <- rbind(stack, df)
  stack <- subset(stack, !duplicated(stack$text))
  write.csv(stack, file=paste(searchterm, '_stack.csv'), row.names=FALSE)
  #tweets evaluation function
  score.sentiment <- function(sentences, pos.words, neg.words, .progress='none')
  {
    require(plyr)
    require(stringr)
    scores <- laply(sentences, function(sentence, pos.words, neg.words){
      #remove punctuation, control characters and digits, then lowercase
      sentence <- gsub('[[:punct:]]', "", sentence)
      sentence <- gsub('[[:cntrl:]]', "", sentence)
      sentence <- gsub('\\d+', "", sentence)
      sentence <- tolower(sentence)
      #split the tweet into words and match them against the dictionaries
      word.list <- str_split(sentence, '\\s+')
      words <- unlist(word.list)
      pos.matches <- match(words, pos.words)
      neg.matches <- match(words, neg.words)
      pos.matches <- !is.na(pos.matches)
      neg.matches <- !is.na(neg.matches)
      #score = number of positive words minus number of negative words
      score <- sum(pos.matches) - sum(neg.matches)
      return(score)
    }, pos.words, neg.words, .progress=.progress)
    scores.df <- data.frame(score=scores, text=sentences)
    return(scores.df)
  }
  pos <- scan('C:/___________/positive-words.txt', what='character', comment.char=';') #folder with positive dictionary
  neg <- scan('C:/___________/negative-words.txt', what='character', comment.char=';') #folder with negative dictionary
  pos.words <- c(pos, 'upgrade')
  neg.words <- c(neg, 'wtf', 'wait', 'waiting', 'epicfail')
  Dataset <- stack
  Dataset$text <- as.factor(Dataset$text)
  scores <- score.sentiment(Dataset$text, pos.words, neg.words, .progress='text')
  write.csv(scores, file=paste(searchterm, '_scores.csv'), row.names=TRUE) #save evaluation results
  #total score calculation: positive / negative / neutral
  stat <- scores
  stat$created <- stack$created
  stat$created <- as.Date(stat$created)
  stat <- mutate(stat, tweet=ifelse(stat$score > 0, 'positive', ifelse(stat$score < 0, 'negative', 'neutral')))
  by.tweet <- group_by(stat, tweet, created)
  by.tweet <- summarise(by.tweet, number=n())
  write.csv(by.tweet, file=paste(searchterm, '_opin.csv'), row.names=TRUE)
  #chart
  ggplot(by.tweet, aes(created, number)) + geom_line(aes(group=tweet, color=tweet), size=2) +
    geom_point(aes(group=tweet, color=tweet), size=4) +
    theme(text = element_text(size=18), axis.text.x = element_text(angle=90, vjust=1)) +
    #stat_summary(fun.y = 'sum', fun.ymin='sum', fun.ymax='sum', colour = 'yellow', size=2, geom = 'line') +
    ggtitle(searchterm)
  ggsave(file=paste(searchterm, '_plot.jpeg'))
}
search("______") #enter keyword
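
To see what the scoring step does on its own, here is a small sketch that can be run without Twitter access. It is a simplified base-R variant of the evaluation function above, not the function itself: the helper name score_one, the two tiny dictionaries and the three sample tweets are made up for illustration.

```r
# Simplified base-R variant of the scoring step (no plyr/stringr needed).
score_one <- function(sentence, pos.words, neg.words) {
  # strip punctuation, control characters and digits, as in the post
  sentence <- gsub("[[:punct:]]", "", sentence)
  sentence <- gsub("[[:cntrl:]]", "", sentence)
  sentence <- gsub("\\d+", "", sentence)
  # lowercase and split on whitespace
  words <- unlist(strsplit(tolower(sentence), "\\s+"))
  # score = positive matches minus negative matches
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

# Toy dictionaries and tweets (hypothetical, for illustration only).
pos.words <- c("good", "great", "upgrade")
neg.words <- c("bad", "wait", "epicfail")
tweets <- c("Great upgrade, really good!",
            "Bad service... still have to wait",
            "Just landed at the airport")

scores <- vapply(tweets, score_one, numeric(1),
                 pos.words = pos.words, neg.words = neg.words,
                 USE.NAMES = FALSE)
print(scores)
```

A tweet with no dictionary words gets a score of zero, which is why a dictionary in the wrong language produces mostly neutral results.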

Finally we will get four files:

  • a storage file with the initial data,
  • a file with the tweet ratings,
  • a file with the number of tweets of each type (positive / negative / neutral) by date,
  • and the chart, which looks like this:

plot
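
The third file (tweets of each type by date) comes from labelling each score and counting. The sketch below shows that step in base R with aggregate() instead of dplyr; the scores and dates are invented, and the dates are kept as plain strings for simplicity.

```r
# Invented scores with their dates (kept as character strings here).
stat <- data.frame(
  score   = c(2, 1, -1, 0, 1, -3, 0),
  created = c("2015-01-20", "2015-01-20", "2015-01-20", "2015-01-20",
              "2015-01-21", "2015-01-21", "2015-01-21"),
  stringsAsFactors = FALSE
)

# Label each tweet by the sign of its score.
stat$tweet <- ifelse(stat$score > 0, "positive",
                     ifelse(stat$score < 0, "negative", "neutral"))

# Count tweets per type and date (base-R counterpart of group_by + summarise).
by.tweet <- aggregate(score ~ tweet + created, data = stat, FUN = length)
names(by.tweet)[names(by.tweet) == "score"] <- "number"
print(by.tweet)
```

Each row of by.tweet is one point on the chart, so a line for a given type only appears once that type has counts on at least two dates.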

  • WhatUpData

    I keep getting “geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?” The chart it creates only shows one date and 3 points for pos, neg, neutral.

    • AnalyzeCore

      What was the keyword? And how many tweets are in the stack file?

    • WhatUpData

      #RepublicanResponse, and it says read 2006 items and 4783 items. The stack file has 507 lines. How can you tell how many tweets are in the file? I ran it again and now it has 2 points and 2 dates, Jan 20 and 21. I’m guessing the keyword will need to be older than 2 days to gain more points? New to all this, so sorry if my info is incomplete.

    • AnalyzeCore

      It’s clear to me now. That happened because the Twitter API returned tweets for 1 date only (this is usual for popular keywords). As I mentioned in the post, the API has some limitations and doesn’t let you see much historical data (it can be less than 1 day). That is why I created a stack file. Therefore, you need to extract data frequently (for example, every day) to see a trend. And you won’t see the warning “geom_path:…” once you have data for more than 1 date.
      P.S.: each row in the stack file is a unique tweet.
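
A quick way to check both things discussed here (how many tweets the stack holds and how many dates it covers) is sketched below. The data frame is a made-up stand-in for a stack file read with read.csv(); only the column names text and created are taken from the post.

```r
# Made-up stand-in for a stack file that so far holds a single day of tweets.
stack <- data.frame(text = c("tweet a", "tweet b", "tweet c"),
                    created = c("2015-01-20", "2015-01-20", "2015-01-20"),
                    stringsAsFactors = FALSE)

n_tweets <- nrow(stack)                    # one row per unique tweet
n_dates  <- length(unique(stack$created))  # distinct days covered

# A line needs at least two dates per group; with one date only points show.
if (n_dates < 2) {
  message("Only ", n_dates, " date in the stack so far; ",
          "run the extraction again on another day to get a trend line.")
}
```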

    • WhatUpData

      Ah ok, I see. Thanks for the help and sharing this!

    • Nawied

      Hey! I’m encountering the same problem. I was trying to collect tweets for different days of last week via “from” and “until”, but the stack file seems to replace the previous results rather than add to them. I also tried to collect for 2 days without using “from” and “until” in the Twitter search. What am I doing wrong? Normally you just search the desired word and run the function over several days, right?

      Would appreciate some help.

  • salmi

    Hello, I need your help, please.

    I have a question: when you use the Twitter API you must have a specific Twitter account. Does this mean that I get the same result if I use another Twitter account?

    thanks in advance

    • AnalyzeCore

      Hello! The Twitter developer account doesn’t affect the results that the API returns. It doesn’t matter. The other problem you can face is that the API can return different results for the same account (e.g. it can extract 100 tweets one time and 200 the next time for the same keyword). This is a peculiarity of the Twitter API.

    • salmi

      So if I understand you, this means, for example, that when I search for some word using the same code, like searchTwitter('addidas', n=5), but with different Twitter accounts at the same moment, I will get the same results? In other words, does the approach depend on the Twitter account (old account vs. new account)?

      Thanks you 🙂

    • AnalyzeCore

      This works like search function on Twitter. Would there be some difference in the search result between searching from your account and someone else? You can easily test this by yourself.

    • salmi

      Thanks for the help

  • salmi

    i have an error : Error in tw_from_response(out, …) :

    unused argument (cainfo = “cacert.pem”)

    Called from: twInterfaceObj$doAPICall(cmd, params, “GET”, …)

    Browse[1]>

    Thanks in advance

  • salmi

    Error: invalid input ‘Dadelijk intake toets op í ½í²ª’ in ‘utf8towcs’

    Called from: stop(list(message = “invalid input ‘Dadelijk intake toets op arcus í ½í²ª’ in ‘utf8towcs'”,

    call = NULL, cppstack = NULL))

    Browse[1]

    • AnalyzeCore

      Please follow the instructions carefully, and take into account that this approach works with the English language.

    • salmi

      hello ,

      Thank you very much for your rapid response. You say that this approach works with the English language; does this mean that I cannot use it to search for a French word?

      Thanks in advance for your help sir

    • AnalyzeCore

      I haven’t tried, but I assume there can be some issues with other languages. For example, I’ve read about issues with Russian. Therefore, you may need to adapt the code.

    • salmi

      I use the same code as you to do sentiment analysis for some French words (with the same positive and negative words), but in the graph there are many neutral sentiments. Does this mean that I should use another dictionary of words?

      Thanks in advance, and thanks for sharing this!

    • AnalyzeCore

      There are no dictionaries in this post, only an example of the path to them and of how to add new words. You need to create these dictionaries or find them somewhere (e.g. on the internet), place them on your computer and define the path.

    • salmi

      Hello,
      yes, I found positive and negative words on the internet: https://github.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107/tree/master/data but they are in English. So the question is: should I use a French dictionary (positive and negative words in French) in order to do sentiment analysis for French words?

      thanks in advance

    • AnalyzeCore

      Of course you should!

    • salmi

      Hello,
      Thanks sir for your help.

    • salmi

      Hello,

      i have this Error:
      invalid input ‘Sunshine lollipops â˜€ï¸í ¼í½­â˜€ï¸í ¼í½­’ in ‘utf8towcs’
      Called from: stop(list(message = “invalid input ‘Sunshine lollipops â˜€ï¸í ¼í½­â˜€ï¸í ¼í½­’ in ‘utf8towcs'”,
      call = NULL, cppstack = NULL))

      Thanks in advance.
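
The 'utf8towcs' errors in this thread are raised when text containing bytes that are not valid UTF-8 (here, emoji that arrived in a broken encoding) reaches tolower(). One common workaround, which is not part of the original post, is to drop the offending bytes with base R's iconv() before scoring. The sketch below assumes English-only analysis, because converting to ASCII also removes accented letters; the sample strings are made up.

```r
# Sample tweets: the first one ends with bytes that are not valid UTF-8
# (a broken emoji), the second one is plain ASCII.
tweets <- c("Sunshine lollipops \xed\xa0\xbc\xed\xbd\xad",
            "plain ascii tweet")

# Convert to ASCII and replace every non-convertible byte with nothing.
clean <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")

# tolower() now receives only ASCII text.
lowered <- tolower(clean)
print(lowered)
```

In the post's function this kind of cleaning would have to happen before the tolower(sentence) line. For French or other non-English text it is too aggressive as written, since accented letters are removed together with the broken bytes.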

  • salmi

    Hello,

    To connect to the Twitter API, your code does not work for me, but I use only one line, which works perfectly:

    setup_twitter_oauth('consumerKey', 'consumerSecret')

    instead of:

    download.file(url='http://curl.haxx.se/ca/cacert.pem', destfile='cacert.pem')
    reqURL <- 'https://api.twitter.com/oauth/request_token'
    accessURL <- 'https://api.twitter.com/oauth/access_token'
    authURL <- 'https://api.twitter.com/oauth/authorize'
    consumerKey <- '____________' #put the Consumer Key from Twitter Application
    consumerSecret <- '______________' #put the Consumer Secret from

    I have another question: how do I make a permanent connection so that I do this step only once?

    thank you very much

    • AnalyzeCore

      This post was published a year ago, so there could have been some changes in the connection scheme.
      I haven’t heard about a permanent connection. Try reading the API’s help.

    • salmi

      Thank you, but I do not understand this comment in your code: #Once you launch the code first time, you can start from this line in the future (libraries should be connected)

      Does that mean a permanent connection in the future?
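
The comment in question refers to R's save() and load(): the credential object is written to disk once and read back in later sessions, so the handshake does not have to be repeated. The sketch below shows only that mechanism. The list used as Cred is a placeholder and not a real OAuth object, and the file is written to a temporary folder.

```r
# Placeholder for the credential object (NOT a real OAuth object).
Cred <- list(consumerKey = "dummy-key", consumerSecret = "dummy-secret")

# First session: store the object on disk.
cred_file <- file.path(tempdir(), "twitter authentication.Rdata")
save(Cred, file = cred_file)

# Simulate a new session in which the object does not exist yet.
rm(Cred)

# Later sessions: restore the object under its original name.
load(cred_file)
print(names(Cred))
```

Whether a restored credential object is still accepted depends on the twitteR version and on Twitter's side, so this is a convenience for skipping the setup code rather than a guaranteed permanent connection.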

  • magesh

    registerTwitterOAuth(Cred)
    Error in registerTwitterOAuth(Cred) :
    ROAuth is no longer used in favor of httr, please see ?setup_twitter_oauth

    > how to solve this type of error

    • salmi

      Hello MAGESH,

      you should use this line in order to connect to the Twitter API:

      setup_twitter_oauth('consumerKey', 'consumerSecret')

      Let us know if you find another solution

  • ozz

    This code is not using machine learning methods, right?

    • AnalyzeCore

      It doesn’t use ML.

  • magesh

    yes, this is correct

  • Sanya Goyal

    I have a problem in the 10th block, 7th line of code. I want to know where ‘created’ came from, because no object has been called ‘created’ before.
    Thanks

    • AnalyzeCore

      This is a column of the ‘stat’ data frame, and it comes from fetching the tweets initially (you can find df$created in the 5th block of code).

  • Federico Scenna

    Hi! Thanks for sharing the code! I tried to run it and at first it ran OK, but then I tried other search words and the RStudio session hit a fatal error. Any help?
    Thanks beforehand!

    • AnalyzeCore

      Try updating all packages and R to the latest version. Also, I haven’t checked this code with languages other than English. Therefore, if you used another language there could be some issues.

  • kmrychl

    What if I have multiple search terms? Which section do I need to run multiple times? For example, I run list <- search term '#music', then create a df, then push it to the stack. Then I want to run list <- search term '#band' and combine this with the music list. Will the final file be called stack?

    • AnalyzeCore

      I’m not a fan of loops, but if you can’t change the current function yourself, I suggest you use a loop at the end of the script. Try adding something like this instead of search('______') #enter keyword

      words.list <- c('word1', 'word2', 'word3')
      df.results <- data.frame()
      for (i in seq_along(words.list)) {
        search(words.list[i])
        filename <- paste(words.list[i], '_stack.csv')
        df.temp <- read.csv(filename)
        df.results <- rbind(df.results, df.temp)
        rm(df.temp)
      }

      In this way you don't need to change the other code.
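
For readers who prefer to avoid the explicit loop, the combining step can also be written with lapply() and do.call(rbind, ...). The sketch below covers only the part that reads and merges the stack files, and it fakes those files in a temporary folder instead of calling search(). The helpers stack_path and read_stack and the extra keyword column are assumptions added for illustration.

```r
words.list <- c("word1", "word2")

# Hypothetical helper: same '<keyword> _stack.csv' naming as in the post,
# but inside a temporary folder.
stack_path <- function(word) file.path(tempdir(), paste(word, "_stack.csv"))

# Fake stack files so the sketch runs without the Twitter API.
for (w in words.list) {
  fake <- data.frame(text = paste(w, "tweet", 1:2), created = "2015-01-20")
  write.csv(fake, file = stack_path(w), row.names = FALSE)
}

# Read one stack file and remember which keyword it belongs to.
read_stack <- function(word) {
  df <- read.csv(stack_path(word), stringsAsFactors = FALSE)
  df$keyword <- word
  df
}

# Read all stack files and bind them into one data frame.
df.results <- do.call(rbind, lapply(words.list, read_stack))
print(df.results)
```

Keeping the keyword in its own column makes it possible to tell the tweets apart after they are combined, which the plain rbind() loop does not do.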

    • kmrychl

      Thanks… Also, when running:
      pos <- scan('C:/___________/positive-words.txt', what='character', comment.char=';')

      I get that positive-words.txt does not exist. Do I need to create this file first? Sorry, I am new to R!

    • AnalyzeCore

      You can download dictionaries from the internet: just google positive and negative words, or find a link in the comments to this post.

  • Ali Hadi Al-geboory

    Hi
    Please can you help me install the R extension in RapidMiner on Mac 10.10? And do you have any information on how I can match a dictionary against a dataset?
    thanks

  • Quantguy

    Hey Brly, thanks a lot for this wonderful code. But I am stuck somewhere near authentication.
    I use the code registerTwitterOAuth(Cred)
    but I get the error:
    Error in registerTwitterOAuth(Cred) :
    ROAuth is no longer used in favor of httr, please see ?setup_twitter_oauth

    Seeing this query in this thread, I used setup_twitter_oauth('consumerKey', 'consumerSecret') and it says
    Error in init_oauth1.0(self$endpoint, self$app, permission = self$params$permission, :
    Unauthorized (HTTP 401).

    Can anybody help me out

    • Quantguy

      Hey, I was just working around this and found a way: using setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret) instead of registerTwitterOAuth(Cred).

      But now I am held up at line 46:
      + sentence <- gsub('\d+', "", sentence)

      which throws an error saying:
      Error: '\d' is an unrecognized escape in character string starting "'\d"

      What shall I do?

  • Deepti Nimmagadda

    Hii!!!
    I am trying to access the positive words negative words text files and I keep getting the error no such file or directory even when I unpacked the files into the current working directory. I keep trying to resolve this error but I still can’t make sense of this.

    I managed to download tweets the only thing I can’t do is access these files. I did the very same thing as you have in the above code but no luck.
    please help!!!

    • AnalyzeCore

      Double-check the RStudio working directory, the dictionary file names and the file types. Also, “/” and “\” are used differently on Windows and Mac OS.

    • Deepti Nimmagadda

      Thanks for the help!!! I figured it out actually! It was the exact same thing you said. ‘/’ is where I went wrong. It only took me 3 hours after posting the first comment to figure it out and that only happened after I took a look at the entire thing with a clear head.

      Actually putting the path in the scan() worked for me. I simply included the name of the file as I’ve seen done on the internet but that didn’t work out so well for me. I thought simply unpacking it into my working directory would do it but no such luck. It was stroke of pure luck a friend of mine made me try putting the path!

      So I’m finally making the headway needed for my project and it’s a huge weight off my shoulders. People like you who post this on the internet are pure life savers because we are able to adapt the code you post and make headway into our projects. Thanks for the all the work you put into sites like these and for the quick reply!!!
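
A small check like the one below can save some of that debugging time. It is a sketch under stated assumptions: the dictionary is a two-word toy file written to a temporary folder, whereas a real dictionary would sit wherever it was downloaded to. file.path() builds the path with forward slashes, which R accepts on Windows as well as on Mac OS.

```r
# Toy dictionary written to a temporary folder (stand-in for the real file).
dict_dir <- tempdir()
pos_file <- file.path(dict_dir, "positive-words.txt")
writeLines(c("; lines starting with a semicolon are comments", "good", "great"),
           pos_file)

# Fail early with the full path if the file is not where R is looking.
if (!file.exists(pos_file)) {
  stop("Dictionary not found: ", normalizePath(pos_file, mustWork = FALSE),
       " (working directory is ", getwd(), ")")
}

# Same scan() call as in the post; comment lines are skipped.
pos <- scan(pos_file, what = "character", comment.char = ";", quiet = TRUE)
print(pos)
```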

  • Renato Falcon Lyke

    Hi
    I am performing some sentiment analysis based on a CSV file. I am able to get the overall sentiment analysis. I wish to perform sentiment analysis for each record and then group by another column.
    E.g. Column A has Direction (North, South etc.) and Column B has Comments for each record.

    How do I tie the sentiments to each direction in Column A?

    Regards,
    Ren