Twitter sentiment analysis with Machine Learning in R using doc2vec approach (part 1)

Recently I’ve worked with the word2vec and doc2vec algorithms, which I found interesting from many perspectives. Even though I used them for another purpose, the main thing they were developed for is text analysis. I’ve noticed that my 2014 article on Twitter sentiment analysis is still one of the most popular posts on this blog. Therefore, I decided to update it with a more modern approach.

The problem with the previous method is that it simply counts positive and negative words and draws a conclusion from their difference. With such a simple vocabulary-based approach, the phrase “not bad” gets a negative score.
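A minimal base-R sketch of why that fails (the word lists here are hypothetical, not the lexicon from the 2014 post):

```r
# Toy illustration: a naive lexicon score counts positive minus
# negative words, so the negation in "not bad" is lost.
positive_words <- c("good", "great", "love")
negative_words <- c("bad", "awful", "hate")

naive_score <- function(text) {
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  sum(tokens %in% positive_words) - sum(tokens %in% negative_words)
}

naive_score("not bad")  # -1: "bad" is counted, "not" is ignored
```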

doc2vec, on the other hand, is a deep learning algorithm that draws context from phrases. It is currently one of the best-performing approaches to sentiment classification of movie reviews. You can use the same method to analyze feedback, reviews, comments, and so on, and you can expect better results there than with tweets, which usually contain a lot of misspellings.

We’ll use tweets for this example because they’re pretty easy to get via the Twitter API. We only need to create an app (via the My apps menu) and find the API Key, API Secret, Access Token and Access Token Secret on the Keys and Access Tokens tab.

First, I’d like to give credit to Dmitry Selivanov, the author of the great text2vec R package that we’ll use for sentiment analysis.

You can download a set of 1.6 million classified tweets here and use them to train a model. Before we start the analysis, I want to draw your attention to how the tweets were classified. There are two grades of sentiment: 0 (negative) and 4 (positive), so none of them are neutral. I suggest using a probability of positiveness instead of a class. In this case, we get a range of values from 0 (completely negative) to 1 (completely positive) and can assume that values from 0.35 to 0.65 are somewhere in the middle, i.e. neutral.
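This mapping can be sketched in base R as:

```r
# Sketch: mapping a predicted probability of positiveness to a label,
# using the 0.35 / 0.65 cut-offs suggested above.
sentiment_label <- function(p) {
  ifelse(p < 0.35, "negative",
         ifelse(p > 0.65, "positive", "neutral"))
}

sentiment_label(c(0.1, 0.5, 0.9))  # "negative" "neutral" "positive"
```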

The following is the R code for training the model on a Document-Term Matrix (DTM), the result of vocabulary-based vectorization. In addition, we will apply TF-IDF weighting as a preprocessing step. Note that model training can take up to an hour, depending on your computer’s configuration:
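Before the full script, here is a toy hand-rolled illustration of the TF-IDF idea (text2vec’s TfIdf class does this for real data, with a smoothed variant, so exact numbers differ):

```r
# A 2-document DTM: "happy" appears only in doc1, "the" in both docs.
dtm <- matrix(c(1, 0,
                1, 1),
              nrow = 2, byrow = FALSE,
              dimnames = list(c("doc1", "doc2"), c("happy", "the")))

tf  <- dtm / rowSums(dtm)                 # term frequency per document
idf <- log(nrow(dtm) / colSums(dtm > 0))  # inverse document frequency
tfidf_manual <- sweep(tf, 2, idf, `*`)

tfidf_manual["doc1", "the"]  # 0: a word present in every document carries no weight
```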


# loading packages
library(twitteR)
library(ROAuth)
library(tidyverse)
library(purrrlyr)
library(text2vec)
library(caret)
library(glmnet)
library(ggrepel)

### loading and preprocessing a training set of tweets
# function for converting some symbols
conv_fun <- function(x) iconv(x, "latin1", "ASCII", "")

##### loading classified tweets ######
# source:
# 0 - the polarity of the tweet (0 = negative, 4 = positive)
# 1 - the id of the tweet
# 2 - the date of the tweet
# 3 - the query. If there is no query, then this value is NO_QUERY.
# 4 - the user that tweeted
# 5 - the text of the tweet

tweets_classified <- read_csv('training.1600000.processed.noemoticon.csv',
 col_names = c('sentiment', 'id', 'date', 'query', 'user', 'text')) %>%
 # converting some symbols
 dmap_at('text', conv_fun) %>%
 # replacing class values
 mutate(sentiment = ifelse(sentiment == 0, 0, 1))

# there are some tweets with NA ids that we replace with dummies
tweets_classified_na <- tweets_classified %>%
 filter(is.na(id) == TRUE) %>%
 mutate(id = c(1:n()))
tweets_classified <- tweets_classified %>%
 filter(!is.na(id)) %>%
 rbind(., tweets_classified_na)

# data splitting on train and test
trainIndex <- createDataPartition(tweets_classified$sentiment, p = 0.8, 
 list = FALSE, 
 times = 1)
tweets_train <- tweets_classified[trainIndex, ]
tweets_test <- tweets_classified[-trainIndex, ]

##### Vectorization #####
# define preprocessing function and tokenization function
prep_fun <- tolower
tok_fun <- word_tokenizer

it_train <- itoken(tweets_train$text, 
 preprocessor = prep_fun, 
 tokenizer = tok_fun,
 ids = tweets_train$id,
 progressbar = TRUE)
it_test <- itoken(tweets_test$text, 
 preprocessor = prep_fun, 
 tokenizer = tok_fun,
 ids = tweets_test$id,
 progressbar = TRUE)

# creating vocabulary and document-term matrix
vocab <- create_vocabulary(it_train)
vectorizer <- vocab_vectorizer(vocab)
dtm_train <- create_dtm(it_train, vectorizer)
dtm_test <- create_dtm(it_test, vectorizer)
# define tf-idf model
tfidf <- TfIdf$new()
# fit the model to the train data and transform it with the fitted model
dtm_train_tfidf <- fit_transform(dtm_train, tfidf)
dtm_test_tfidf <- fit_transform(dtm_test, tfidf)

# train the model
t1 <- Sys.time()
glmnet_classifier <- cv.glmnet(x = dtm_train_tfidf,
 y = tweets_train[['sentiment']], 
 family = 'binomial', 
 # L1 penalty
 alpha = 1,
 # interested in the area under ROC curve
 type.measure = "auc",
 # 5-fold cross-validation
 nfolds = 5,
 # high value is less accurate, but has faster training
 thresh = 1e-3,
 # again lower number of iterations for faster training
 maxit = 1e3)
print(difftime(Sys.time(), t1, units = 'mins'))

print(paste("max AUC =", round(max(glmnet_classifier$cvm), 4)))

preds <- predict(glmnet_classifier, dtm_test_tfidf, type = 'response')[ ,1]
# glmnet's auc() helper is not exported, hence the triple colon
glmnet:::auc(as.numeric(tweets_test$sentiment), preds)

# save the model for future using
saveRDS(glmnet_classifier, 'glmnet_classifier.RDS')

As you can see, the AUC on both the train and test datasets is pretty high (0.876 and 0.875). Note that we saved the model, so you don’t need to retrain it every time you want to assess some tweets. Next time you do sentiment analysis, you can start with the script below.
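If you prefer not to rely on unexported helpers for the test-set AUC, the rank-sum identity gives the same number in base R; a sketch:

```r
# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive example is scored above a random negative one.
auc_manual <- function(labels, scores) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_manual(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))  # 0.75
```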

Ok, once the model is trained and validated, we can use it. For this, we start by fetching tweets via the Twitter API and preprocessing them in the same way as the classified tweets. For instance, the company I work for has just released an ambitious product for Mac users, so it’s interesting to analyze how tweets about SetApp are rated.

### fetching tweets ###
download.file(url = "",
destfile = "cacert.pem")
setup_twitter_oauth('your_api_key', # api key
'your_api_secret', # api secret
'your_access_token', # access token
'your_access_token_secret') # access token secret

df_tweets <- twListToDF(searchTwitter('setapp OR #setapp', n = 1000, lang = 'en')) %>%
# converting some symbols
dmap_at('text', conv_fun)

# preprocessing and tokenization
it_tweets <- itoken(df_tweets$text,
preprocessor = prep_fun,
tokenizer = tok_fun,
ids = df_tweets$id,
progressbar = TRUE)

# creating vocabulary and document-term matrix
dtm_tweets <- create_dtm(it_tweets, vectorizer)

# transforming data with tf-idf
dtm_tweets_tfidf <- fit_transform(dtm_tweets, tfidf)

# loading classification model
glmnet_classifier <- readRDS('glmnet_classifier.RDS')

# predict probabilities of positiveness
preds_tweets <- predict(glmnet_classifier, dtm_tweets_tfidf, type = 'response')[ ,1]

# adding rates to initial dataset
df_tweets$sentiment <- preds_tweets

And finally, we can visualize the result with the following code:

# color palette
cols <- c("#ce472e", "#f05336", "#ffd73e", "#eec73a", "#4ab04a")

samp_ind <- sample(c(1:nrow(df_tweets)), nrow(df_tweets) * 0.1) # 10% for labeling

# plotting
ggplot(df_tweets, aes(x = created, y = sentiment, color = sentiment)) +
theme_minimal() +
scale_color_gradientn(colors = cols, limits = c(0, 1),
breaks = seq(0, 1, by = 1/4),
labels = c("0", round(1/4*1, 1), round(1/4*2, 1), round(1/4*3, 1), round(1/4*4, 1)),
guide = guide_colourbar(ticks = T, nbin = 50, barheight = .5, label = T, barwidth = 10)) +
geom_point(aes(color = sentiment), alpha = 0.8) +
geom_hline(yintercept = 0.65, color = "#4ab04a", size = 1.5, alpha = 0.6, linetype = "longdash") +
geom_hline(yintercept = 0.35, color = "#f05336", size = 1.5, alpha = 0.6, linetype = "longdash") +
geom_smooth(size = 1.2, alpha = 0.2) +
geom_label_repel(data = df_tweets[samp_ind, ],
aes(label = round(sentiment, 2)),
fontface = 'bold',
size = 2.5,
max.iter = 100) +
theme(legend.position = 'bottom',
legend.direction = "horizontal",
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(size = 20, face = "bold", vjust = 2, color = 'black', lineheight = 0.8),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16),
axis.text.y = element_text(size = 8, face = "bold", color = 'black'),
axis.text.x = element_text(size = 8, face = "bold", color = 'black')) +
ggtitle("Tweets Sentiment rate (probability of positiveness)")

The green line is the boundary of positive tweets and the red one is the boundary of negative tweets. In addition, tweets are colored red (negative), yellow (neutral) or green (positive). As you can see, most of the tweets sit around the green boundary, which means they tend to be positive.
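The same cut-offs can be used to count tweets per band; a sketch with toy probabilities (in the real script you would pass df_tweets$sentiment instead):

```r
# Counting tweets per sentiment band with the 0.35 / 0.65 cut-offs.
sentiment <- c(0.12, 0.40, 0.70, 0.91, 0.55)  # toy predicted probabilities
table(cut(sentiment,
          breaks = c(0, 0.35, 0.65, 1),
          labels = c("negative", "neutral", "positive"),
          include.lowest = TRUE))
```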

To be continued…


  • amrit shukla

    The line dmap_at('text', conv_fun) throws the error “Error: unrecognised index type”

    The line mutate(sentiment = ifelse(sentiment == 0, 0, 1)) throws “Error in mutate_(.data, .dots = lazyeval::lazy_dots(…)) : argument “.data” is missing, with no default”


  • Chri H.

    It would be great if you could also provide the pre-trained model as a download

    • Nawied

      If you give me your e-mail address I can send it to you


  • Amit Kumar

    Dear, please share the code with me

    • AnalyzeCore

      Please find the code in the article

  • Gurisht Singh

    Great article. Trying to get into text analytics by studying your code.
    Facing a problem though.
    In the 2nd portion of the code, where the search for “SetApp” is initiated, I am facing the following:

    Warning message:
    In doRppAPICall(“search/tweets”, n, params = params, retryOnRateLimit = retryOnRateLimit, :
    1000 tweets were requested but the API can only return 257

    Is there any way I can increase this limit? I want the sample to be much larger (pref. 2000).

    A prompt reply would be very appreciated, thanks!

    • Nawied

      For the first question: it’s true. The warning you’re receiving will always pop up whenever the number of requested tweets is larger than the number actually available.

  • Sreten C

    tfidf <- TfIdf$new()
    gives me an error: object 'TfIdf' not found
    I tried both 0.3.0 and 0.4.0 versions of text2vec
    Also I would really appreciate if you could send me the trained model, as well. Thanks

    • Nawied

      I’m not the author, but the code works for me. If you want, I can send you the trained model. Your e-mail address?

  • Pingback: Analyzing F1 tweets – F1 predictor()

  • Hi Sergey

    I came across your blog post at R-Bloggers and decided to play a bit with the code you shared.
    First of all thanks for sharing it and writing this, it is a great help to less proficient R users like me.
    I have tried to apply the code to couple subject areas and got the results I want to share:
    – When I used keywords from my business area, names of eCommerce platforms like Magento, Shopify and Hybris, the average sentiment score was suspiciously similar, around 0.66
    – Then I applied the code to a more conventional topic and used the keywords “trump” and “#trump”. The resulting analysis shows a much more positive attitude to Mr. Trump than I expected (image attached)

    When I started to check the individual tweet scores, I found that the sentiment score assigned by the program doesn’t reflect the actual sentiment (at least as I, as a human being, assess it) very well.

    Couple examples:

    RT @PGourevitch: Vampire president, having sucked life essence out of current loyalists, tosses their dry husks, craving fresh blood..”

    which is quite negative, got a sentiment score of 0.57.

    On the other hand, the more moderate tweet
    RT @FiveThirtyEight: Trump’s health care bill could hurt Republicans more than Obamacare hurt Dems earned a clearly negative score of 0.26

    RT @LouiseMensch: Meet @JackPosobiec and @EzraLevant. supreme Trump-Russia trolls who brought us #MacronLeaks! shall we say “Salut #DGSE” was considered very positive – 0.866

    and this is just what I’ve seen on the 1st page of the results.

    So IMO, the results should be used with extreme caution, probably the training set used on the model doesn’t apply very well for all subject areas.

    • AnalyzeCore

      Hi Alex,

      Thank you for the feedback, and my apologies for the late answer! I totally agree with you that the results should be used with extreme caution and the higher level of attention that you’ve demonstrated.

      In addition, I have an assumption as to why some incorrect results were obtained. I think the main reason is that the model was trained on a document-term matrix, which means the algorithm only used the fact that a word was present in the tweet, and nothing more. On the other hand, tweets are generally quite difficult to work with because of their specifics: a lot of abbreviations, misprints, their short length, and so on.

      It would be interesting to see the results with another approach to vectorization, which I’m planning to implement.

      Thank you!
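One way to let a DTM capture short negations like “not bad” is to add bigrams to the vocabulary (in text2vec, roughly create_vocabulary(it_train, ngram = c(1L, 2L)); check the package docs). A minimal base-R sketch of the idea, with a hypothetical token vector:

```r
# Bigrams keep "not bad" together as a single vocabulary entry,
# so the classifier can learn a weight for the negated phrase.
tokens <- c("not", "bad", "at", "all")
bigrams <- paste(head(tokens, -1), tail(tokens, -1), sep = "_")
bigrams  # "not_bad" "bad_at" "at_all"
```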

    • No worries Sergey. Looking forward to seeing how other approaches work.

    • gopal wunnava

      I agree with Alex’s assessment. I obtained far more accurate results using other R packages like syuzhet, sentimentr and RSentiment, which do not require any training but leverage lexicon-based techniques instead. I think such methods are better suited to the semantics of tweets, which require more sophisticated methods like valence shifting. However, I believe the use of CNNs treating text as “images” seems far more promising than text2vec in the world of deep learning and NLP. Just my 2 cents.

    • AnalyzeCore

      Thanks for the comment. The thing is that I published lexicon-based approaches (one of them using valence values) a few years ago. It was interesting to test word/text vectorization with ML this time.

    • gopal wunnava

      I found it interesting as well. If you could find ways to more accurately represent the underlying sentiment in the tweet I think it would be super useful.

  • rgomesf

    Hi Sergey. I can’t make the first script work 🙁

    I get this error:
    Error in dmap_at(., “text”, conv_fun) : could not find function “dmap_at”

    I have purrr package installed. Can you help?


    • AnalyzeCore

      There have been some changes in purrr (latest version): dmap() was moved to the purrrlyr package. Therefore, try running the code with the previous version 0.2.2 until I adapt the code to this change.

    • rgomesf

      Thanks. With purrrlyr I can now run past that point. I’m waiting for the model to finish training. 🙂

    • Haydn

      Did you just install purrrlyr to get it to work? Mine won’t budge at all; I’m still getting the same error despite installing the package.

    • rgomesf

      Hi. Yes, I think I only installed purrrlyr to make that error go away. I ended up using Nawied’s trained model. I got some more errors after that, which I can’t remember exactly now.

    • Nawied

      With the trained model you can pass that point and start with fetching the tweets. All you need is the trained model, though. Otherwise, did you try library(purrrlyr) without having plyr and dplyr in your library?

  • Bhupendra Kumar

    @Serg79:disqus Thank you so much for this post. I was thinking last night about implementing ML algorithms for Twitter sentiment analysis, but I was stuck as I did not have sample data. Your post has provided what I was looking for. Some of the code I do not understand, but I will try to understand it, and if I have questions I will ask you.

    Once again, thank you so much!


  • Bhupendra Kumar

    I am new to R, so: what does this error message mean? (I got it after training the model)

    from glmnet Fortran code (error code -51); Convergence for 51th lambda value not reached after maxit=1000 iterations; solutions for larger lambdas returned

    • Jinwoo Park

      I am having the same problem (the error code -51) but have no clue at all. Did you find your answer to solve this problem? Thank you

    • Jinwoo Park

      I just tried the same code again and now it’s working. I guess my computer made some mistakes while executing the code. So if you have the same problem as me, just run the code once again 🙂

  • Bhupendra Kumar

    @Serg79:disqus can I do a similar thing for Facebook?

    • AnalyzeCore

      Why not? But I strongly recommend training a model on Facebook sentences, because of the specifics of tweets. I don’t think you will obtain good results if you train the model on tweets and apply it to Facebook. In any case, you need labeled Facebook sentences for model validation, so you can check my assumption.

    • Bhupendra Kumar

      Thank you! I will give it a shot 🙂 and will share my results with you. Thanks again!

  • X X

    How can I fix this? I got it after training the model

    Warning messages:
    1: In rbind(names(probs), probs_f) :
    number of columns of result is not a multiple of vector length (arg 1)
    2: from glmnet Fortran code (error code -51); Convergence for 51th lambda value not reached after maxit=1000 iterations; solutions for larger lambdas returned
    3: from glmnet Fortran code (error code -49); Convergence for 49th lambda value not reached after maxit=1000 iterations; solutions for larger lambdas returned
    4: from glmnet Fortran code (error code -49); Convergence for 49th lambda value not reached after maxit=1000 iterations; solutions for larger lambdas returned
    5: from glmnet Fortran code (error code -50); Convergence for 50th lambda value not reached after maxit=1000 iterations; solutions for larger lambdas returned
    6: from glmnet Fortran code (error code -49); Convergence for 49th lambda value not reached after maxit=1000 iterations; solutions for larger lambdas returned
    7: from glmnet Fortran code (error code -49); Convergence for 49th lambda value not reached after maxit=1000 iterations; solutions for larger lambdas returned

  • Haydn

    I can’t get this code running at all, unfortunately. I get the following error message: In rbind(names(probs), probs_f) :
    number of columns of result is not a multiple of vector length (arg 1)

    For some reason, I can’t parse the csv.

    • Haydn

      Error in function_list[[i]](value) : could not find function “dmap_at”

      And this one too; I also installed purrr as per the below.

  • Tarik

    I’m running into a similar issue as a previous poster, except in a different location.
    Line 12 of the “fetching tweets” section returns ‘Error: unrecognised index type’. Is this an issue with the dimensions of the file I’m pulling from the URL?

    Thank you!

    • Tarik

      @Serg79:disqus or is this just an issue with dmap? I’m currently using the purrrlyr package as listed in your code

    • Nawied

      it’s a typo; the line should read:

      df_tweets <- twListToDF(searchTwitter('setapp OR #setapp', n = 1000, lang = 'en')) %>%
       # converting some symbols
       dmap_at('text', conv_fun)

      With purrrlyr it should work.

    • Tarik

      thank you so much!

  • AnalyzeCore

    I’ve added library(purrrlyr) with dmap_at() function to the code

  • A Bayesian Bajan!

    Hi Sergey,

    I’ve been studying your model for the last two days with the aim of running it for a work-related task. One concern I have is how you would propose handling RTs (retweets).

    For some hashtag searches, particularly in sports, a hashtag may be included in tweets from sports news outlets/writers, and thus an abundance of the tweets in a sample may be the same tweet many times. For example, I ran it against a basketball-related hashtag, and of 1000 tweets in the sample, only 26 were not retweeted at all; over 600 of the tweets were retweeted at least 100 times.

    This sample then can’t be used to gather real sentiment when the majority of the sample is just news or content, or perhaps one opinion being repeated.

    Appreciate your thoughts on how you might combat this within your model!


    • AnalyzeCore

      You can easily remove retweets by applying filter(isRetweet == FALSE) to df_tweets
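A minimal base-R sketch of the same idea, assuming the data frame has the isRetweet and text columns that twListToDF() returns:

```r
# Drop retweets and duplicated texts from a fetched data frame
# (toy data; column names follow twitteR's twListToDF output).
df <- data.frame(text = c("great app!", "RT great app!", "great app!"),
                 isRetweet = c(FALSE, TRUE, FALSE),
                 stringsAsFactors = FALSE)

df_clean <- df[!df$isRetweet & !duplicated(df$text), ]
nrow(df_clean)  # 1
```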

    • A Bayesian Bajan!

      Hi Sergey,

      I should’ve mentioned that, after my post, I did simply apply a filter before you suggested it; it was quite obvious, but I didn’t realise it until afterwards! Being able to filter out RTs in the fetch process would be even more ideal, but more complex, I imagine.

      The set of 1.6m tweets you use as a sample: how were those accumulated? Was it by yourself or from another source? I ask as it’s a brilliant idea for the training portion of the model, but I’d like to be able to collate tweets related to more specific categories in the hope that it may help train the model to better interpret sentiment around specific subjects.

      Thanks again for this great piece of modelling!

    • AnalyzeCore

      The following is the link to the source:

  • Punitha C

    Hi Sergey,

    Thanks for the article, it’s very useful. I have one question. Doc2Vec is an unsupervised way of classifying data, but the method followed here is supervised. So, can we perform unsupervised classification of text data using text2vec? It would be great if you could share any article you have written on unsupervised classification. Thanks in advance.

    • AnalyzeCore

      Doc2Vec/Text2Vec/Paragraph2Vec/GloVe are methods for transforming data from strings to numerical vectors. These vectors can then be used, for instance, for classification, as in this blog post, or for clustering.