Sunday, 15 November 2015

Mini Workshop: Simple Classifier for Determining The Sentiment Orientation of A Review Sentence

Last week, the Faculty of Computer Science, Universitas Indonesia (UI), successfully held a mini workshop on big data analytics as part of a joint collaboration between UI and Deakin University. One of the topics addressed at the workshop was sentiment analysis on social media. I was the speaker for the hands-on session.

I delivered a tutorial on how to build a simple classifier that automatically determines the sentiment orientation of a review sentence; the sentiment orientation of an opinionated sentence can be positive or negative. This tutorial should be useful for anyone who wants to work through a simple text classification problem, namely sentiment orientation classification.

If you are interested in this tutorial, you can download the instructions as well as the dataset and resources used in this tutorial:

  1. Download the tutorial instructions
  2. Download the dataset
This tutorial requires a few preparation steps (a quick sanity check is sketched right after this list):
  1. Make sure you have Python installed on your PC, since we are going to do some fun programming in Python.
  2. After Python is installed successfully, install the NLTK package. If you installed Python through the Anaconda Python Distribution, you can skip this step, since Anaconda already bundles more than 300 popular Python packages (including NLTK).
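If you want to make sure everything is wired up before the workshop material, here is a minimal sketch we wrote for this post (it is not part of the official tutorial instructions): it trains NLTK's built-in Naïve Bayes classifier on a few made-up review sentences with simple bag-of-words features, then classifies a new sentence. The example sentences and the to_features helper are ours, purely for illustration.

-code-
 import nltk

 # toy labelled data; the real tutorial uses the downloadable dataset instead
 train_sents = [("the movie was great and fun", "positive"),
                ("i really love this phone", "positive"),
                ("the service was terrible", "negative"),
                ("i hate the battery life", "negative")]

 def to_features(sentence):
     # simple bag-of-words features: each lowercased token maps to True
     return {word: True for word in sentence.lower().split()}

 train_set = [(to_features(s), label) for (s, label) in train_sents]
 classifier = nltk.NaiveBayesClassifier.train(train_set)

 # classify a new review sentence; with this toy data it should come out 'positive'
 print(classifier.classify(to_features("i love the great battery")))

If this script runs and prints a label, your Python and NLTK installation is ready for the hands-on session.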




--
Written in Bekasi, Indonesia
3 Safar 1437 H (15 November 2015)

Monday, 27 July 2015

PyMC Tutorial #3: Latent Dirichlet Allocation (LDA) using PyMC

Before you read this post, we suggest you read our previous post on the Naïve Bayes (NB) topic model, since the code presented here is just a modification of the code in that post. Several terms used here are also explained there.

Previously, we explained that the Naïve Bayes (NB) model assumes that each document in the collection is drawn from a single topic. Unlike Naïve Bayes, Latent Dirichlet Allocation (LDA) assumes that a single document is a mixture of several topics [1][2]. Moreover, we will use the smoothed version of LDA, which is described in the original paper by Blei et al. [1]. The smoothed version of LDA was proposed to tackle the sparsity problem in the collection: a document might contain words that do not appear in any other document, so that document would be assigned zero probability and could never be generated by the model. Smoothed LDA overcomes this problem [1].


The “generative aspect” of the LDA model

We just need to modify the NB model from the previous post a little bit!

Suppose \(D\) is the number of documents in our collection, \(N_d\) is the number of words in the \(d\)-th document, and \(K\) is the number of predefined topics. In this case, the number of topics is not inferred automatically; instead, we set the value of \(K\) manually, based on our intuition.

For each topic \(k\), we draw its word distribution, which is denoted as \(\phi_k\). Since we are going to model the word distribution using multinomial/categorical distribution, we then use the Dirichlet distribution (its conjugate prior) to model its parameter.

\[\phi_k \sim Dir(\beta), 1 \leq k \leq K\]

After that, for each document \(d\), we draw a topic distribution, which is denoted as \(\theta_d\)

\[\theta_d \sim Dir(\alpha), 1 \leq d \leq D\]

Note that the preceding notation differs from the NB model, since here a document may contain more than one topic (with their corresponding proportions)!

The hyperparameters \(\alpha\) and \(\beta\) are assumed to be fixed in our case.

Then, for each word \(n\) in document \(d\), we draw a topic for that word. This topic is denoted as \(z_{d,n}\)

\[z_{d,n} \sim Categorical(\theta_d), 1 \leq d \leq D, 1 \leq n \leq N_d\]

In the NB model, we have seen that a single topic is associated with an entire document. LDA, on the other hand, associates a topic with each individual word. Hence, drawing a topic must be performed at the word level, not the document level.

Finally, after we know the topic of a word (\(z_{d,n}\)), we draw the word itself from the word distribution associated with that topic. Each word is denoted as \(w_{d,n}\).

\[w_{d,n} \sim Categorical(\phi_{z_{d,n}}),  1 \leq d \leq D, 1 \leq n \leq N_d\]
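Before we move to PyMC, it may help to see this generative story in plain NumPy. The following is only a small sketch we wrote for illustration; the sizes, word ids, and random seed are arbitrary and unrelated to the dataset used later.

-code-
 import numpy as np

 rng = np.random.default_rng(0)

 K, V, D = 2, 5, 3                     # toy numbers of topics, vocabulary size, documents
 alpha, beta = np.ones(K), np.ones(V)  # symmetric Dirichlet hyperparameters
 Nd = [6, 4, 5]                        # words per document

 phi = rng.dirichlet(beta, size=K)     # phi_k   : one word distribution per topic
 theta = rng.dirichlet(alpha, size=D)  # theta_d : one topic distribution per document

 docs = []
 for d in range(D):
     z = rng.choice(K, size=Nd[d], p=theta[d])              # z_{d,n}: a topic for every word
     w = [int(rng.choice(V, p=phi[z_n])) for z_n in z]      # w_{d,n}: word drawn from phi_{z_{d,n}}
     docs.append(w)

 print(docs)  # each document is a list of word ids, possibly mixing several topics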

The following figure shows the plate notation of the smoothed LDA model:




Now, it is time to build the code using PyMC. We just need to modify the previous code :)

-code1-
 import numpy as np  
 import pymc as pc  
   
   
 def wordDict(collection):
     # map each distinct word in the collection to a unique integer id
     word_id = {}
     idCounter = 0
     for d in collection:
         for w in d:
             if w not in word_id:
                 word_id[w] = idCounter
                 idCounter += 1
     return word_id

 def toNpArray(word_id, collection):
     # represent each document as a list of word ids instead of a list of words
     ds = []
     for d in collection:
         ws = []
         for w in d:
             ws.append(word_id.get(w, 0))
         ds.append(ws)
     return np.array(ds)
   
 ###################################################  
   
 #doc1, doc2, ..., doc7  
 docs = [["sepak","bola","sepak","bola","bola","bola","sepak"],  
         ["uang","ekonomi","uang","uang","uang","ekonomi","ekonomi"],  
         ["sepak","bola","sepak","bola","sepak","sepak"],  
         ["ekonomi","ekonomi","uang","uang"],  
         ["sepak","uang","ekonomi"],  
         ["komputer","komputer","teknologi","teknologi","komputer","teknologi"],  
         ["teknologi","komputer","teknologi"]]  
   
 word_dict = wordDict(docs)  
 collection = toNpArray(word_dict,docs)  
   
 #number of topics  
 K = 3  
   
 #number of words (vocab)  
 V = len(word_dict)  
   
 #number of documents  
 D = len(collection)  
   
 #array([1, 1, 1, ..., 1]) K times  
 alpha = np.ones(K)  
   
 #array([1, 1, 1, ..., 1]) V times  
 beta = np.ones(V)  
   
 #array containing the information about doc length in our collection
 Nd = [len(doc) for doc in collection]  
   
   
 ######################## LDA model ##################################  
   
 #topic distribution per-document  
 theta = pc.Container([pc.CompletedDirichlet("theta_%s" % i,   
                                             pc.Dirichlet("ptheta_%s"%i, theta=alpha))  
                      for i in range(D)])  
   
 #word distribution per-topic  
 phi = pc.Container([pc.CompletedDirichlet("phi_%s" % j,   
                                           pc.Dirichlet("pphi_%s" % j, theta=beta))  
                     for j in range(K)])  
   
   
 #Please note that this is the tricky part :)
 #z_d is now a vector of length Nd[d]: entry i holds the topic assigned to the i-th word of document d
 z = pc.Container([pc.Categorical("z_%i" % d,
                                  p = theta[d],
                                  size = Nd[d],
                                  value = np.random.randint(K, size=Nd[d]))
                   for d in range(D)])
   
 #each observed word w_{d,i} is drawn from phi[z_{d,i}];
 #the Lambda deterministic picks the row of phi that matches the word's topic
 w = pc.Container([pc.Categorical("w_%i_%i" % (d,i),  
                                  p = pc.Lambda("phi_z_%i_%i" % (d,i),  
                                                lambda z=z[d][i], phi=phi : phi[z]),
                                  value=collection[d][i],  
                                  observed=True)  
                   for d in range(D) for i in range(Nd[d])])  
   
 ####################################################################  
   
 model = pc.Model([theta, phi, z, w])  
 mcmc = pc.MCMC(model)  
 mcmc.sample(iter=5000, burn=1000)  
   
   
 #show the topic assignment for each word, using the last retained sample (index 3999 = iter - burn - 1)
 for d in range(D):  
    print(mcmc.trace('z_%i'%d)[3999])  
   

On our computer, the output of the preceding code is as follows:

-code2-
 [2 2 2 2 2 2 2]  
 [1 1 1 0 0 1 1]  
 [2 2 2 2 2 2]  
 [1 0 0 0]  
 [1 0 1]  
 [1 2 0 1 1 1]  
 [1 1 1]  
   

What does this mean? Let’s compare the preceding results with our document collection:

-code3-
 docs = [["sepak","bola","sepak","bola","bola","bola","sepak"],  
         ["uang","ekonomi","uang","uang","uang","ekonomi","ekonomi"],  
         ["sepak","bola","sepak","bola","sepak","sepak"],  
         ["ekonomi","ekonomi","uang","uang"],  
         ["sepak","uang","ekonomi"],  
         ["komputer","komputer","teknologi","teknologi","komputer","teknologi"],  
         ["teknologi","komputer","teknologi"]]  
   

Remember that we set \(K = 3\). This value may come from our intuition, or our prior knowledge and experience.

We can see that the word “sepak” in doc #1 is associated with topic 2, the word “ekonomi” in doc #2 with topic 1, and the word “uang” in doc #5 with topic 0. Based on these results, we can say that topic 2 is talking about sport, while topic 1 is somehow talking about economics; we cannot be really certain what topic 0 tells us.

We can also read the results as follows: doc #1 consists of only topic 2. Doc #2 is a mixture of topic 1 and topic 0. Doc #6 is generated from 3 topics.
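If you want to see those mixtures more directly, you could also inspect the sampled per-document topic distributions \(\theta_d\). Here is a small sketch, assuming the variable names from the code above and that the 'theta_%i' deterministics were traced during sampling:

-code-
 # average each sampled theta_d over the retained traces (iter - burn = 4000 samples)
 for d in range(D):
     theta_samples = mcmc.trace('theta_%i' % d)[:]   # all retained samples of theta_d
     print("doc #%d topic proportions:" % (d + 1), theta_samples.mean(axis=0).ravel())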

We really cannot judge whether or not this LDA model works well here. As we said before, it would be interesting to try this on a real-world dataset to see how it behaves :)



Main References:
[1] Latent Dirichlet Allocation, David M. Blei, Andrew Ng, Michael I. Jordan. Journal of Machine Learning Research, 2003.
[2] Integrating Out Multinomial Parameters in Latent Dirichlet Allocation and Naïve Bayes for Collapsed Gibbs Sampling, Bob Carpenter, LingPipe Inc, 2010.



Alfan Farizki Wicaksono
(firstname [at] cs [dot] ui [dot] ac [dot] id)
Fakultas Ilmu Komputer, UI
Written in Depok, 27 July 2015







PyMC Tutorial #2: Estimating the Parameters of A Naïve Bayes (NB) Model


The NB model can be considered one of the topic models (like Latent Dirichlet Allocation (LDA)), in which each word in a document is generated from a topic-specific multinomial/categorical distribution [1]. In addition, NB assumes that each document is drawn from a single topic (NOT a mixture of topics) [1].


The “generative aspect” of the NB model

Suppose \(D\) is the number of documents in our collection, \(N_d\) is the number of words in the \(d\)-th document, and \(K\) is the number of predefined topics. In this case, the number of topics is not inferred automatically; instead, we set the value of \(K\) manually, based on our intuition.

For each topic \(k\), we draw its word distribution, which is denoted as \(\phi_k\). Since we are going to model the word distribution using multinomial/categorical distribution, we then use the Dirichlet distribution (its conjugate prior) to model its parameter.

\[\phi_k \sim Dir(\beta),   1 \leq k \leq K\]

After that, we draw a global (collection) topic distribution, which is denoted as \(\theta\).

\[\theta \sim Dir(\alpha)\]

The hyperparameters \(\alpha\) and \(\beta\) are assumed to be fixed in our case.

Then, for each document \(d\), we draw a topic describing that document. This topic is denoted as \(z_d\).

\[z_d \sim Categorical(\theta),   1 \leq d \leq D\]

Here, instead of modeling \(z_d\) as a multinomial random variable, we model it with a Categorical distribution. There is no particular reason for this choice, except that it will make our code easier to write later.

Finally, after we know the topic of a document, we draw its words from the word distribution associated with that topic. Each word is denoted as \(w_{d,n}\).

\[w_{d,n} \sim Categorical(\phi_{z_d}),  1 \leq d \leq D, 1 \leq n \leq N_d\]
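To make this generative story concrete, here is a minimal NumPy sketch we wrote for illustration (toy sizes and an arbitrary seed; it is not the PyMC model that follows). The key point is that a single topic is drawn per document, and every word of that document comes from the same \(\phi_{z_d}\).

-code-
 import numpy as np

 rng = np.random.default_rng(0)

 K, V, D = 2, 5, 3                     # toy numbers of topics, vocabulary size, documents
 alpha, beta = np.ones(K), np.ones(V)  # symmetric Dirichlet hyperparameters
 Nd = [6, 4, 5]                        # words per document

 phi = rng.dirichlet(beta, size=K)     # phi_k : one word distribution per topic
 theta = rng.dirichlet(alpha)          # theta : a single, collection-level topic distribution

 docs = []
 for d in range(D):
     z_d = int(rng.choice(K, p=theta))                     # one topic for the whole document
     w = [int(x) for x in rng.choice(V, size=Nd[d], p=phi[z_d])]  # all words drawn from phi_{z_d}
     docs.append(w)

 print(docs)  # each document is a list of word ids generated from a single topic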

The following figure shows the plate notation of the NB model:




Estimating the marginal posterior distribution of \(\theta\)

Our goal is to estimate the posterior over the hidden variables (\(\theta\) and each document's topic \(z_d\)): we want to automatically infer the topic of each document, which is currently hidden. All we know is the data, that is, the documents and their words. In other words, our task is much like a clustering task: we aim to group together documents that share the same topic.

First, we import the needed library for this problem.

-code1-
 import numpy as np  
 import pymc as pc  
 import random as rn  

Suppose, for our toy implementation, our document collection is represented as follows.

-code2-
 #doc1, doc2, ..., doc7  
   
 docs = [["sepak","bola","sepak","bola","bola","bola","sepak"],  
         ["uang","ekonomi","uang","uang","uang","ekonomi","ekonomi"],  
         ["sepak","bola","sepak","bola","sepak","sepak"],  
         ["ekonomi","ekonomi","uang","uang"],  
         ["sepak","uang","ekonomi"],  
         ["komputer","komputer","teknologi","teknologi","komputer","teknologi"],  
         ["teknologi","komputer","teknologi"]]  

Then, we need to convert the collection into a form that will be easier to process later. In this case, we build the set of distinct words and associate each word with a unique integer (as an id). Instead of arrays of words, we will work with arrays of ids.

-code3-
 def wordDict(collection):
     # map each distinct word in the collection to a unique integer id
     word_id = {}
     idCounter = 0
     for d in collection:
         for w in d:
             if w not in word_id:
                 word_id[w] = idCounter
                 idCounter += 1
     return word_id

 def toNpArray(word_id, collection):
     # represent each document as a list of word ids instead of a list of words
     ds = []
     for d in collection:
         ws = []
         for w in d:
             ws.append(word_id.get(w, 0))
         ds.append(ws)
     return np.array(ds)

After that, we need to set several useful variables. Suppose we set the number of topics to 3.

-code4-
 word_dict = wordDict(docs)  
 collection = toNpArray(word_dict,docs)  
   
 #number of topics  
 K = 3  
   
 #number of words (vocab)  
 V = len(word_dict)  
   
 #number of documents  
 D = len(collection)  
   
 #array([1, 1, 1, ..., 1]) K times  
 alpha = np.ones(K)  
   
 #array([1, 1, 1, ..., 1]) V times  
 beta = np.ones(V)  
   
 #array containing the information about doc length in our collection  
 Nd = [len(doc) for doc in collection]  

Now, let us see how easy it is to create our own topic model using PyMC :) The following code shows how to represent the NB model in PyMC.

-code5-
 #naive bayes model  
   
 #word distribution for each topic  
 phi = pc.Container([pc.CompletedDirichlet("phi_%i" % k,   
                      pc.Dirichlet("pphi_%i" % k, theta=beta))  
           for k in range(K)])  
   
 #topic distribution in the collection  
 theta = pc.CompletedDirichlet("theta",   
                               pc.Dirichlet("ptheta", theta=alpha))  
   
 #for each document, draw a topic z_m  
 z = pc.Container([pc.Categorical("z_%i" % d,  
                                  p = theta,  
                                  value = rn.randint(0,K-1))  
                  for d in range(D)])  
   
 #for each document, draw words, based on topic z_m  
 w = pc.Container([pc.Categorical("w_%i_%i" % (d,i),  
                                  p = pc.Lambda("phi_z_%i_%i" % (d,i),  
                                                lambda z=z[d], phi=phi : phi[z]),  
                                  value=collection[d][i],  
                                  observed=True)  
                  for d in range(D) for i in range(Nd[d])])  
   
 model = pc.Model([theta, phi, z, w])  

The “Container” class is used when we need to wrap a set of identical random variables (such as the \(\phi_k\)'s) in a single reference. “Dirichlet” is the PyMC class for the Dirichlet distribution. Note that it only generates \(k-1\) values, since the \(k\)-th value is determined by the constraint that all probabilities sum to \(1\); the “CompletedDirichlet” class appends this \(k\)-th value at the end. “Lambda” is the PyMC class for representing a deterministic variable via a lambda expression. You can read more about all of these on the PyMC homepage.
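As a quick illustration of the “Dirichlet” vs. “CompletedDirichlet” point above, here is a tiny check we sometimes run in an interactive session (a sketch, assuming PyMC 2.x as used throughout this post; the variable names are arbitrary):

-code-
 import numpy as np
 import pymc as pc

 d = pc.Dirichlet("d_example", theta=np.ones(3))    # stores only the first k-1 = 2 free values
 cd = pc.CompletedDirichlet("cd_example", d)        # appends the k-th value so the vector sums to 1

 print(d.value)                    # the 2 free values
 print(cd.value, cd.value.sum())   # 3 probabilities that sum to 1 (up to floating point)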


Next, let’s perform MCMC to estimate the parameters of the NB model!

-code6-
 mcmc = pc.MCMC(model)  
 mcmc.sample(iter=5000, burn=1000)  

In the preceding code, we run 5,000 iterations (to generate samples and improve the quality of the marginal posterior distributions of the parameters), and we assume that the burn-in period is over after the first 1,000 iterations. The retained traces do not include the samples drawn before the end of the burn-in period.

Finally, let us see the estimation results! Remember once again that we used \(K = 3\), which means we are quite sure there are 3 topics in our collection, but we do not know which document has which topic!

-code7-
 #show z_i after sampling (random walks) has ended  
 #number of traces = iter - burn = 5000 - 1000 = 4000  
   
 for d in range(D):  
    print(mcmc.trace('z_%i'%d)[3999])  

On our computer, the results are as follows:

-code8-
 1  
 2  
 1  
 2  
 1  
 0  
 0  

Let us compare the results with our collection once again:

-code9-
 docs = [["sepak","bola","sepak","bola","bola","bola","sepak"],                  #doc1
         ["uang","ekonomi","uang","uang","uang","ekonomi","ekonomi"],            #doc2
         ["sepak","bola","sepak","bola","sepak","sepak"],                        #doc3
         ["ekonomi","ekonomi","uang","uang"],                                    #doc4
         ["sepak","uang","ekonomi"],                                             #doc5
         ["komputer","komputer","teknologi","teknologi","komputer","teknologi"], #doc6
         ["teknologi","komputer","teknologi"]]                                   #doc7

We can see that docs #1, #3, and #5 are assigned the same topic (topic 1), docs #2 and #4 share another topic (topic 2), and docs #6 and #7 share the remaining topic (topic 0). The algorithm does a reasonably good job of automatically clustering the documents by topic. However, this is just a toy implementation, so we need to be careful about reading too much into this result.
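You could also look at the estimated collection-level topic distribution \(\theta\) itself by averaging its retained traces. A small sketch, assuming the variable names used above and that the 'theta' deterministic was traced:

-code-
 # average the sampled global topic distribution over the 4000 retained traces
 theta_samples = mcmc.trace('theta')[:]
 print("estimated topic proportions in the collection:", theta_samples.mean(axis=0).ravel())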

It would be interesting to try this program on a real-world document collection :)


Main References:
[1] Integrating Out Multinomial Parameters in Latent Dirichlet Allocation and Naïve Bayes for Collapsed Gibbs Sampling, Bob Carpenter, LingPipe Inc, 2010.



Alfan Farizki Wicaksono
(firstname [at] cs [dot] ui [dot] ac [dot] id)
Fakultas Ilmu Komputer, UI
Written in Depok, 27 July 2015