The Vector Space Model for Information Retrieval (IR)

binary, tf, tfidf, stoplist, stemming...

Posted by Hao Xu on October 28, 2018

The Vector Space Model

The vector space model maps each document $d\in D$ to a normalized vector $\vec{v}(d)\in V$. Each dimension $i$ of $V$ corresponds to a unique word $w_i$ in the document collection. We treat a query $q\in Q$ as a special document, and apply the same map to obtain a vector $\vec{v}(q)\in V$. The similarity between a document $d$ and a query $q$ is calculated by the inner product:

\[(\vec{v}(d),\ \vec{v}(q))=\sum_i v_i (d)v_i(q)\]

Let’s fix some notation. The number of documents in $D$ is denoted by $|D|$. We use $tf_i$ to denote the number of times $w_i$ appears in $d$, and $df_i$ to denote the number of documents that contain $w_i$.

Here, we consider three different document-to-vector maps:

  1. binary: $\tilde{v}_i (d)= \begin{cases} 1, & w_i \text{ appears in } d \\ 0, & \text{otherwise} \end{cases}$
  2. tf: $\tilde{v}_i (d)= tf_i$
  3. tfidf: $\tilde{v}_i (d)= tf_i \cdot idf_i$, where $idf_i = \log \frac{|D|}{df_i}$ is called the inverse document frequency.
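The three maps above can be sketched as a single function. This is an illustrative sketch, not the code from my_retriever.py; the name `raw_weight` is hypothetical.

```python
import math

def raw_weight(tf, df, num_docs, scheme):
    """Unnormalized coordinate of one term under the three weighting schemes."""
    if scheme == "binary":
        return 1.0 if tf > 0 else 0.0
    if scheme == "tf":
        return float(tf)
    if scheme == "tfidf":
        # idf = log(|D| / df); a term occurring in every document gets weight 0
        return tf * math.log(num_docs / df)
    raise ValueError("unknown scheme: " + scheme)
```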

In all three cases the vector needs to be normalized: $v_i (d)=\tilde{v}_i(d)/\|\tilde{v}(d)\|$. There are two more pre-processing options in our model:

  1. -p: Ignore all the words in the ‘stop list’
  2. -s: Map different words with the same word stem to the same index
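These two options can be sketched as follows. The stop list and the suffix-stripping rule below are purely illustrative stand-ins; a real system would use a full stop list and a proper stemmer such as Porter's.

```python
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}  # toy stop list

def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer (e.g. Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(tokens, use_stoplist=True, use_stemming=True):
    """-p drops stop words; -s maps words with the same stem together."""
    if use_stoplist:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if use_stemming:
        tokens = [naive_stem(t) for t in tokens]
    return tokens
```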

Implementation Example

Overall, this example builds an IR system for finding the top 10 most relevant documents to a given query. Our standard test collection is from CACM (Communications of the Association for Computing Machinery).

Here is a detailed description of the implementation, which is written in the my_retriever.py file. Here is the link to the GitHub repository. There is also a report evaluating the performance of the vector-space-model-based IR system with different term manipulation strategies.

There are four methods and a constructor in the Retrieve class. An object of the Retrieve class is created from two parameters: a dictionary index, and a string termWeighting naming the term weighting scheme, which can be 'binary', 'tf', or 'tfidf'. The keys of index are the unique words in $D$, and the values are dictionaries mapping document indices to the term frequency of that word in the document. The keys of index are numbered from 0 and stored in self.words, so that words are referenced by int rather than by string.
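To make the data shapes concrete, a miniature index in the form described above might look like this (the three words and their postings are hypothetical, not from the real collection):

```python
# Hypothetical miniature index:
# word -> {document id -> term frequency of that word in the document}
index = {
    "retrieval": {1: 2, 3: 1},
    "vector":    {1: 1, 2: 3},
    "model":     {2: 1, 3: 2},
}

# Number the words from 0, mirroring self.words, so each word becomes
# an integer dimension rather than a string key
words = {w: i for i, w in enumerate(index)}
```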

When initializing the object, the total number of documents $|D|$ and the weighted coordinates of all document vectors are computed by computeNumOfDocument() and initTerm() respectively. Because of the structure of index, $|D|$ is the size of the set containing all keys of the nested dictionaries in index. To increase efficiency, the coordinates are stored in a two-dimensional float numpy array: each row in this array is a document vector, and the columns represent the dimensions.
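With the nested-dictionary shape described above, $|D|$ can be recovered with a set comprehension. The miniature index below is hypothetical:

```python
# Hypothetical miniature index: word -> {document id -> term frequency}
index = {
    "retrieval": {1: 2, 3: 1},
    "vector":    {1: 1, 2: 3},
    "model":     {2: 1, 3: 2},
}

# |D| is the number of distinct document ids across all nested dictionaries
doc_ids = {d for postings in index.values() for d in postings}
num_docs = len(doc_ids)  # 3 for this miniature index
```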

When computing the weighted coordinates, both the document array and the input query array follow the same steps. First, traverse the index (or the query) and locate each term in the array. Then, set the value at that location to 1 or to the term frequency, depending on whether the weighting scheme is binary or tf. Third, if the scheme is tfidf, compute the inverse document frequency (IDF) and then the TFIDF value in order. Finally, normalize the vector of each document so that its norm is 1. This means that an independent variable, the length of the documents, will not affect the similarity ranking. The query vector does not need to be normalized, however, because it is a constant factor across all comparisons.
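These steps can be sketched for the tfidf scheme as follows. This is a minimal sketch over a hypothetical miniature index, not the code from my_retriever.py:

```python
import numpy as np

# Hypothetical miniature index: word -> {document id -> term frequency}
index = {
    "retrieval": {1: 2, 3: 1},
    "vector":    {1: 1, 2: 3},
    "model":     {2: 1, 3: 2},
}
words = {w: i for i, w in enumerate(index)}                  # word -> column
doc_ids = sorted({d for postings in index.values() for d in postings})
rows = {d: r for r, d in enumerate(doc_ids)}                 # doc id -> row
num_docs = len(doc_ids)

# Fill the document-term matrix with tf * idf at each located term
mat = np.zeros((num_docs, len(words)))
for w, postings in index.items():
    idf = np.log(num_docs / len(postings))   # idf = log(|D| / df)
    for d, tf in postings.items():
        mat[rows[d], words[w]] = tf * idf

# Normalize each row (document vector) to unit length
norms = np.linalg.norm(mat, axis=1, keepdims=True)
mat = mat / np.where(norms == 0, 1.0, norms)
```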

The similarity between the documents and the input query is the inner product over the dimensions where both the document vector and the query vector are non-zero, because only non-zero weighted elements can contribute to the similarity score. Performing this as a single matrix operation saves a lot of time. Finally, the np.argpartition() function is used to obtain a list of the indices of the top-ten documents, unranked.
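The scoring and top-ten selection can be sketched as follows; random data stands in here for the real document matrix and query vector:

```python
import numpy as np

rng = np.random.default_rng(0)
mat = rng.random((50, 20))                         # 50 document vectors
mat /= np.linalg.norm(mat, axis=1, keepdims=True)  # normalized rows
query = rng.random(20)                             # unnormalized query vector

scores = mat @ query   # all document-query inner products in one matrix operation

k = 10
# argpartition places the k largest scores in the last k positions, unordered;
# this is cheaper than a full sort when only the top-k set is needed
top10 = np.argpartition(scores, -k)[-k:]
```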