Skip to main content
  1. Posts/

Notes on Information Retrieval

·3 mins·

Information Retrieval #

Some notes on Information Resource Management @THU, mainly about IR models.

Boolean Model #

The Boolean Model (also known as the Pure Boolean Model) is a set-theoretic model. The basic idea is to represent documents and queries as sets of terms. The Boolean operators are used to combine the terms in the query to produce the set of documents that satisfy the query.

Boolean Model has no ranking as its result of a query is only “YES” or “NO”.

Fuzzy Model #

The Fuzzy Model (also known as the Extended Boolean Model) is another type of set-theoretic model. It introduces membership values to represent the degree of association of a term within a document.

In the Fuzzy Model, the relationship between each term and document is no longer strictly “present” or “absent,” but is described using membership values. This implies that a term can appear in a document to varying degrees, capturing the similarity between documents and queries more accurately.

The Fuzzy Model holds an advantage in handling fuzzy queries, as it allows for partial matching and fuzzy matching. It is suitable for more complex queries and more realistic information retrieval needs. The membership values in the Fuzzy Model typically range from $0$ to $1$, reflecting the importance or relevance of a term within a document.

Specifically, for a document’s degree of membership in the set of terms $A$, denoted as $v(A)$, it follows that:

  • $v(A \cup B) = \max(v(A), v(B))$
  • $v(A \cap B) = \min(v(A), v(B))$
  • $v(\neg A) = 1 - v(A)$

In practice, the membership function can be defined using term weights, such as term frequency, TF-IDF, and others.

Vector Space Model #

The Vector Space Model (VSM) is an algebraic model. It represents documents and queries as vectors in a multi-dimensional space. The similarity between documents and queries is measured by the cosine of the angle between the vectors.

The weight of a term in a document is typically computed using binary weights, raw term frequency or TF-IDF.

Probabilistic Model #

The Probabilistic Model is a rigorous formalized model designed to predict the probability of documents being relevant to a given query. This model ranks the retrieved documents based on the probabilities associated with the relevance of documents to the query. This model relies on accurate probability estimates (which can be challenging to obtain). The probabilistic model is currently a dominant retrieval model.

Usually we use Naive Bayes to estimate the probability of relevance of a document to a query. Consider the Bayes’ Theorem: $$ P(R|D) = \frac{P(D|R)P(R)}{P(D)} $$ We can classify a document $D$ as relevant if $$ \frac{P(D|R)}{P(D|NR)} > \frac{P(NR)}{P(R)} $$ Left-hand side is the likelihood of a document being relevant given the query, and right-hand side is the prior probability of a document being relevant. Do some math, we can get $$ \begin{aligned} \frac{P(D|R)}{P(D|NR)}&=\prod_{i: d_i = 1}\frac{p_i}{s_i} \cdot \prod_{i: d_i = 0}\frac{1-s_i}{1-p_i}\\ &=\prod_{i: d_i = 1}\frac{p_i(1-s_i)}{s_i(1-p_i)} \cdot \prod_{i}\frac{1-p_i}{1-s_i} \end{aligned} $$ So we can use $\prod_{i: d_i = 1}\frac{p_i(1-s_i)}{s_i(1-p_i)}$ to estimate how likely a document is relevant to a query.

When it comes to calculating the score function, we have $$ \frac{p_i(1-s_i)}{s_i(1-p_i)} = \frac{\frac{pi}{1-pi}}{\frac{s_i}{1-s_i}} = \frac{\frac{r}{R-r}}{\frac{n-r}{(N-n) - (R-r)}} $$ This is Robertson-Sparck Jones weighting scheme and in practice a $0.5$ is added to the numerator and denominator to avoid division by zero.