【WORD REPRESENTATION】TF-IDF
Short Introduction
TF-IDF (Term Frequency–Inverse Document Frequency) is a measure of the importance of a word to a document in a collection or corpus, adjusted for the fact that some words appear frequently in general. It consists of two parts, TF (Term Frequency) and IDF (Inverse Document Frequency), which capture two intuitions:
- If a word appears more frequently in a document, the word is more important to this document. We call this TF;
- If a word appears in many different documents, the word is more likely a common word, i.e., less important to any specific document. We call this DF (not IDF). For example, "the" occurs in almost every English document, so it says little about any particular one;
- IDF is the inverse of DF.
Finally, TF-IDF is the product of TF and IDF.
Details & Mathematical Expression
TF, the relative frequency of term $t$ within document $d$: \(\rm tf\it (t,d) = \displaystyle\frac{f_{t,d}}{\displaystyle\sum_{t'\in d}f_{t',d}}\) where $f_{t,d}$ is the number of times that term $t$ occurs in document $d$. There are various other ways to define TF (a small sketch of all of them follows the list):
- the raw count: $\rm tf\it(t,d)=f_{t,d}$;
- boolean frequencies: $\rm tf\it(t,d)=1$ if $t$ occurs in $d$ and $0$ otherwise;
- logarithmically scaled frequency: $\rm tf\it(t,d)=\log(1+f_{t,d})$;
- augmented frequency, to prevent a bias towards longer documents: $\rm tf\it(t,d)=0.5+0.5\cdot\displaystyle\frac{f_{t,d}}{\max\{f_{t',d}:t'\in d\}}$.
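A minimal sketch of these TF variants, assuming a toy document already tokenized into a list of words:
from collections import Counter
from math import log

doc = "the cat sat on the mat".split()  # hypothetical toy document
counts = Counter(doc)                   # f_{t,d} for every term t in d
total = sum(counts.values())            # total number of tokens in d

t = "the"
tf_relative  = counts[t] / total                  # relative frequency
tf_raw       = counts[t]                          # raw count
tf_boolean   = 1 if counts[t] > 0 else 0          # boolean frequency
tf_log       = log(1 + counts[t])                 # logarithmically scaled
tf_augmented = 0.5 + 0.5 * counts[t] / max(counts.values())  # augmented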
IDF measures whether a word is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word: \(\rm idf\it(t,D)=\log\displaystyle\frac{N}{n_t}\) where $N=|D|$ is the total number of documents in the corpus and $n_t=|\{d\in D:t\in d\}|$ is the number of documents in which the term $t$ appears (i.e., $\rm tf\it(t,d)\neq 0$). If the term is not in the corpus, this leads to a division by zero, so it is common to adjust the numerator to $1+N$ and the denominator to $1+n_t$. There are also various other definitions of IDF (sketched in code after the list):
- inverse document frequency smooth: $\log\left(\displaystyle\frac{N}{1+n_t}\right)+1$;
- inverse document frequency max: $\log\left(\displaystyle\frac{\displaystyle\max_{t'\in d}n_{t'}}{1+n_t}\right)$;
- probabilistic inverse document frequency: $\log\displaystyle\frac{N-n_t}{n_t}$.
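Likewise, a sketch of the IDF variants, where $N$, $n_t$, and the max document frequency are stand-in numbers chosen only for illustration:
from math import log

N = 10       # hypothetical corpus size |D|
n_t = 3      # hypothetical number of documents containing t
max_nt = 7   # hypothetical max document frequency over terms t' in d

idf_standard      = log(N / n_t)             # basic IDF
idf_smooth        = log(N / (1 + n_t)) + 1   # smoothed variant
idf_max           = log(max_nt / (1 + n_t))  # max variant
idf_probabilistic = log((N - n_t) / n_t)     # probabilistic variant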
TF-IDF is calculated as $\rm tfidf\it (t,d,D)=\rm tf\it(t,d)\cdot\rm idf\it(t,D)$. Since the ratio inside the IDF's log function is always greater than or equal to 1, the value of IDF (and thus TF-IDF) is greater than or equal to 0.
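As a concrete (hypothetical) example: suppose a 100-word document contains the term cat 3 times, and cat appears in 10 of the 1,000 documents in the corpus. Then $\rm tf\it(cat,d)=3/100=0.03$ and $\rm idf\it(cat,D)=\log(1000/10)=\log 100\approx 4.61$ (natural log), so $\rm tfidf\it(cat,d,D)\approx 0.03\times 4.61\approx 0.14$.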
Implementation
Python
import re
from math import log
from pathlib import Path
from collections import Counter


def split_word(data):
    # consider English text only; for Chinese, only this tokenization
    # step would differ
    data = re.sub(r"\s+", " ", data.lower())
    data = "".join(re.findall(r"[a-z ]", data))
    return data.strip().split(" ")


def tf_idf(path):
    """
    params: path: the folder where the documents are;
    return: a list of documents, each represented by a dict containing
        the file name of the document and the TF, IDF, and TF-IDF of
        each word.
    """
    p = Path(path)
    documents = list(p.rglob("*.txt"))
    return_list = [{"File Name": document, "words": {}} for document in documents]
    word_appearance = Counter()  # n_t: number of documents containing each word
    # first pass: term frequencies per document and document frequencies
    for idx, document in enumerate(documents):
        with open(document, "r", encoding="utf-8") as r:
            data = r.read()
        word_list = split_word(data)
        n = len(word_list)
        for word, num in Counter(word_list).items():
            word_appearance[word] += 1
            return_list[idx]["words"][word] = {"TF": num / n}
    # second pass: IDF (log base 2 here; any base works) and TF-IDF
    for idx, document in enumerate(documents):
        for word, info in return_list[idx]["words"].items():
            idf = log(len(documents) / word_appearance[word], 2)
            info["IDF"] = idf
            info["TF-IDF"] = info["TF"] * idf
    return return_list
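A minimal usage sketch, assuming a folder ./docs of .txt files (the path is hypothetical):
results = tf_idf("./docs")
for doc in results:
    print(doc["File Name"])
    # the five highest-weighted words of this document
    top = sorted(doc["words"].items(), key=lambda kv: kv[1]["TF-IDF"], reverse=True)[:5]
    for word, info in top:
        print(f"  {word}: TF-IDF = {info['TF-IDF']:.4f}")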
Merit
It is easy to implement. It makes it easier to search for relevant documents or information, and it can improve the accuracy of text classification.
Demerit
- TF-IDF assumes that each term is independent of other terms, which may not be true in some cases;
- It doesn't take into account the context in which the terms are used, which can lead to inaccurate weights assigned to certain terms;
- It can be affected by the length of the document or corpus, as longer documents may have more frequent terms even if they are not important.