TF-IDF Model
Published: 03/08/2025

TF-IDF: Understanding and Implementing Term Frequency-Inverse Document Frequency
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document within a collection or corpus. It is widely used as a weighting factor in information retrieval and text mining. This article walks through the core concepts of TF-IDF and provides a practical Python implementation.
Fundamental Concepts / Prerequisites
Before diving into the implementation, it's crucial to understand the following:
- Term Frequency (TF): Measures how frequently a term appears in a document. A common formula is: TF(t,d) = (Number of times term t appears in document d) / (Total number of terms in document d).
- Inverse Document Frequency (IDF): Measures how informative a term is. TF alone treats every term as equally important, but extremely common terms such as "is", "of", and "the" appear so often that they would overshadow the genuinely meaningful ones; IDF penalizes these common terms. A common formula is: IDF(t,D) = log(Total number of documents in corpus D / Number of documents containing term t). Adding 1 to the denominator is a common smoothing step: it guards against division by zero for terms that appear in no document (for example, a query term absent from the corpus) and damps the weights on small datasets, though it also makes the score slightly negative for terms that appear in every document (see the worked example after this list).
- Stop words: These are extremely common words (like "a", "an", "the", "is") that carry little meaning and are often removed from text before processing.
- Tokenization: The process of breaking down text into individual words or tokens.
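As a quick worked check of the IDF formula: in the four-document sample corpus used below, the word "first" appears in 2 of the 4 documents, so its unsmoothed IDF is log(4/2) ≈ 0.693, whereas "the" appears in all 4 documents and gets log(4/4) = 0, confirming that ubiquitous words carry no discriminative weight.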
Core Implementation
Here's a pure-Python implementation of the TF-IDF calculation, using only the standard library (`re` for tokenization, `collections.Counter` for counting, and `math` for the logarithm).
```python
import math
import re
from collections import Counter


def tf(document):
    """
    Calculates the term frequency (TF) for each term in a document.

    Args:
        document: A string representing the document.

    Returns:
        A dictionary where keys are terms and values are their TF scores.
    """
    words = re.findall(r'\w+', document.lower())  # Tokenize and lowercase
    word_counts = Counter(words)
    total_words = len(words)
    tf_scores = {}
    for word, count in word_counts.items():
        tf_scores[word] = count / total_words
    return tf_scores


def idf(documents):
    """
    Calculates the inverse document frequency (IDF) for each term in a corpus.

    Args:
        documents: A list of strings, where each string is a document.

    Returns:
        A dictionary where keys are terms and values are their IDF scores.
    """
    num_documents = len(documents)
    term_document_frequency = {}  # Count of documents containing the term
    for document in documents:
        words = set(re.findall(r'\w+', document.lower()))  # Unique words in the document
        for word in words:
            term_document_frequency[word] = term_document_frequency.get(word, 0) + 1
    idf_scores = {}
    for word, doc_count in term_document_frequency.items():
        idf_scores[word] = math.log(num_documents / (doc_count + 1))  # Add 1 for smoothing
    return idf_scores


def tf_idf(document, documents):
    """
    Calculates the TF-IDF score for each term in a document.

    Args:
        document: A string representing the document.
        documents: A list of strings, where each string is a document in the corpus.

    Returns:
        A dictionary where keys are terms and values are their TF-IDF scores.
    """
    tf_scores = tf(document)
    idf_scores = idf(documents)
    tf_idf_scores = {}
    for word, tf_score in tf_scores.items():
        idf_score = idf_scores.get(word, 0)  # Handle words not in the IDF vocabulary
        tf_idf_scores[word] = tf_score * idf_score
    return tf_idf_scores


# Example usage:
documents = [
    "This is the first document.",
    "This is the second second document.",
    "And this is the third one.",
    "Is this the first document?"
]

document = documents[0]  # Analyze the first document
tf_idf_scores = tf_idf(document, documents)

print(f"TF-IDF scores for document: '{document}'")
for word, score in sorted(tf_idf_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{word}: {score}")
```
Code Explanation
`tf(document)`: This function calculates the Term Frequency (TF) for each word in a given document. It first tokenizes the document by converting it to lowercase and using regular expressions to extract words. Then, it counts the occurrences of each word. Finally, it calculates the TF score by dividing the count of each word by the total number of words in the document.
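For instance, calling `tf` on the first sample document from the corpus defined above returns 0.2 for every token, since each of its five words appears exactly once:

```python
print(tf("This is the first document."))
# {'this': 0.2, 'is': 0.2, 'the': 0.2, 'first': 0.2, 'document': 0.2}
```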
`idf(documents)`: This function calculates the Inverse Document Frequency (IDF) for each term in a corpus of documents. It iterates through the documents, counting how many documents contain each unique word (note the `set(...)`, so repeated occurrences within one document count only once). It then computes each word's IDF as log(Total number of documents / (Number of documents containing the word + 1)). The + 1 in the denominator is a smoothing term: division by zero cannot actually occur for words collected from the corpus itself, since each of them appears in at least one document, but the smoothing keeps the formula safe if it is ever applied to unseen terms and damps the weights on smaller datasets. Its side effect is that words present in every document receive a slightly negative IDF, as the example below shows.
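With the four-document sample corpus from the example usage, this smoothed formula gives, for instance, log(4/2) ≈ 0.693 for "second" (which appears in only one document) and log(4/5) ≈ -0.223 for "the" (which appears in all four):

```python
idf_scores = idf(documents)
print(round(idf_scores['second'], 3))  # 0.693 -> appears in 1 of 4 documents
print(round(idf_scores['the'], 3))     # -0.223 -> appears in all 4 documents
```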
`tf_idf(document, documents)`: This function calculates the TF-IDF score for each term in a document, given the document and the corpus of documents. It first calls the `tf()` function to get the TF scores for the document and the `idf()` function to get the IDF scores for the corpus. Then, it iterates through the TF scores and multiplies each TF score by the corresponding IDF score. It handles cases where a word might be present in the document but not in the IDF vocabulary (meaning it wasn't seen in any other documents in the corpus) by assigning a default IDF score of 0 in such instances.
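With this corpus, the example usage at the end of the implementation prints output along these lines (scores rounded here for readability): "first" is the only positively weighted term, since it appears in just two of the four documents; "document" lands exactly at zero; and the ubiquitous words come out slightly negative due to the smoothed IDF:

```
TF-IDF scores for document: 'This is the first document.'
first: 0.0575
document: 0.0
this: -0.0446
is: -0.0446
the: -0.0446
```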
Complexity Analysis
Time Complexity:
- `tf(document)`: O(N), where N is the number of words in the document. The regular expression `findall` and counting using the `Counter` are linear operations.
- `idf(documents)`: O(M * K), where M is the number of documents in the corpus and K is the average number of words per document. For each of the M documents, the `set(re.findall(...))` call and the inner loop process on the order of K words.
- `tf_idf(document, documents)`: O(N + M * K). TF is O(N) and IDF is O(M * K). Iterating and multiplying the scores is O(N). The dominant factor is typically the IDF calculation due to iterating through the entire corpus.
Space Complexity:
- `tf(document)`: O(V), where V is the number of unique words in the document. We store the word counts in a dictionary.
- `idf(documents)`: O(U), where U is the number of unique words in the entire corpus. We store the document frequency and the IDF scores in dictionaries.
- `tf_idf(document, documents)`: O(V), where V is the number of unique words in the document. This space is dominated by storing the tf-idf scores for that single document.
Alternative Approaches
Scikit-learn's `TfidfVectorizer` provides an optimized and feature-rich implementation of TF-IDF. It can be much faster on larger datasets, as it leverages sparse data structures and optimized algorithms, and it handles tokenization, stop word removal, and other preprocessing steps for you. The trade-offs are an extra external dependency and less control over the exact implementation details.
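As a minimal sketch of that route (assuming scikit-learn is installed; note that `TfidfVectorizer` uses a differently smoothed IDF by default, log((1 + N) / (1 + df)) + 1, and L2-normalizes each document vector, so its scores will not match the hand-rolled implementation above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# stop_words='english' drops common words such as "is" and "the"
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)    # sparse matrix: one row per document
print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(X.toarray()[0])                      # dense TF-IDF vector of the first document
```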
Conclusion
TF-IDF is a powerful technique for weighting terms in documents, allowing us to identify the most important words in a corpus. This article provided a fundamental understanding of TF-IDF and a practical Python implementation. Remember to consider using optimized libraries like Scikit-learn for large-scale applications and to adapt the implementation to your specific needs, such as incorporating stop word removal or stemming.