23 Jan 2025

Automating Related Blog Posts with Django, NLTK, and TF-IDF

This article explains how you can automatically relate posts using TF-IDF and NLTK in Django.

One of the best ways to keep your audience engaged and on your site is to show them relevant content—such as related blog posts—right when they finish reading something they like. In this blog post, we’ll walk through a Django-based implementation that uses Natural Language Processing (NLP) and machine learning libraries to automatically figure out which blog posts are closely related to each other.

The code samples use blog posts published with the Wagtail CMS, but the approach is easily customisable for any text field.

Overview

Here is a quick outline of the approach:

  1. Load and Pre-process Content: We fetch all published blog posts, convert their rich text fields into plain text, and store them for further processing.
  2. Compute Text Similarities: We use TF-IDF (Term Frequency–Inverse Document Frequency) vectors and a cosine similarity matrix to measure how similar each blog post is to every other post.
  3. Update Related Posts: Finally, for each blog post, we select the top few posts with the highest similarity scores and link them in our database as related content.

Let’s look at the code step by step.


NLTK Downloads

We need to ensure certain NLTK resources are available for text processing—most commonly, tokenisers, stopwords, and lemmatisers. We handle this in a small helper function:

import nltk

def ensure_nltk_downloads():
    """Ensure required NLTK data is downloaded"""
    downloads = ["punkt", "stopwords", "wordnet", "punkt_tab"]
    for item in downloads:
        nltk.download(item)

  • punkt: Tokeniser for sentences and words.
  • stopwords: A list of common words (like “the”, “an”, etc.) that we usually filter out.
  • wordnet: Used for lemmatisation, giving us more uniform handling of words (e.g., “runs”, “running”, and “ran” all map to “run”).
  • punkt_tab: A table that helps punkt handle certain abbreviations and punctuation rules.

Fetching and Processing Blog Post Content

The next step is reading all published posts from the database. We focus on the body of each post, converting any rich text blocks to plain text.

from web.models import BlogPage, BlogPageRelatedPosts

def process_posts_content(progress_callback=None, specific_post_id=None):
    """
    Process blog posts content for similarity analysis.
    Can process all posts or focus on a specific post and its potential relations.
    """
    # Get all published posts
    posts = BlogPage.objects.live().specific()

    if specific_post_id:
        # We might only want to update related posts for a single post,
        # but we still fetch all for comparison
        target_post = BlogPage.objects.get(id=specific_post_id)
        if not target_post.live:
            return [], [], []

    if not posts:
        return [], [], []

    contents = []
    post_ids = []
    posts_list = []

    for post in posts:
        # Skip posts with no body content
        if not post.body:
            continue

        # Convert StreamField content to plain text
        content_parts = []
        for block in post.body:
            # Handle RichText blocks by reading their source
            if hasattr(block, "value"):
                if hasattr(block.value, "source"):
                    content_parts.append(str(block.value.source))
                else:
                    content_parts.append(str(block.value))

        content = " ".join(content_parts)

        if content.strip():
            contents.append(content)
            post_ids.append(post.id)
            posts_list.append(post)

            if progress_callback:
                progress_callback(f"Processed post: {post.title}")

    return contents, post_ids, posts_list

Key points:

  • BlogPage.objects.live().specific() uses Wagtail's page manager methods: live() filters to published pages, and specific() returns each result as its most specific page model so custom fields like body are available.
  • For each post, we extract text from body blocks. If any block contains a RichText object, we read out its .source attribute to get the raw HTML or text representation.
  • We build a contents list (all post bodies), a post_ids list, and a posts_list of actual BlogPage objects.
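One caveat: block.value.source returns raw HTML, so tag names and attributes end up in the TF-IDF vocabulary as noise. A crude tag stripper (Django's django.utils.html.strip_tags would also work) could be applied before appending to content_parts — this helper is a sketch, not part of the code above:

```python
import re

def html_to_text(source):
    """Crudely strip HTML tags and collapse whitespace.

    Good enough for similarity scoring; use strip_tags or
    BeautifulSoup if you need robust parsing.
    """
    text = re.sub(r"<[^>]+>", " ", source)
    return re.sub(r"\s+", " ", text).strip()
```

For example, html_to_text("&lt;p&gt;Hello &lt;strong&gt;world&lt;/strong&gt;&lt;/p&gt;") returns "Hello world", so only the words readers actually see feed into the vectoriser.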

Computing Similarities with TF-IDF and Cosine Similarity

We use TfidfVectorizer from scikit-learn to transform our text into numerical vectors, then compute their similarity using cosine_similarity.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def compute_similarities(contents):
    """Compute TF-IDF and similarity matrix"""
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(contents)
    return cosine_similarity(tfidf_matrix)

  • TfidfVectorizer: Creates a vector representation for each document (blog post) where the value of each word is weighted by how important it is across all documents.
  • cosine_similarity: Calculates pairwise similarity scores. Results in a matrix (2D array) where each row and column corresponds to a post, and the value is how similar those posts are.
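To build intuition for what compute_similarities returns, here is the same calculation done by hand on toy term-count vectors (the vocabulary and counts are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vocabulary: ["django", "wagtail", "gardening"]
post_a = [3, 2, 0]  # a Django/Wagtail post
post_b = [2, 1, 0]  # another Django/Wagtail post
post_c = [0, 0, 5]  # an unrelated post

print(cosine(post_a, post_b))  # close to 1.0
print(cosine(post_a, post_c))  # 0.0 (no shared terms)
```

The real matrix is just this computation done for every pair of TF-IDF vectors: identical posts score 1.0, posts with no vocabulary overlap score 0.0.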

Once we have our similarity matrix, we loop through each post and grab the top results. We then create or update related post objects in the database.

def update_related_posts(
    cosine_sim_matrix, post_ids, posts, num_relations=6, progress_callback=None
):
    total_posts = len(posts)

    for idx, post in enumerate(posts):
        if progress_callback:
            progress_callback(
                f'Updating related posts for {idx + 1}/{total_posts}: "{post.title}"'
            )

        # Similarity scores for the current post
        sim_scores = list(enumerate(cosine_sim_matrix[idx]))
        # Exclude the same post from its own related list
        sim_scores = [score for score in sim_scores if score[0] != idx]
        # Sort by highest similarity first
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
        top_indexes = [i[0] for i in sim_scores[:num_relations]]
        related_post_ids = [post_ids[i] for i in top_indexes]
        # filter(id__in=...) does not guarantee ordering, so map ids back
        # to objects and iterate in similarity order
        related_by_id = {
            p.id: p for p in BlogPage.objects.filter(id__in=related_post_ids)
        }

        # Remove existing related posts
        post.related_posts.all().delete()

        # Create new related post entries, most similar first
        for order, related_id in enumerate(related_post_ids):
            BlogPageRelatedPosts.objects.create(
                page=post, post=related_by_id[related_id], sort_order=order
            )

    return total_posts

  • num_relations=6: We limit to the top six related posts, but you can adjust it as needed.
  • We first enumerate similarities, remove the current post from its own list, then sort by similarity in descending order.
  • We query the actual BlogPage objects with id__in=related_post_ids and update the BlogPageRelatedPosts model accordingly.
  • Before we insert new data, we delete existing related post relationships to avoid duplicates.
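The enumerate/filter/sort/slice dance is easy to check on a single row of a toy similarity matrix (the values are invented):

```python
def top_related(sim_row, idx, num_relations=2):
    """Indices of the most similar posts, excluding the post itself."""
    scores = [(j, s) for j, s in enumerate(sim_row) if j != idx]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [j for j, _ in scores[:num_relations]]

# Row 0 of a toy similarity matrix: post 0 compared to posts 0..3
row = [1.0, 0.2, 0.9, 0.5]
print(top_related(row, idx=0))  # [2, 3]
```

Post 0's self-similarity of 1.0 is discarded, and the remaining scores sort to posts 2 and 3 as its nearest neighbours.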

Putting It All Together

You might combine these functions in a single management command, scheduled task, or an admin panel action that can be triggered manually. A rough outline could look like this:

def update_all_blog_related_posts():
    ensure_nltk_downloads()
    
    # 1. Fetch and process content
    contents, post_ids, posts = process_posts_content()
    if not contents:
        print("No content found.")
        return

    # 2. Calculate similarities
    cosine_sim_matrix = compute_similarities(contents)

    # 3. Update related post relationships
    total_posts = update_related_posts(cosine_sim_matrix, post_ids, posts)
    print(f"Successfully updated related posts for {total_posts} blog posts.")

With this in place, you can periodically run update_all_blog_related_posts() to keep your related post suggestions fresh or invoke it whenever you publish new content.


Conclusion

Building a system to automatically suggest related content enhances the user’s reading experience and encourages them to explore more of your blog. By leveraging NLTK for text processing and TF-IDF with cosine similarity, you can generate smart suggestions that keep readers engaged.

Key Takeaways:

  1. Text Pre-processing: Extract, clean, and tokenise your content before feeding it to any vectoriser.
  2. TF-IDF: Effectively weights how important words are across multiple documents.
  3. Cosine Similarity: Provides a straightforward measure for how closely matched two vectors are.
  4. Dynamic Updates: A scheduled task ensures your relationships remain relevant as new posts are added.

Feel free to extend and modify this approach—experiment with different vectorisers (like CountVectorizer, or adding n-grams), or try advanced NLP techniques like word embeddings for more nuanced similarity suggestions.

© 2025 Matthew Clarkson. All rights reserved.