Automating Related Blog Posts in Django
One of the best ways to keep your audience engaged and on your site is to show them relevant content—such as related blog posts—right when they finish reading something they like. In this blog post, we’ll walk through a Django-based implementation that uses Natural Language Processing (NLP) and machine learning libraries to automatically figure out which blog posts are closely related to each other.
The code samples assume blog posts published with the Wagtail CMS, but the approach is easily customisable for any text field.
Overview
Here is a quick outline of the approach:
- Load and Pre-process Content: We fetch all published blog posts, convert their rich text fields into plain text, and store them for further processing.
- Compute Text Similarities: We use TF-IDF (Term Frequency–Inverse Document Frequency) vectors and a cosine similarity matrix to measure how similar each blog post is to every other post.
- Update Related Posts: Finally, for each blog post, we select the top few posts with the highest similarity scores and link them in our database as related content.
Let’s look at the code step by step.
NLTK Downloads
We need to ensure certain NLTK resources are available for text processing—most commonly, tokenisers, stopwords, and lemmatisers. We handle this in a small helper function:
```python
import nltk


def ensure_nltk_downloads():
    """Ensure required NLTK data is downloaded"""
    downloads = ["punkt", "stopwords", "wordnet", "punkt_tab"]
    for item in downloads:
        nltk.download(item)
```
- `punkt`: Tokeniser for sentences and words.
- `stopwords`: A list of common words (like “the”, “an”, etc.) that we usually filter out.
- `wordnet`: Used for lemmatisation, giving us more uniform handling of words (e.g., “runs”, “running”, and “ran” all map to “run”).
- `punkt_tab`: A table that helps `punkt` handle certain abbreviations and punctuation rules.
Fetching and Processing Blog Post Content
The next step is reading all published posts from the database. We focus on the `body` of each post, converting any rich text blocks to plain text.
```python
from web.models import BlogPage, BlogPageRelatedPosts


def process_posts_content(progress_callback=None, specific_post_id=None):
    """
    Process blog posts content for similarity analysis.
    Can process all posts or focus on a specific post and its potential relations.
    """
    # Get all published posts
    posts = BlogPage.objects.live().specific()
    if specific_post_id:
        # We might only want to update related posts for a single post,
        # but we still fetch all for comparison
        target_post = BlogPage.objects.get(id=specific_post_id)
        if not target_post.live:
            return [], [], []
    if not posts:
        return [], [], []

    contents = []
    post_ids = []
    posts_list = []
    for post in posts:
        # Skip posts with no body content
        if not post.body:
            continue
        # Convert StreamField content to plain text
        content_parts = []
        for block in post.body:
            # Handle RichText blocks by reading their source
            if hasattr(block, "value"):
                if hasattr(block.value, "source"):
                    content_parts.append(str(block.value.source))
                else:
                    content_parts.append(str(block.value))
        content = " ".join(content_parts)
        if content.strip():
            contents.append(content)
            post_ids.append(post.id)
            posts_list.append(post)
            if progress_callback:
                progress_callback(f"Processed post: {post.title}")
    return contents, post_ids, posts_list
```
Key points:
- `BlogPage.objects.live().specific()` (Wagtail-specific manager methods) grabs all published (`live`) blog posts with their specific model fields available.
- For each post, we extract text from `body` blocks. If a block contains a `RichText` object, we read out its `.source` attribute to get the raw HTML or text representation.
- We build a `contents` list (all post bodies), a `post_ids` list, and a `posts_list` of actual `BlogPage` objects.
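Because `.source` returns raw HTML, the markup itself ends up in the text we vectorise. If you would rather compare clean prose, a small stripping step can be slotted in before the content is collected. This is a sketch (`html_to_text` is our own name); for messy markup a real parser such as BeautifulSoup is the safer choice:

```python
import re


def html_to_text(source):
    """Strip tags from a RichText block's raw HTML source.

    A simple regex is enough for well-formed CMS output; anything
    messier deserves a proper HTML parser.
    """
    text = re.sub(r"<[^>]+>", " ", source)
    # Collapse the whitespace left behind by removed tags
    return re.sub(r"\s+", " ", text).strip()


print(html_to_text("<p>Hello <b>world</b></p>"))  # Hello world
```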
Computing Similarities with TF-IDF and Cosine Similarity
We use TfidfVectorizer
from scikit-learn to transform our text into numerical vectors, then compute their similarity using cosine_similarity
.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def compute_similarities(contents):
    """Compute TF-IDF and similarity matrix"""
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(contents)
    return cosine_similarity(tfidf_matrix)
```
- `TfidfVectorizer`: Creates a vector representation for each document (blog post) where the value of each word is weighted by how important it is across all documents.
- `cosine_similarity`: Calculates pairwise similarity scores, producing a matrix (2D array) where each row and column corresponds to a post and each value is how similar those posts are.
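A quick toy run shows the shape of the result (the example documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

contents = [
    "django makes web development fast",
    "fast web development with django",
    "cats enjoy sleeping in the sun",
]

tfidf_matrix = TfidfVectorizer().fit_transform(contents)
sim = cosine_similarity(tfidf_matrix)

print(sim.shape)  # (3, 3) — one row and column per document
# Every document is perfectly similar to itself (diagonal of 1.0),
# and the two Django posts score far higher with each other
# than either does with the unrelated post about cats.
print(sim[0][1] > sim[0][2])
```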
Updating Related Posts in the Database
Once we have our similarity matrix, we loop through each post and grab the top results. We then create or update related post objects in the database.
```python
def update_related_posts(
    cosine_sim_matrix, post_ids, posts, num_relations=6, progress_callback=None
):
    total_posts = len(posts)
    for idx, post in enumerate(posts):
        if progress_callback:
            progress_callback(
                f'Updating related posts for {idx + 1}/{total_posts}: "{post.title}"'
            )
        # Similarity scores for the current post
        sim_scores = list(enumerate(cosine_sim_matrix[idx]))
        # Exclude the same post from its own related list
        sim_scores = [score for score in sim_scores if score[0] != idx]
        # Sort by highest similarity first
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
        top_indexes = [i[0] for i in sim_scores[:num_relations]]
        related_post_ids = [post_ids[i] for i in top_indexes]
        # filter(id__in=...) does not preserve the order of related_post_ids,
        # so re-sort the results by similarity rank before saving
        related_posts = sorted(
            BlogPage.objects.filter(id__in=related_post_ids),
            key=lambda p: related_post_ids.index(p.id),
        )
        # Remove existing related posts
        post.related_posts.all().delete()
        # Create new related posts entries
        for order, related_post in enumerate(related_posts):
            BlogPageRelatedPosts.objects.create(
                page=post, post=related_post, sort_order=order
            )
    return total_posts
```
- `num_relations=6`: We limit results to the top six related posts, but you can adjust this as needed.
- We first enumerate similarities, remove the current post from its own list, then sort by similarity in descending order.
- We query the actual `BlogPage` objects with `id__in=related_post_ids` and update the `BlogPageRelatedPosts` model accordingly.
- Before we insert new data, we delete existing related-post relationships to avoid duplicates.
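The selection logic itself is plain Python, so it is easy to sanity-check with a hand-made matrix (the IDs and scores below are invented):

```python
# A toy 4x4 similarity matrix: row i, column j = similarity of posts i and j
cosine_sim_matrix = [
    [1.00, 0.10, 0.80, 0.30],
    [0.10, 1.00, 0.20, 0.70],
    [0.80, 0.20, 1.00, 0.05],
    [0.30, 0.70, 0.05, 1.00],
]
post_ids = [11, 22, 33, 44]


def top_related(idx, num_relations=2):
    """Return the IDs of the most similar posts, best match first."""
    # Pair each column index with its similarity to post `idx`
    sim_scores = list(enumerate(cosine_sim_matrix[idx]))
    # Exclude the post itself
    sim_scores = [score for score in sim_scores if score[0] != idx]
    # Highest similarity first
    sim_scores.sort(key=lambda x: x[1], reverse=True)
    return [post_ids[i] for i, _ in sim_scores[:num_relations]]


print(top_related(0))  # [33, 44] — post 33 scores 0.80, post 44 scores 0.30
```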
Putting It All Together
You might combine these functions in a single management command, scheduled task, or an admin panel action that can be triggered manually. A rough outline could look like this:
```python
def update_all_blog_related_posts():
    ensure_nltk_downloads()
    # 1. Fetch and process content
    contents, post_ids, posts = process_posts_content()
    if not contents:
        print("No content found.")
        return
    # 2. Calculate similarities
    cosine_sim_matrix = compute_similarities(contents)
    # 3. Update related post relationships
    total_posts = update_related_posts(cosine_sim_matrix, post_ids, posts)
    print(f"Successfully updated related posts for {total_posts} blog posts.")
```
With this in place, you can periodically run `update_all_blog_related_posts()` to keep your related post suggestions fresh, or invoke it whenever you publish new content.
Conclusion
Building a system to automatically suggest related content enhances the user’s reading experience and encourages them to explore more of your blog. By leveraging NLTK for text processing and TF-IDF with cosine similarity, you can generate smart suggestions that keep readers engaged.
Key Takeaways:
- Text Pre-processing: Extract, clean, and tokenise your content before feeding it to any vectoriser.
- TF-IDF: Effectively weights how important words are across multiple documents.
- Cosine Similarity: Provides a straightforward measure for how closely matched two vectors are.
- Dynamic Updates: A scheduled task ensures your relationships remain relevant as new posts are added.
Feel free to extend and modify this approach: experiment with different vectorisers (like `CountVectorizer`, or adding n-grams), or try advanced NLP techniques like word embeddings for more nuanced similarity suggestions.