20 Jul 2024

Automate URL Submission to Google Search Index API: A Step-by-Step Guide in Python

Learn how to automate URL submissions to Google's Search Index API using a Python script.

In this blog post, we’ll break down a Python script that automates the process of submitting URLs from a sitemap to Google’s Search Index API.

This script is particularly useful for webmasters and SEO professionals who want their pages indexed promptly by Google, especially on large sites with thousands of pages.

Set Up the Environment

Here’s a list of the dependencies you’ll need to install via pip:

  • lxml
  • requests
  • google-auth
  • google-auth-oauthlib
  • google-auth-httplib2

It’s a good practice to use a virtual environment for your Python projects. Here’s a step-by-step guide to set up your environment:

  1. Create a virtual environment: python -m venv myenv
  2. Activate the virtual environment:
    • On Windows: myenv\Scripts\activate
    • On macOS and Linux: source myenv/bin/activate
  3. Install the required dependencies:
    • pip install lxml requests google-auth google-auth-oauthlib google-auth-httplib2
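
Before moving on, it can help to confirm that the packages actually landed in the active virtual environment. The snippet below is a minimal sketch of one way to check. The function name find_missing_modules and the REQUIRED_MODULES list are illustrative, not part of the original script, and the list covers only the modules that indexer.py imports directly.

```python
import importlib.util

# Import names used by indexer.py. Note that pip package names differ from
# import names (the google-auth package is imported as google.auth).
REQUIRED_MODULES = ["requests", "lxml", "google.auth"]


def find_missing_modules(module_names):
    """Return the module names that cannot be found in this environment."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except ModuleNotFoundError:
            # Raised when a parent package (e.g. "google") is itself missing.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_modules(REQUIRED_MODULES)
    print("Missing modules:", missing or "none")
```

If the list comes back non-empty, the virtual environment is probably not activated, or the pip install step ran against a different interpreter.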

Part 1: Write The Script!

Create The Script File

First, let’s import the necessary libraries and define our constants in a new file called indexer.py:

import requests
import csv
import os
from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession
from lxml import etree

SCOPES = ["https://www.googleapis.com/auth/indexing"]
API_URL = "https://indexing.googleapis.com/v3/urlNotifications:publish"
SITEMAP_URL = "https://EXAMPLE.COM/sitemap.xml"
CSV_FILE = "submitted_urls.csv"
SERVICE_ACCOUNT_FILE = "./my-key.json"

These imports and constants set the stage for our script, defining the necessary API endpoints, file paths, and authentication scopes.

Authentication with Google API

To interact with the Google Search Index API, we need to authenticate our requests. We’ll use a service account for this purpose.

Add the following to the bottom of the indexer.py file.

def get_authenticated_session():
    try:
        credentials = service_account.Credentials.from_service_account_file(
            SERVICE_ACCOUNT_FILE, scopes=SCOPES
        )
        return AuthorizedSession(credentials)
    except Exception as e:
        print(f"Error creating authenticated session: {e}")
        raise

This function reads the service account JSON key file and creates an authenticated session that we’ll use for our API requests.

Managing Submitted URLs

To avoid duplicate submissions, we’ll keep track of URLs we’ve already submitted.

Add the following to the bottom of the indexer.py file.

def read_submitted_urls(csv_file):
    if not os.path.exists(csv_file):
        return set()
    
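    # newline='' is the mode the csv module recommends for reading and writing
    with open(csv_file, mode='r', newline='') as file:
        reader = csv.reader(file)
        # Skip blank lines so row[0] cannot raise IndexError
        return {row[0] for row in reader if row}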

def write_submitted_url(csv_file, url):
    with open(csv_file, mode='a', newline='') as file:
        writer = csv.writer(file)
        writer.writerow([url])

These functions read from and write to a CSV file, maintaining a record of submitted URLs.
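
To see how the tracking file behaves, here is a small round-trip sketch that writes two placeholder URLs to a throwaway CSV and reads them back. The two helpers are repeated so the example runs on its own. The demo_round_trip function and the example.com URLs are illustrative only.

```python
import csv
import os
import tempfile


def read_submitted_urls(csv_file):
    """Return the set of URLs recorded in csv_file (empty if it is missing)."""
    if not os.path.exists(csv_file):
        return set()
    with open(csv_file, mode="r", newline="") as file:
        return {row[0] for row in csv.reader(file) if row}


def write_submitted_url(csv_file, url):
    """Append one URL as a single-column row."""
    with open(csv_file, mode="a", newline="") as file:
        csv.writer(file).writerow([url])


def demo_round_trip():
    """Write two placeholder URLs to a temporary CSV and read them back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        csv_path = os.path.join(tmp_dir, "submitted_urls.csv")
        before = read_submitted_urls(csv_path)  # file does not exist yet
        write_submitted_url(csv_path, "https://example.com/a")
        write_submitted_url(csv_path, "https://example.com/b")
        after = read_submitted_urls(csv_path)
    return before, after


if __name__ == "__main__":
    print(demo_round_trip())
```

Because the file is opened in append mode, each successful submission is recorded immediately, so an interrupted run loses at most the URL that was in flight.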

Submitting URLs to Google Indexing API

Here’s the core function that submits a URL to Google’s Indexing API.

Add the following to the bottom of the indexer.py file.

def submit_url(authed_session, url):
    headers = {
        "Content-Type": "application/json"
    }
    data = {
        "url": url,
        "type": "URL_UPDATED"
    }
    response = authed_session.post(API_URL, headers=headers, json=data)
    if response.status_code == 403:
        print("Permission error. Please ensure your service account has the correct permissions.")
    elif response.status_code == 429:
        print(f"Rate limit exceeded: {url}, Status code: {response.status_code}, Response: {response.text}")
    elif response.status_code != 200:
        print(f"Failed to submit: {url}, Status code: {response.status_code}, Response: {response.text}")
    return response.status_code

This function sends a single POST request for the given URL and returns the HTTP status code. It prints a message for permission errors (403), rate limiting (429), and any other non-200 response.
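
The request body is small enough to build and inspect separately. The Indexing API accepts two notification types, URL_UPDATED and URL_DELETED, and the script above only ever sends the first. The sketch below shows how the body could be built for either case. The build_notification helper and its removed parameter are hypothetical and not part of the original script.

```python
import json


def build_notification(url, removed=False):
    """Build the JSON body for a urlNotifications:publish request.

    removed=False announces a new or updated page (URL_UPDATED);
    removed=True announces that the page is gone (URL_DELETED).
    """
    notification_type = "URL_DELETED" if removed else "URL_UPDATED"
    return {"url": url, "type": notification_type}


if __name__ == "__main__":
    # Print the payloads without sending anything over the network.
    print(json.dumps(build_notification("https://example.com/new-page")))
    print(json.dumps(build_notification("https://example.com/old-page", removed=True)))
```

Printing the payload this way is a cheap check before spending any of the daily quota on real requests.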

Parsing the Sitemap

To get the URLs we want to submit, we need to parse the sitemap, so add the following to the bottom of the indexer.py file.

def get_sitemap_urls(sitemap_url):
    # A timeout keeps the script from hanging if the server never responds
    response = requests.get(sitemap_url, timeout=30)
    if response.status_code != 200:
        print(f"Failed to fetch sitemap: {sitemap_url}, Status code: {response.status_code}")
        return []
    
    root = etree.fromstring(response.content)
    return [loc.text for loc in root.findall(".//{http://www.sitemaps.org/schemas/sitemap/0.9}loc")]

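This function fetches the sitemap XML and extracts every URL found in a <loc> tag. If SITEMAP_URL points to a sitemap index rather than a regular sitemap, those <loc> values are the URLs of child sitemaps, not of pages.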
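
A sitemap can be either a urlset (a list of pages) or a sitemapindex (a list of other sitemaps), and both use <loc> tags. The sketch below tells the two apart by looking at the root element. Note that it swaps in the standard library's xml.etree.ElementTree instead of lxml so it runs without extra dependencies. The parse_sitemap name and the sample documents are illustrative.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

SAMPLE_URLSET = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

SAMPLE_INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
</sitemapindex>"""


def parse_sitemap(xml_bytes):
    """Return (kind, locs), where kind is the root tag without its namespace.

    kind is "urlset" for a regular sitemap and "sitemapindex" for an index
    whose <loc> entries point at child sitemaps.
    """
    root = ET.fromstring(xml_bytes)
    kind = root.tag.replace(SITEMAP_NS, "")
    locs = [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
    return kind, locs


if __name__ == "__main__":
    print(parse_sitemap(SAMPLE_URLSET))
    print(parse_sitemap(SAMPLE_INDEX))
```

When kind comes back as "sitemapindex", each returned location would need to be fetched and parsed in turn before any page URLs are submitted.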

Putting It All Together

Finally, let’s look at the main function that orchestrates the entire process. Add this to the bottom of the file:

def main():
    authed_session = get_authenticated_session()
    submitted_urls = read_submitted_urls(CSV_FILE)
    sitemap_urls = get_sitemap_urls(SITEMAP_URL)

    for url in sitemap_urls:
        if url not in submitted_urls:
            status_code = submit_url(authed_session, url)
            if status_code == 200:
                write_submitted_url(CSV_FILE, url)
                print(f"Successfully submitted: {url}")
            elif status_code == 429:
                print("Stopping due to rate limit.")
                break
            else:
                print(f"Failed to submit: {url}, Status code: {status_code}")

if __name__ == "__main__":
    main()

The main() function:

  1. Creates an authenticated session
  2. Reads previously submitted URLs
  3. Fetches URLs from the sitemap
  4. Iterates through each URL, submitting those that haven’t been submitted before
  5. Handles successful submissions and potential errors
  6. Stops if it encounters a rate limit
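
The selection logic in steps 2 to 4 can also be pulled out into a pure function, which makes it easy to test without touching the network. The sketch below additionally caps each run at the default quota of 200 submissions per day, so the script would stop before the API returns a 429. The select_pending name and its daily_limit parameter are assumptions for illustration, not part of the original script.

```python
def select_pending(sitemap_urls, submitted_urls, daily_limit=200):
    """Return up to daily_limit URLs that have not been submitted yet.

    Sitemap order is preserved, and duplicate sitemap entries are only
    returned once.
    """
    pending = []
    seen = set(submitted_urls)
    for url in sitemap_urls:
        if len(pending) >= daily_limit:
            break
        if url in seen:
            continue
        seen.add(url)
        pending.append(url)
    return pending


if __name__ == "__main__":
    sitemap = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    already = {"https://example.com/b"}
    print(select_pending(sitemap, already, daily_limit=2))
```

In main(), the loop could then iterate over the result of select_pending(sitemap_urls, submitted_urls) instead of checking each URL inline.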

Part 2: Setting Up Access to Google Search Index API

Before you can use the script we’ve discussed, you need to set up access to Google’s Search Index API. This process involves several steps to ensure secure and authorised access to the API.

Let’s walk through the setup process.

1. Create a Google Cloud Project

First, you need to create a project in the Google Cloud Console:

  1. Go to the Google Cloud Console.
  2. Click on the project drop-down and select “New Project”.
  3. Give your project a name and click “Create”.
  4. Make note of your Project ID, as you’ll need it later.

2. Enable the Indexing API

Once your project is created, you need to enable the Indexing API:

  1. In the Google Cloud Console, go to the “APIs & Services” dashboard.
  2. Click on “+ ENABLE APIS AND SERVICES” at the top of the page.
  3. Search for “Indexing API” and select it.
  4. Click “Enable” to activate the API for your project.

3. Create a Service Account

To authenticate your requests to the API, you’ll need a service account:

  1. In the Google Cloud Console, navigate to “IAM & Admin” > “Service Accounts”.
  2. Click “Create Service Account” at the top of the page.
  3. Give your service account a name and description.
  4. For the role, choose “Owner” (you can refine this later for better security).
  5. Click “Continue” and then “Done”.

4. Generate a JSON Key

After creating the service account, you need to generate a JSON key:

  1. Find your newly created service account in the list and click on it.
  2. Go to the “Keys” tab.
  3. Click “Add Key” and choose “Create new key”.
  4. Select “JSON” as the key type and click “Create”.
  5. The JSON key file will be downloaded to your computer. Keep this file secure and don’t share it publicly. Save it to the path set in SERVICE_ACCOUNT_FILE in your script (./my-key.json in this example).

5. Configure Your Website Property

To use the Indexing API, you need to verify ownership of your website:

  1. Go to Google Search Console.
  2. Add your website as a property if you haven’t already.
  3. Verify ownership using one of the provided methods (e.g., HTML file upload, DNS record, etc.).

6. Grant Your Service Account Access in Search Console

A critical step in this process is to add your service account as an Owner of your property in Google Search Console. This is essential for the API to have the necessary permissions to submit URLs for indexing:

  1. Go to Google Search Console.
  2. Select the property you want to manage with the API.
  3. Click on “Settings” (gear icon) in the left sidebar.
  4. Select “Users and permissions” from the menu.
  5. Click the “Add user” button.
  6. In the “Add user” dialog:
    • Enter the email address of your service account. This will look like: your-service-account-name@your-project-id.iam.gserviceaccount.com
    • For the permission level, you must select “Owner”. This is crucial: the service account needs full owner permissions to use the Indexing API.
  7. Click “Add” to grant access.

It’s important to note:

  • The service account must be added as an Owner, not just a user with lower-level permissions.
  • You may need to wait a short while (usually a few minutes, but sometimes up to 24 hours) for the permissions to propagate fully after adding the service account.
  • If you’re using a property set in Search Console, make sure to add the service account as an owner to the property set, not just individual properties within it.
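
If you are unsure which email address to add, it is stored in the downloaded key file under the client_email field. The sketch below reads it back. The service_account_email helper is hypothetical, and the path in the usage line assumes the key was saved as ./my-key.json as in the script.

```python
import json


def service_account_email(key_path):
    """Return the client_email stored in a service account JSON key file."""
    with open(key_path, encoding="utf-8") as key_file:
        key = json.load(key_file)
    if key.get("type") != "service_account":
        raise ValueError(f"{key_path} does not look like a service account key")
    return key["client_email"]


if __name__ == "__main__":
    # Assumes the key file sits next to the script, as in SERVICE_ACCOUNT_FILE.
    print(service_account_email("./my-key.json"))
```

The printed address is the one to paste into the “Add user” dialog in Search Console.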

Security Reminder!

Remember that granting Owner permissions gives the service account extensive control over your Search Console property. Always keep your service account credentials secure and monitor its usage regularly. If you’re working in a team environment, make sure to follow your organization’s security protocols when managing these permissions.

Part 3: Run the Script!

Now, if you’ve done everything correctly, you should be able to run the script. Run this in your terminal from the folder that contains the script file:

python indexer.py

This prints each URL that is submitted successfully and records it in the CSV file. Once you hit the rate limit (200 submissions per day by default), the script stops. Re-run it the next day and it will pick up where it left off, skipping the URLs already recorded.
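
For a large site, it is worth estimating how many daily runs the backlog will take. The sketch below does that arithmetic, assuming the default quota of 200 submissions per day and that every run uses the full quota. The runs_needed function is illustrative only.

```python
import math


def runs_needed(total_urls, already_submitted=0, daily_quota=200):
    """Estimate how many daily runs are needed to submit the remaining URLs."""
    remaining = max(total_urls - already_submitted, 0)
    return math.ceil(remaining / daily_quota)


if __name__ == "__main__":
    # A hypothetical 5,000-page sitemap with nothing submitted yet.
    print(runs_needed(5000))
```

With a 5,000-page sitemap and nothing submitted yet, that comes to 25 daily runs.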

Conclusion

This script provides an automated way to submit URLs from your sitemap to Google’s Search Index API. By running this script periodically, you can ensure that Google is always aware of your latest content, potentially improving your site’s visibility in search results.

Remember to set up your Google Cloud project, enable the Indexing API, and create a service account with the necessary permissions before using this script. Also, be mindful of Google’s quota limits for the Indexing API to avoid overuse.

By leveraging this automation, you can streamline your SEO workflow and focus on creating great content while ensuring it gets indexed promptly.

© 2024 Matthew Clarkson. All rights reserved.