Python-Powered Sitemaps.
A sitemap is an essential tool for SEO, as it helps search engines understand the structure of your website and index your pages more efficiently. By providing a clear and organized map of your website's content, a sitemap ensures that search engines can easily crawl and discover all the relevant pages on your site. With a well-structured sitemap, search engines can prioritize indexing your most important pages, which can significantly improve your website's visibility in search engine results pages (SERPs).
- Updated: 2024-11-20 by Andrey BRATUS, Senior Data Analyst.
- Sitemaps also play a crucial role in helping search engines understand the relationship between different pages on your site, such as parent-child relationships or hierarchical structures.
- When you update or add new content to your website, having a sitemap ensures that search engines are promptly notified about these changes, leading to faster indexing and better visibility.
- A sitemap can also help with optimizing your website's internal linking structure by highlighting important pages and ensuring that they receive proper link equity from other pages on your site.
- Including a sitemap on your website demonstrates to search engines that you care about user experience and accessibility, as it makes it easier for both search engines and users to navigate and find relevant information.
- Sitemaps are particularly beneficial for larger websites with numerous pages, as they provide a systematic way to organize and present content, reducing the chances of important pages being overlooked by search engines.
- By guiding search engine crawlers through your website's structure, a sitemap can help identify and fix potential crawl errors or broken links, improving the overall health and performance of your site.
Python Sitemap Generator code:

The logic of the Python code below is simple:
- It scans all pages of the site, and each page is scanned only once. You first need to run pip install requests beautifulsoup4, which installs the libraries that fetch the pages and parse the HTML.
- The number of scanned pages is limited to 500 by default; adjust the limit to your needs.
- Only URLs from your own domain are included in the results.
- Page URLs containing '?page=' are skipped to avoid non-canonical pagination URLs; you can disable this filter.
- The output is in XML format with the fields recommended by Google.
- The output contains a lastmod tag; in this code it is randomly set to the scan date, one day before, or two days before the scan. Feel free to adjust this.
import requests
from bs4 import BeautifulSoup
import random
import datetime


def generate_sitemap(domain, limit=500):
    # Start the sitemap with the XML declaration and the opening <urlset> tag
    sitemap = '<?xml version="1.0" encoding="UTF-8"?>\n'
    sitemap += '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    scanned_pages = set()
    scanned_count = 0

    def scan_page(url):
        nonlocal scanned_count
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        scanned_pages.add(url)
        scanned_count += 1
        # Recursively follow links that stay on the same domain,
        # skipping pagination URLs and pages that were already scanned
        for link in soup.find_all('a'):
            if scanned_count >= limit:
                break
            href = link.get('href')
            if href and domain in href and '?page=' not in href and href not in scanned_pages:
                scan_page(href)

    scan_page(domain)

    # Add one <url> entry per scanned page
    for page in scanned_pages:
        lastmod_date = get_lastmod_date()
        sitemap += '\t<url>\n'
        sitemap += f'\t\t<loc>{page}</loc>\n'
        sitemap += f'\t\t<lastmod>{lastmod_date}</lastmod>\n'
        sitemap += '\t</url>\n'
    sitemap += '</urlset>'
    return sitemap


# Function to get the date for the <lastmod> tag
def get_lastmod_date():
    option = random.randint(1, 3)
    if option == 1:
        lastmod_date = datetime.date.today()
    elif option == 2:
        lastmod_date = datetime.date.today() - datetime.timedelta(days=1)
    else:
        lastmod_date = datetime.date.today() - datetime.timedelta(days=2)
    return lastmod_date


# Example usage
domain = 'https://python-code.pro/'
sitemap = generate_sitemap(domain)
print(sitemap)
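In practice you will usually want to write the generated sitemap to a file (conventionally sitemap.xml at your site root) rather than just print it. The minimal sketch below assumes the generate_sitemap function defined above; it uses the standard library's xml.etree.ElementTree only as a sanity check that the output is well-formed before saving it.

import xml.etree.ElementTree as ET

domain = 'https://python-code.pro/'
sitemap_xml = generate_sitemap(domain)

# Sanity check: fromstring() raises ParseError if the generated XML is malformed
ET.fromstring(sitemap_xml.encode('utf-8'))

# Save the result and serve it from your site root, e.g. https://python-code.pro/sitemap.xml
with open('sitemap.xml', 'w', encoding='utf-8') as f:
    f.write(sitemap_xml)

For production use you may also want to pass a timeout to requests.get and wrap the request in a try/except block, so that a single slow or unreachable page does not stop the whole crawl.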
Conclusion:

In conclusion, having a well-optimized sitemap is vital for SEO, as it ensures that search engines can effectively crawl, index, and understand your website's content, ultimately leading to improved visibility, higher rankings, and increased organic traffic.