How to check websites' HTML using Python.

Why should you check HTML?


Most modern websites are built with frameworks or other tools, and by default we expect them to produce correct markup. The problem is that unexpected issues appear once CSS, Bootstrap and JavaScript are layered on top of plain HTML.
Even more problems appear when we add third-party code such as ad-network snippets, counters and tracking code, embedded external videos, and so on.
HTML markup mistakes can hurt your search engine ranking and should be eliminated.




How will we validate markup?


There is a nice, free online service for validating the markup behind a URL: the Nu Html Checker, which provides a simple API for running HTML checks. We will call it with the Python requests library. You do not even need a personal API key, but please be careful not to spam the service with too many requests at a time.

Validating a URL returns a list of errors, warnings and non-critical informational messages. Ideally you should get rid of all errors and warnings.

The output is in JSON format, which can easily be converted to any suitable form for further visualisation and analysis. The HTML check data can help you improve your website's ranking position.
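As a rough illustration of turning the JSON output into something easier to analyse, the sketch below counts messages by severity. It assumes the report has a "messages" list whose items carry a "type" key (and, for warnings, "subType": "warning"), which is how the checker's JSON output appears to be structured. The function name summarize_messages and the sample_report data are made up for this example and are not real checker output.

```python
from collections import Counter


def summarize_messages(report):
    """Count checker messages by severity label (hypothetical helper)."""
    counts = Counter()
    for message in report.get("messages", []):
        message_type = message.get("type", "unknown")
        # Assumption: warnings are reported as type "info" with subType "warning".
        if message_type == "info" and message.get("subType") == "warning":
            label = "warning"
        else:
            label = message_type
        counts[label] += 1
    return dict(counts)


# Made-up sample with the same general shape as a checker report.
sample_report = {
    "url": "https://example.com/",
    "messages": [
        {"type": "error", "message": "sample error text"},
        {"type": "info", "subType": "warning", "message": "sample warning text"},
        {"type": "info", "message": "sample info text"},
    ],
}

print(summarize_messages(sample_report))
```

The resulting dictionary can be passed on to whatever reporting or plotting tool you prefer.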



Python code to check websites' HTML:


Hint: you should specify your own website's URL. You can also extend the code to validate several URLs in a loop.



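import requests

# Nu Html Checker endpoint
validator_url = "https://validator.w3.org/nu/"
# Page whose markup will be checked
page_url = "https://python-code.pro/"

# "doc" is the page to check, "out" selects JSON output
params = {"doc": page_url, "out": "json"}

response = requests.get(validator_url, params=params, timeout=30)
response.raise_for_status()  # stop early on an HTTP error

# In a notebook this displays the parsed report; wrap it in print() in a script
response.json()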


HTML validation data output:



{'url': 'https://python-code.pro/',
 'messages': [{'type': 'info',
   'lastLine': 44,
   'lastColumn': 62,
   'firstColumn': 1,
   'message': 'Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.',
   'extract': 'grammer">\n\n,
   'hiliteStart': 10,
   'hiliteLength': 62},

],
 'language': 'en'} 
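The hint about checking several URLs in a loop could be sketched roughly as follows. The function names fetch_report and check_pages, the one-second pause and the 30-second timeout are assumptions chosen for illustration. The service's actual rate limits are not stated here, so adjust the delay to be safe.

```python
import time

VALIDATOR_URL = "https://validator.w3.org/nu/"


def fetch_report(page_url):
    """Ask the checker for a JSON report on a single page."""
    # Imported here so check_pages() can be tried with a stand-in fetcher
    # even where requests is not installed.
    import requests

    response = requests.get(
        VALIDATOR_URL,
        params={"doc": page_url, "out": "json"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def check_pages(page_urls, fetch=fetch_report, delay_seconds=1.0):
    """Return {page_url: number of messages}, pausing between requests."""
    results = {}
    for index, page_url in enumerate(page_urls):
        if index > 0:
            # Assumed pause to avoid sending too many requests at once.
            time.sleep(delay_seconds)
        report = fetch(page_url)
        results[page_url] = len(report.get("messages", []))
    return results
```

Calling check_pages(["https://python-code.pro/"]) would send one request per listed URL. Passing a different fetch function makes it possible to try the loop without touching the network.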



