This assignment is all about finding the right libraries as well as the right level of expressiveness to code what seems like a very complex task in very few lines of (still readable!) code.
First, write a Python program that issues a Google Custom Search query to get (the names of) files containing HTML; you might even want to make your query more precise and ask for certain flavour(s) of HTML. You will find this SO answer useful. Take either 100 files from the web (obtained from your query) or 20 sites (and thus all the HTML files you can find there), and run these files/sites through the W3C HTML Validator. You can run the validator locally, but you get extra points for using the online version hosted on the W3C site. [In other words, you don't even need to download the files.] Collect all the errors and warnings you get, compute some statistics about this "bad HTML", and output them. Your code should be no longer than 100 lines of well-formatted code.
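To make the pipeline above concrete, here is a minimal sketch of its three pieces: building a search request, building a validator request, and tallying the messages that come back. It assumes the Custom Search JSON API endpoint and the JSON output of the W3C Nu checker (`out=json`, with a `messages` list whose entries carry `type`, an optional `subType`, and `message`). The function names and the sample messages are hypothetical, and no network call is made, so the sketch costs no quota.

```python
from collections import Counter
from urllib.parse import urlencode

# Assumed endpoints: Custom Search JSON API and the W3C-hosted Nu checker.
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"
VALIDATOR_ENDPOINT = "https://validator.w3.org/nu/"


def search_url(api_key, engine_id, query, start=1):
    """Build one search request; each request returns at most 10 results."""
    params = {"key": api_key, "cx": engine_id, "q": query,
              "start": start, "num": 10}
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def validator_url(page_url):
    """Ask the online validator to fetch page_url itself and answer in JSON."""
    return VALIDATOR_ENDPOINT + "?" + urlencode({"doc": page_url, "out": "json"})


def classify(message):
    """Return the message type, reporting 'info' with subType 'warning' as 'warning'."""
    kind = message.get("type", "unknown")
    if kind == "info" and message.get("subType") == "warning":
        return "warning"
    return kind


def tally(messages):
    """Count messages by kind, and count error/warning texts by frequency."""
    by_kind = Counter(classify(m) for m in messages)
    by_text = Counter(m.get("message", "") for m in messages
                      if classify(m) in ("error", "warning"))
    return by_kind, by_text


# Hypothetical messages, shaped like the validator's JSON output.
SAMPLE = [
    {"type": "error", "message": "Stray end tag."},
    {"type": "error", "message": "Stray end tag."},
    {"type": "info", "subType": "warning", "message": "Missing lang attribute."},
    {"type": "info", "message": "Document checked."},
]

if __name__ == "__main__":
    kinds, texts = tally(SAMPLE)
    print(dict(kinds))
    print(texts.most_common(3))
```

Since each search request returns at most 10 results, collecting 100 files means 10 requests (`start` = 1, 11, ..., 91), which is a tenth of the free daily quota for a single complete run.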
Beware: the Google search API allows only 100 free searches per day, so do not debug your code by massive testing, or you may get billed for it! Also, don't send too many requests to the W3C site at once, or you'll get blacklisted.
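One way to respect both limits is to wrap whatever function performs the HTTP request so that repeated URLs are answered from a cache and new requests are spaced out. The sketch below is one possible approach, not a required design: the name `make_polite` and the one-second default interval are assumptions, and the cache lives in memory only, so saving responses to disk would be needed to spare the quota across separate runs.

```python
import time


def make_polite(fetch, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Wrap fetch(url): cache results and wait between uncached requests."""
    cache = {}
    last_call = [None]  # time of the most recent real request, if any

    def polite_fetch(url):
        if url in cache:
            return cache[url]  # no network traffic, no quota used
        if last_call[0] is not None:
            wait = min_interval - (clock() - last_call[0])
            if wait > 0:
                sleep(wait)  # space out consecutive real requests
        last_call[0] = clock()
        cache[url] = fetch(url)
        return cache[url]

    return polite_fetch


if __name__ == "__main__":
    # Stand-in for a real HTTP call, so the demo touches no remote service.
    fake_fetch = lambda url: "response for " + url
    get = make_polite(fake_fetch, min_interval=0.1)
    print(get("http://example.com/"))
    print(get("http://example.com/"))  # served from the cache
```

While debugging, passing a stand-in for `fetch` that returns a saved response lets you exercise the rest of the program without spending searches or hitting the validator.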