In October’s issue I showed how to develop an HTML Container class. This month, we will use that class to develop a general purpose Web Crawler class. The HTML Container project, including a VB.NET version, can be downloaded from the VSJ web site.
Before getting started you will need to add the HTML Container class (WebWagon.dll) to your project. From the menu, choose Project|Add Reference. Click the Browse button, navigate to the location of WebWagon.dll, and click OK.
A Web Crawler – sometimes referred to as a spider or robot – is a process that visits a number of web pages programmatically, usually to extract some sort of information. For example, the popular search engine Google has a robot called googlebot that sooner or later visits virtually every page on the Internet for the purpose of indexing the words on that page. We are going to develop a general-purpose class that can be used as a basis for writing any type of robot. This class will be simple yet powerful. The heart of the class is a method called CrawlURL, which accepts a beginning URL. The contents of this URL will be loaded into the HTML Container class. For each link found on the page, CrawlURL calls itself recursively, repeating the process for each of those pages, and so on.
The basic process is pretty simple, but we must add a few more features in order to avoid spiralling into an infinite loop. First of all, we want to keep track of pages that we’ve already seen. Many sites are such that Page A points to Page B, which points back again to Page A – and the routine will soon be chasing its own tail if we don’t prevent it from doing so. A related problem has to do with the allowable recursion level. If left unbounded, many sites will cause the crawler to dig itself into a hopelessly deep hole, using large amounts of stack space and memory. It is also possible to encounter a submission form that points back to itself using a parameterized URL that fools the check for already-encountered links. For the sake of usability, we want to restrict the range so we can target a specific site or group of sites while ignoring everything else. Finally, as we will see later, robot etiquette requires that we maintain an Exclude list, so we may as well expose this list to the object’s caller too.
The first problem could be solved in a number of ways. We could have an array or collection representing the known URLs and a routine that would check to see if a given URL was already in the list. Actually, we can save a bit of work by using the .NET Queue class. This class has two methods that will be useful for this purpose – the Enqueue method, which adds an item to a queue, and the Contains method, which returns True if a given item is in the queue and False if not – exactly the functions needed for the task at hand.
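The duplicate check might be sketched like this, using the non-generic Queue from System.Collections. The class and method names here (VisitedList, TryAdd) are illustrative only, not part of the article’s WebCrawler class:

```csharp
using System;
using System.Collections;

class VisitedList
{
    // Queue of URLs we have already seen. Contains() gives us the
    // duplicate check and Enqueue() the insertion, so no separate
    // lookup routine is needed.
    private Queue visited = new Queue();

    // Returns true if the URL is new (and records it);
    // false if it has been seen before.
    public bool TryAdd(string url)
    {
        if (visited.Contains(url))
            return false;
        visited.Enqueue(url);
        return true;
    }
}
```

Note that Contains performs a linear scan, which is perfectly adequate for a modest crawl; a very large crawl might swap in a Hashtable instead.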
A property will be added to set or indicate the maximum recursion level. We will write a private version of CrawlURL, which accepts a URL as the public method does, but also expects a recursion-level parameter. The public interface will simply invoke the private method, passing zero, and the process will continue from there. When the maximum level has been reached, we will stop the recursion. When this happens, we don’t want to just throw away the links – if we did, we might drop pages that are not reachable by any other path. Instead, we will use another queue, populated with the links that would have been visited had we not exceeded the maximum level. These links will in turn be visited once the initial recursion has completed, and the process will start again. A lower recursion level thus saves memory without causing any links to be missed completely.
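The two-level structure described above might be sketched as follows. The in-memory Hashtable of links stands in for loading a page and extracting its HRefs, and all of the names here (Crawler, MaxLevel, pending, and so on) are assumptions for illustration, not the article’s actual members:

```csharp
using System;
using System.Collections;

class Crawler
{
    public int MaxLevel = 2;             // maximum recursion depth
    private Queue visited = new Queue(); // pages already crawled
    private Queue pending = new Queue(); // links shelved at the depth limit
    private Hashtable site;              // url -> array of links on that page

    public Crawler(Hashtable site) { this.site = site; }

    public Queue Visited { get { return visited; } }

    // Public entry point: start at level 0, then sweep up anything
    // that was shelved when the depth limit was reached.
    public void CrawlURL(string url)
    {
        CrawlURL(url, 0);
        while (pending.Count > 0)
            CrawlURL((string)pending.Dequeue(), 0);
    }

    private void CrawlURL(string url, int level)
    {
        if (visited.Contains(url))
            return;                      // don't chase our own tail
        if (level > MaxLevel)
        {
            pending.Enqueue(url);        // revisit in a later pass
            return;
        }
        visited.Enqueue(url);
        string[] links = (string[])site[url];
        if (links == null)
            return;
        foreach (string link in links)
            CrawlURL(link, level + 1);
    }
}
```

Because shelved links restart at level zero, a small MaxLevel bounds the stack depth of any single pass while the outer loop still reaches every page eventually.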
The Include list will be optional. If none is specified, any URL will be allowed unless it is found on the Exclude list. If one or more entries are present, the URL in question will be required to match an entry on the Include list. A partial path will be allowed, so the path:
http://www.bbc.co.uk/sports

…will include or exclude (depending on the list) anything that matches up to that point.
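Partial-path matching amounts to a prefix test. A minimal sketch, with illustrative names (UrlFilter, Allowed), might look like this:

```csharp
using System;
using System.Collections;

class UrlFilter
{
    private ArrayList includes = new ArrayList();
    private ArrayList excludes = new ArrayList();

    public void AddInclude(string prefix) { includes.Add(prefix); }
    public void AddExclude(string prefix) { excludes.Add(prefix); }

    // A URL matches a list entry when it begins with that entry.
    private static bool MatchesAny(ArrayList prefixes, string url)
    {
        foreach (string prefix in prefixes)
            if (url.StartsWith(prefix))
                return true;
        return false;
    }

    // Allowed when not excluded and, if an Include list exists,
    // matching at least one of its entries.
    public bool Allowed(string url)
    {
        if (MatchesAny(excludes, url))
            return false;
        return includes.Count == 0 || MatchesAny(includes, url);
    }
}
```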
A well-mannered robot

As with just about everything in life, there are certain rules that should be followed when writing a robot. There are two important bits of information that every robot should check before visiting a specific link. The first of these is the robots.txt file. This file may or may not be present – but if it is, it will be found in the root directory of the server. For example, the BBC’s robots.txt file can be found at:
http://www.bbc.co.uk/robots.txt

A robots.txt file consists of two parts – the robot name and the list of excludes meant for that specific robot. An asterisk (*) is used to indicate any robot not explicitly named – us, for example. Here is a sample entry from the BBC robots.txt file:
User-agent: *
Disallow: /cgi-bin

This tells us that we should not index any files found in the /cgi-bin path, which for this site would mean everything that begins with:
http://www.bbc.co.uk/cgi-bin/

Two important things should be noted about the robots.txt file. First of all, it has nothing to do with security. You can, if you wish, ignore the robots.txt file and traverse to your heart’s content – but it is not considered polite to do so. Secondly, it is often the case that the excluded paths contain duplicated or otherwise uninteresting information that you probably didn’t want to visit anyway. Often the robots.txt file is maintained by the webmaster as a courtesy to the robot writer rather than a hindrance.
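A simplified parser for the format described above might collect the Disallow paths from the “User-agent: *” section – the section that applies to any robot not explicitly named. Real files can also contain sections addressed to specific robots, plus comments and blank lines; this sketch deliberately ignores those, and the class name RobotsTxt is an assumption:

```csharp
using System;
using System.Collections;

class RobotsTxt
{
    // Returns the Disallow paths from the "User-agent: *" section(s).
    public static ArrayList DisallowedPaths(string robotsTxt)
    {
        ArrayList paths = new ArrayList();
        bool applies = false;
        foreach (string rawLine in robotsTxt.Split('\n'))
        {
            string line = rawLine.Trim();
            if (line.ToLower().StartsWith("user-agent:"))
            {
                // "User-agent:" is 11 characters long
                applies = line.Substring(11).Trim() == "*";
            }
            else if (applies && line.ToLower().StartsWith("disallow:"))
            {
                // "Disallow:" is 9 characters long; an empty path
                // means "nothing is disallowed"
                string path = line.Substring(9).Trim();
                if (path.Length > 0)
                    paths.Add(path);
            }
        }
        return paths;
    }
}
```

Each returned path is then treated as a prefix to skip, exactly as with the Exclude list above.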
While the robots.txt file applies to the entire site, a given page can also have meta-tags that contain robot information. There are two of these that we should be concerned with here – NoIndex and NoFollow, which may appear together, separately or not at all. Here is an example of a meta-tag that contains both values:
<meta name="robots" content="noindex,nofollow">

The NoIndex attribute requests that text not be indexed from that page. The NoFollow attribute requests that none of the links on that page be crawled.
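Checking for the two keywords might be sketched as a simple scan of the document source for a robots meta-tag. A production crawler would lean on the HTML Container’s own tag parsing rather than this string search, and the names here (RobotsMeta, Check) are illustrative:

```csharp
using System;

class RobotsMeta
{
    // Sets noIndex/noFollow according to a robots meta-tag in the
    // page source, if one is present. This naive scan assumes the
    // canonical form name="robots" and is case-insensitive.
    public static void Check(string html, out bool noIndex, out bool noFollow)
    {
        noIndex = false;
        noFollow = false;
        string lower = html.ToLower();
        int pos = lower.IndexOf("name=\"robots\"");
        if (pos < 0)
            return;                       // no robots meta-tag at all
        int end = lower.IndexOf('>', pos);
        if (end < 0)
            end = lower.Length;           // malformed tag; scan to end
        string tag = lower.Substring(pos, end - pos);
        noIndex = tag.IndexOf("noindex") >= 0;
        noFollow = tag.IndexOf("nofollow") >= 0;
    }
}
```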
We are just about ready to write the CrawlURL method, but first let’s take a look at the HTML Container class. There are two methods that we will find particularly useful: LoadSource and GetHRefs. The first of these will grab a document given a URL. The second will extract every HRef attribute from the anchor tags on that page. Perfect – except that it would have been nice if the HTML Container class had included NoIndex and NoFollow properties to save us the trouble of checking the meta-tags ourselves.
Extending the HTML container class

Well, of course there is an easy solution – inheritance. We will create a new HTML Container class that inherits from the original, while adding two properties: NoIndex, True if a NoIndex meta-tag is present, and NoFollow, True if a NoFollow meta-tag is present. Recalling the HTML Container class again, we find the LoadStatus event, which is raised with a Description of either “Complete” or “Error” when the page has finished loading or failed to load. This sounds like a good place to check the document’s meta-tags. We will write a private routine, SetRobotsFlags, which will set module-level Boolean variables corresponding to the NoIndex and NoFollow properties, depending on the presence or absence of their corresponding meta-tags. When a page has been loaded successfully, this routine will be called to set the flags. In the event of an error, we will set both of these flags to True.
Defining the classes and events

To create a new class based on an existing class, we declare the class and then use the Inherits keyword in VB.NET, or, in C#, suffix the class name with a colon followed by the base class. The modified HTML Container will be a stand-alone class – but contained in the same namespace as the WebCrawler class:
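In C# the declaration might look like the sketch below (in VB.NET the line `public class RobotPage : HTMLPage` would instead read `Inherits HTMLPage` inside the class). The stub base class stands in for the real container in WebWagon.dll, and the names RobotPage and HTMLPage are assumptions for illustration:

```csharp
using System;

// Stub standing in for the original HTML Container; the real class
// lives in WebWagon.dll and actually loads and parses pages.
public class HTMLPage
{
    public string Source = "";
}

// The derived container: the colon after the class name names the
// base class.
public class RobotPage : HTMLPage
{
    private bool noIndex;
    private bool noFollow;

    public bool NoIndex  { get { return noIndex; } }
    public bool NoFollow { get { return noFollow; } }

    // Corresponds to the private SetRobotsFlags routine described
    // above: on a load error both flags are forced True; otherwise
    // the flags reflect the page source. The keyword scan here is
    // deliberately naive -- see the meta-tag sketch earlier.
    public void SetRobotsFlags(bool loadFailed)
    {
        if (loadFailed)
        {
            noIndex = true;
            noFollow = true;
            return;
        }
        string lower = Source.ToLower();
        noIndex = lower.IndexOf("noindex") >= 0;
        noFollow = lower.IndexOf("nofollow") >= 0;
    }
}
```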
See full detail: http://www.vsj.co.uk/articles/display.asp?id=402