Mobile Maps
Home - Download
Developers - News
Services - Contacts

Mobilemaps Nearby-Engine: Spider README

(C)2001 High Country Software Ltd.

http://www.mobilemaps.com

Installation instructions for the Mobilemaps Spider can be found in the INSTALL text file in this directory.

Using the Spider

The Mobile Maps Spatial Spider is an intelligent system that provides a postal address or geographic grid reference for organisations on the Web.

Once left to its own devices, the spider can continue to scan the Web indefinitely. The rich output data can be used to:

a) Automatically build 'Yellow Pages' style directories.

b) Convert existing URL-only Web directories into Location-Based-Services (LBS) that allow map and address queries.

c) Maintain Yellow Pages directories by obtaining the most recent addresses of organisations without any labour cost.

d) Create advanced KISS (Keyword Indexed Spatial Search) services.

First Session

Type a seed URL such as http://www.mobilemaps.com. Select 5 spiders, and then press 'Start/Continue'. Wait as the program spiders the Web and builds up a database of addresses. Check progress with 'Addresses Found'.

After perhaps half an hour, 'Stop' the spiders and click 'File->Export Addresses'. Export all fields, and then open the 'export.txt' file in a text editor or spreadsheet such as Excel. Select 'Tab delimited' when importing the file.

You should see the addresses, URLs and other information extracted from the websites.
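If you would rather inspect the export with a script than a spreadsheet, the tab-delimited file can be read with a few lines of Python. This is only a sketch: the function name read_export and the sample row are invented for illustration, and the real column layout depends on which fields you chose to export.

```python
import csv
import io


def read_export(stream):
    """Return the non-empty rows of a tab-delimited export as lists of strings."""
    return [row for row in csv.reader(stream, delimiter="\t") if row]


# Hypothetical sample row (URL, address); real exports may hold other fields.
sample = "http://www.example.com\t1 High Street, Exampletown\n"
rows = read_export(io.StringIO(sample))

for row in rows:
    print(row)
```

To read a real export, pass an open file instead of the sample, for example open("export.txt", newline="").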

File Menu

-Import URL List- Allows you to import a newline-delimited list of URLs from a text file. If the database is fresh, these URLs are the first to be read by the spider. An example file might look like:

http://www.mobilemaps.com
http://www.gstart.com
http://www.yahoo.co.uk
http://www.mobilemaps.com/misc/contact.htm
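A list like the one above can be prepared from any collection of URLs. The following sketch, which is not part of the Spider, writes one URL per line, drops duplicates and skips entries that are not http or https URLs. The helper name format_url_list is hypothetical.

```python
from urllib.parse import urlparse


def format_url_list(urls):
    """Return newline-delimited text with one valid, unique URL per line."""
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        parts = urlparse(url)
        # Skip anything that is not an absolute http(s) URL.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url in seen:
            continue
        seen.add(url)
        lines.append(url)
    return "".join(line + "\n" for line in lines)


text = format_url_list([
    "http://www.mobilemaps.com",
    " http://www.yahoo.co.uk ",
    "not a url",
    "http://www.mobilemaps.com",
])
print(text, end="")
```

Save the returned text to a file and select it with 'Import URL List'.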

-Export Addresses- Once the spider has been operating for a period of time, the 'Addresses Found' label may show 1 or more addresses. You can stop the spider at any time and select 'Export Addresses' to view the results in a tab-delimited text file. This file can be imported into any text editor, spreadsheet or database.

You can choose which fields to export. Note: some fields can be large, e.g. Body Text.

Once you have exported, you can choose to continue running the spider from where it was stopped.

-Clear Database- Removes any existing URLs, addresses, or other fields in the Spatial Spider database. This should only be done when you are preparing for a new spidering project and wish to start from scratch.

The database is automatically saved if the machine is restarted or the application closed, so you can run multiple sessions from where the spiders left off. The ongoing storage is only reset by selecting 'Clear Database'.

-Storage Options- Lets you restrict the fields that the spider will store in the database. This is particularly important if the spider is to run for an extended period of time, as some fields can consume heavy disk resources. 'Body Text' is an example of a wide text field (500 chars) that can be switched off.

This option is only available when the database has been cleared of old addresses and URLs. It should be set before starting a spider session, and cannot be changed during a session.

-Exit- Will exit the program, send a request to stop the existing spiders, and save the database of URLs and addresses.

You will notice that each spider takes a short period of time before disappearing. This is expected, because the spider must finish its current URL before stopping. If the content at that URL is large, it can take an extended period before stopping.

Options

-Seed URL- If no URL lists have been imported, this address is where the spider(s) first start.

-Spider Depth- Sets how many levels of each new site are read:

1 = Look at the first page of each new site.
2 = Look at the first page and the 2nd-level pages referenced off the first page.
3 = Look at the first page and the pages 2 and 3 levels deep, and so on.
No limit = Continue to spider the whole site.

Hint: There is a law of diminishing returns. Even if the whole site is spidered, the spider may still find only one address that represents that site. 'No limit' provides higher accuracy, but takes longer.
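The depth setting can be pictured as a breadth-first walk that stops after a given number of levels. The sketch below is an illustration only, not the Spider's own code. It works on an invented in-memory link table (site) rather than the live Web, and pages_within_depth is a hypothetical name.

```python
from collections import deque


def pages_within_depth(links, start, depth=None):
    """List pages reachable from start within depth levels (None = no limit)."""
    seen = {start}
    order = [start]
    queue = deque([(start, 1)])  # The first page counts as level 1.
    while queue:
        page, level = queue.popleft()
        if depth is not None and level >= depth:
            continue  # Do not follow links from pages at the depth limit.
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, level + 1))
    return order


# Hypothetical site: each page maps to the pages it links to.
site = {
    "/": ["/about", "/contact"],
    "/about": ["/team"],
    "/team": ["/team/office"],
}

print(pages_within_depth(site, "/", depth=1))
print(pages_within_depth(site, "/", depth=2))
print(pages_within_depth(site, "/", depth=None))
```

With this table, depth 1 visits one page, depth 2 visits three, and 'No limit' visits all five, which is why deeper settings take longer.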

-Number of Spiders- Each spider will start a small application that runs in a minimized state. The spiders continue in parallel and head off to different web-sites at the same time. The more spiders you start, the more local PC resources the spiders will consume, but generally the faster they will read URLs.

Warning: If you are starting from a single seed URL, you should not start too many spiders at once since they will collectively request information from the target website over and over again. System administrators might decide to block your spider if it is slowing down their site.

-Spider Remote Sites- This option lets you limit the spiders to the domain(s) that you have specified. This provides an efficient means to spider just one site completely without jumping elsewhere on the Web.
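One way to picture this restriction is a host check applied to every link before it is followed. The exact rule the Spider uses is not documented here, so the sketch below assumes that a URL is allowed when its host equals a specified domain or is a subdomain of it. The name is_allowed_host is hypothetical.

```python
from urllib.parse import urlparse


def is_allowed_host(url, allowed_domains):
    """Return True if the URL's host is one of the domains or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


allowed = ["mobilemaps.com"]
print(is_allowed_host("http://www.mobilemaps.com/misc/contact.htm", allowed))
print(is_allowed_host("http://www.yahoo.co.uk", allowed))
```

Comparing whole host names rather than raw text keeps a look-alike host such as notmobilemaps.com outside the allowed set.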

-Domain Filter- By entering a domain filter, e.g. "uk", you can restrict the URLs that are read to those that include the text "uk" somewhere in the URL itself. E.g. http://www.yahoo.co.uk would be a valid address to read. This is particularly useful to limit the addresses to a particular country. If you have a list of URLs it can also be used to speed up the detection.
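Because the filter matches text anywhere in the URL, it behaves like a plain substring test. The sketch below shows that behaviour. The function name passes_domain_filter is hypothetical, and ignoring upper and lower case is an assumption rather than documented behaviour.

```python
def passes_domain_filter(url, text):
    """Return True if the filter text appears anywhere in the URL."""
    return text.lower() in url.lower()


print(passes_domain_filter("http://www.yahoo.co.uk", "uk"))
print(passes_domain_filter("http://www.mobilemaps.com", "uk"))
# A substring test also matches URLs where the text is incidental.
print(passes_domain_filter("http://www.example.com/duke", "uk"))
```

As the last line suggests, a short filter such as "uk" can let through unrelated URLs, so a more specific filter such as ".co.uk" may give cleaner results.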

-Starting the Spider- Click 'Start/Continue'. One or more spiders should appear minimized on the task-bar. You can check progress at the 'URLs Stored' label, or by maximizing the spiders themselves to see which URLs they are currently reading.

Starting the spider continues from where you last stopped it. You cannot modify options once you have started a spider. You must press 'Stop', modify the options, and then press 'Start/Continue' again.

-Stopping the Spider- This will stop or pause the spider and let you configure any options before restarting again. You will notice that each spider takes a short period of time before disappearing. This is expected, because the spider must finish its current URL before stopping. If the content at that URL is large, it can take an extended period before stopping.

You should wait for all of the spiders to stop before restarting again.

-Rewind the Spider- This will send all spiders back to the start of the list of URLs. It is used to rescan the same URLs and check for any additional pages or changes in addresses.

On the Web, servers can be out of operation for certain periods or network traffic may be too high and URLs are not read. On a second pass, the spider may find the addresses missed on the first pass.

-Watching progress- There are several continually updating labels that keep you informed of progress.

-URLs Read- The number of URLs that have actually been processed by the Spider.

-URLs Stored- The number of URLs that have been detected off web-pages or imported. Note: this number can actually decrease as a broken link will eventually be removed from the list.

-Addresses Found- A count of the URLs that include addresses. You should watch this carefully, as this represents the size of your 'Yellow Pages' database.

Using Your Spiders Responsibly

Be careful starting a large number of spiders from one 'Seed URL'.

Warning: If you are starting from a single seed URL, the spiders will collectively request information from the target website over and over again. System administrators might decide to block your spider if it is slowing down their site.

Suggestion: Only start one or two spiders if you have a 'Seed URL'. Once the 'URLs Stored' value is past a reasonable number (e.g. 50), you can stop the spider and restart 5-10 spiders.
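The suggestion above amounts to a simple rule of thumb, written out below. The function name suggested_spider_range is hypothetical, and the threshold of 50 is only the example figure from the text, not a limit enforced by the program.

```python
def suggested_spider_range(urls_stored, threshold=50):
    """Return (min, max) spiders to run for the current 'URLs Stored' value."""
    if urls_stored > threshold:
        return (5, 10)  # Enough distinct URLs to spread the load.
    return (1, 2)  # Still close to the seed URL, so keep the load light.


print(suggested_spider_range(1))
print(suggested_spider_range(200))
```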

Remember, the system administrators of the sites you are spidering can obtain an e-mail address and an IP address, and can trace back to anyone abusing their system.



webmaster@mobilemaps.com
Mobilemaps.com is Copyright 2003 High Country Software Ltd.