Class reference: Class PHPCrawler

Constructor
PHPCrawler() Initializes a new crawler.
Basic methods
setURL() Sets the root-page to crawl
setPort() Sets the port to connect to
go() Starts the crawling-process
getReport() Returns an array with report-information after the crawling-process has finished
Overrideable methods
handlePageData() Overridable method to access and handle the information and content of found pages and files
Follow-options
setFollowMode() Sets the general follow mode (which links to follow)
addFollowMatch() Adds a regular expression (PCRE) to the list of rules that decide which links should be followed explicitly.
addNonFollowMatch() Adds a regular expression (PCRE) to the list of rules that decide which links should be ignored.
setFollowRedirects() Decides if the crawler should follow redirects sent in headers.
setFollowRedirectsTillContent() Decides if the crawler should follow redirects until first content was found, regardless of the follow-mode.
addLinkPriority() Adds a regular expression together with a priority-level to the list of rules that decide which of the found links should be preferred.
obeyRobotsTxt() Decides if the crawler should obey robots.txt-files.
Receive-options
addReceiveContentType() Adds a regular expression (PCRE) to the list of rules that decide which pages or files should be received.
setTmpFile() Sets the temporary file to use for receiving data.
addReceiveToMemoryMatch() Adds an expression to the list of rules that decide which kind of content should be received directly into local memory.
addReceiveToTmpFileMatch() Adds an expression to the list of rules that decide which kind of content should be received into a temporary file.
Limiter-options
setPageLimit() Sets the limit of pages/files the crawler should crawl altogether.
setTrafficLimit() Sets the limit of bytes the crawler should receive altogether.
setContentSizeLimit() Sets the content-size-limit for content the crawler should receive.
Linkextraction-options
setAggressiveLinkExtraction() Enables or disables aggressive link-extraction.
addLinkExtractionTags() Adds tags to the list of tags from which links should be extracted.
Timeout-options
setConnectionTimeout() Sets the timeout for the connection (request) to the server(s).
setStreamTimeout() Sets the timeout for streams when the crawler is receiving content.
Miscellaneous options
setCookieHandling() Turns on/off the cookie-handling of the crawler.
addBasicAuthentication() Adds a URL filter-expression together with an authentication (username/password) to the list of authentications to send.
disableExtendedLinkInfo() Disables the storage of extended link-information of found links.
setUserAgentString() Sets the "User-Agent"-string that will be sent with request headers.

Details

new PHPCrawler()

Initializes a new instance of the crawler.

Important: You shouldn't create an instance of this class directly! Instead, extend the class and override the handlePageData()-method, as shown in the sketch below.
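
A minimal sketch of this pattern might look like the following (assuming the PHPCrawl class-files are already included; the class-name "MyCrawler" and the output are just illustrative):

class MyCrawler extends PHPCrawler
{
  function handlePageData($page_data)
  {
    // Work with the information about the received page here,
    // see the description of handlePageData() below for all available array-elements.
    echo "Received: ".$page_data["url"]."\n";
    return 1; // a negative return-value would abort the crawling-process (since 0.7)
  }
}

$crawler = new MyCrawler();
$crawler->setURL("www.foo.com");
$crawler->addReceiveContentType("/text\/html/");
$crawler->go();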

bool setURL (string url)

Sets the URL of the first page the crawler should crawl (root-page).
It can be the root of a domain (e.g. www.foo.com) or the path to a specific page or folder (e.g. www.foo.com/bar/something.html).

Note: This URL has to be set before calling the go()-method (of course)! If this root-page doesn't contain any further links, the crawling-process will stop.

bool setPort (int port)

(since version 0.7)

Sets the port of the hosting server to connect to for receiving the page/file given in setURL().
The default port is 80.

Note:
$crawler->setURL("http://www.foo.com");
$crawler->setPort(443);

...has the same effect as

$crawler->setURL("http://www.foo.com:443");

void go ()

Starts the crawling process.

Note: Before calling the go()-method you should at least use the addReceiveContentType()-method to let the crawler receive "text/html"-pages, otherwise the crawler can't find any links. Also be sure that you have overridden the handlePageData()-method before calling the go()-method, otherwise the crawler will start the process but nothing will happen to the data it finds!

array getReport ()

After the crawling-process has finished, this method returns an array with information about the process. The following table lists the elements the array will contain.

Key Type Value
links_followed int The number of links/URLs the crawler found and followed.
files_received int The number of pages/files the crawler received.
bytes_received int The number of bytes the crawler received altogether.
process_runtime float The time the crawling-process was running, in seconds.
(since version 0.7)
data_throughput int The average data-throughput in bytes per second.
(since version 0.7)
traffic_limit_reached bool Will be TRUE if the crawling-process stopped because the traffic-limit was reached.
(See method setTrafficLimit())
file_limit_reached bool Will be TRUE if the page/file-limit was reached.
(See method setPageLimit())
user_abort bool Will be TRUE if the crawling-process stopped because the overridable method handlePageData() returned a negative value.
(since version 0.7)
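
For example, after the go()-method has returned, some of these values could be printed like this (the output labels are just illustrative):

$crawler->go();
$report = $crawler->getReport();

echo "Links followed: ".$report["links_followed"]."\n";
echo "Files received: ".$report["files_received"]."\n";
echo "Bytes received: ".$report["bytes_received"]."\n";
echo "Runtime: ".$report["process_runtime"]." seconds\n";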

bool setFollowMode (int mode)

This method sets the general follow-mode of the crawler.
The following table lists and explains the supported follow-modes.

mode explanation
0 The crawler will follow EVERY link, even if the link leads to a different host or domain. If you choose this mode, you really should set a limit for the crawling-process (see limiter-options), otherwise the crawler may crawl the whole WWW!
1 The crawler will follow links that lead to the same host AND to hosts with the same domain as the one in the root-URL.
E.g. if the root-URL (setURL()) is "http://www.foo.com", the crawler will follow links to "http://www.foo.com/..." and "http://bar.foo.com/...", but not to "http://www.another-domain.com/...".
2 The crawler will only follow links that lead to the same host as the one in the root-URL.
E.g. if the root-URL (setURL()) is "http://www.foo.com", the crawler will ONLY follow links to "http://www.foo.com/...", but not to "http://bar.foo.com/..." or "http://www.another-domain.com/...".
This is the default mode.
3 The crawler only follows links to pages or files that are in or below the same path as that of the root-URL.
E.g. if the root-URL is "http://www.foo.com/bar/index.html", the crawler will follow links to "http://www.foo.com/bar/page.html" and "http://www.foo.com/bar/path/index.html", but not links to "http://www.foo.com/page.html".
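
For example, to make the crawler stay on the host of the root-URL (which is the default behaviour anyway), mode 2 would be set like this:

$crawler->setURL("http://www.foo.com");
$crawler->setFollowMode(2); // only follow links leading to www.foo.com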

int handlePageData (Array page_data)

By overriding this method you get access to all information about the pages and files the crawler found and followed. This method receives the information about the currently requested page/file through the array $page_data. The tables below list the elements the array will contain.

Since version 0.7, the whole crawling-process will stop immediately if this method returns a negative value (e.g. -1).

Information about the current URL
Key Type Value
url string The complete URL of the currently requested page or file, e.g. "http://www.foo.com/bar/page.html?x=y".
protocol string The protocol-part of the URL of the requested page or file; currently it will always be "http://".
host string The host-part of the URL of the requested page or file, e.g. "www.foo.com".
path string The path in the URL of the requested page or file, e.g. "/page/".
file string The name of the requested page or file, e.g. "page.html".
query string The query-part of the URL of the requested page or file, e.g. "?x=y".
port int The port of the URL the request was sent to, e.g. 80.


Information about the header and the content of the current URL
received boolean TRUE if the crawler received at least some source/content of this page or file and will follow the links it found in the source.
See also addReceiveContentType() and setContentSizeLimit().
received_completely
received_completly
boolean TRUE if the crawler received the COMPLETE source/content of this page or file.
See also setContentSizeLimit().
bytes_received int The number of bytes the crawler received of the content of this page or file.
header string The complete header the webserver sent with this page or file.
header_send string The complete header the crawler sent to the server (debugging).
(since version 0.7)
http_status_code int The HTTP-statuscode the webserver sent for the request, e.g. 200 (OK) or 404 (file not found).
content_type string The content-type of the page or file, e.g. "text/html" or "image/gif".
(since version 0.7)
received_to_memory boolean Will be TRUE if the content was received into local memory.
You will have access to the content of the current page or file through $page_data["source"]. (since version 0.7)
received_to_file boolean Will be TRUE if the content was received into a temporary file.
The content is stored in the temporary file $page_data["content_tmp_file"] in this case. (since version 0.7)
source string The html-sourcecode of the page or the content of the file that was currently requested and received.
It will be empty if "received" is FALSE, and the source won't be complete if "received_completly" is FALSE!
It also will be empty if the content wasn't received into memory.
content string A reference to the element "source" (see above).
(since version 0.7)
content_tmp_file string The temporary file to which the content was received.
Will be empty if the content wasn't received to the temporary file.
(since version 0.7)


Referer information
referer_url string The complete URL of the page that contained the link to the currently requested page or file.
refering_linkcode string The html-sourcecode that contained the link to the current page or file.
(E.g. <a href="../foo.html">LINKTEXT</a>)
Note: Will NOT be available if disableExtendedLinkInfo() was set to TRUE.
(since version 0.7)
refering_link_raw string Contains the raw link as it was found in the content of the referring URL.
(E.g. "../foo.html")
Note: Will NOT be available if disableExtendedLinkInfo() was set to TRUE.
(since version 0.7)
refering_linktext string The linktext of the link that "linked" to the current page or file.
E.g. if the referring link was <a href="../foo.html">LINKTEXT</a>, the referring linktext is "LINKTEXT".
May of course contain html-tags.
Note: Will NOT be available if disableExtendedLinkInfo() was set to TRUE.
(since version 0.7)


Information about found links in the current page
links_found Array A numeric array with information about every link that was found in the current URL. Every element of that numeric array in turn contains the following elements:

link_raw - contains the raw link as it was found.
url_rebuild - contains the fully qualified URL the link leads to.
linkcode - the html-codepart that contained the link.
linktext - the linktext the link was laid over (may be empty).

So, e.g., $page_data["links_found"][5]["link_raw"] contains the raw link at index 5 of the links found in the current page (may be something like "../../foo.html").

(since version 0.7)


Error information
error_code int An errorcode representing a socket-/connection-error that occurred when trying to open the current page or file, or any other error that occurred.
It will be FALSE if no error occurred.
error_string string A string-description of a socket-/connection-error that occurred when trying to open the current page or file, or any other error that occurred.
It will be empty if no error occurred.
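
A sketch of how an overridden handlePageData()-method might evaluate some of these elements (the checks and the output are purely illustrative):

function handlePageData($page_data)
{
  // URL and HTTP-statuscode of the requested page or file
  echo $page_data["url"]." (".$page_data["http_status_code"].")\n";

  // Print an error-description if something went wrong
  if ($page_data["error_string"] != "")
    echo "Error: ".$page_data["error_string"]."\n";

  // Walk through all links that were found in the page
  for ($i = 0; $i < count($page_data["links_found"]); $i++)
    echo "Link found: ".$page_data["links_found"][$i]["url_rebuild"]."\n";

  // A negative return-value would abort the whole crawling-process (since 0.7)
  return 1;
}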

bool addFollowMatch (string expression)

Adds a perl-compatible regular expression (PCRE) to the list of rules that decide which URLs found on a page should be followed explicitly.
If no expression was added to this list, the crawler won't filter any URLs; every URL will be followed (except the ones "excluded" by other options, of course).

This method returns TRUE if a valid preg-pattern is given as argument and was successfully added, otherwise it returns FALSE.

Example:
$crawler->addFollowMatch("/\.(html|htm)$/i")

This rule lets the crawler ONLY follow URLs/links that end with ".html" or ".htm".

Note: To make sure the crawler only receives files of certain filetypes, you should use the addReceiveContentType()-method.

bool addNonFollowMatch (string expression)

Adds a perl-compatible regular expression (PCRE) to the list of rules that decide which URLs found on a page should be ignored by the crawler.

This method returns TRUE if a valid preg-pattern is given as argument and was successfully added, otherwise it returns FALSE.

Example:
$crawler->addNonFollowMatch("/\.(jpg|jpeg|gif|png|bmp)$/i")

This rule lets the crawler completely ignore all found URLs that end with ".jpg", ".gif", ".png" and so on. None of the matching URLs will be followed or received.

Note: To make sure the crawler only receives files of certain filetypes, you should use the addReceiveContentType()-method. Use the addNonFollowMatch()-method just to "pre-filter" the URLs and reduce the number of requests the crawler has to send.

bool setFollowRedirects (bool mode)

This method decides if the crawler should follow redirects sent in headers by a webserver or not.
The default-value is TRUE.

bool setFollowRedirectsTillContent (bool mode)

This method decides if the crawler should follow redirects until first content was found, regardless of the follow-mode (method setFollowMode()).

Sometimes when requesting a URL, the first thing the webserver does is send a redirect to another location, and sometimes the server at this new location sends a redirect again, and so on. So in the end it's possible that you find the expected content on a totally different host.
If you set this option to TRUE, the crawler will follow all these redirects until it finds some content. If content finally was found at a URL, the root-URL of the crawling-process will be set to this URL and all follow-mode-options will relate to it from then on.

The default-value is TRUE.

bool addLinkPriority (string expression, int priority_level)

(since version 0.7)

Adds a regular expression together with a priority-level to the list of rules that decide which of the found links should be preferred (requested next).
Links/URLs that match an expression with a high priority-level will be followed before links with a lower level. All links that don't match any of the given expressions will automatically get level 0 (the lowest level). The level can be any positive integer, but try to avoid very high numbers (like 10000, for performance reasons).

This method returns TRUE if a valid preg-pattern is given as argument and was successfully added, otherwise it returns FALSE.

Example:
$crawler->addLinkPriority("/forum/", 10);
$crawler->addLinkPriority("/\.gif/", 5);

This lets the crawler follow links that contain the string "forum" before links that contain ".gif", and those before all other found links.

bool obeyRobotsTxt (bool mode)

(since version 0.7)

Decides if the crawler should parse and obey robots.txt-files.

If this is set to TRUE, the crawler looks for a robots.txt-file for every host that pages or files should be received from during the crawling-process.
If a robots.txt-file was found for a host, the directives it contains that apply to the useragent-identification of the crawler ("PHPCrawl" or whatever was set by calling setUserAgentString()) will be obeyed.
The default-value is FALSE.

Please note that the directives found in a robots.txt-file have a higher priority than other settings made by the user.
If, e.g., addFollowMatch("#http://foo\.com/path/file\.html#") was set, but a directive in the robots.txt-file of the host foo.com says "Disallow: /path/", the URL http://foo.com/path/file.html will be ignored by the crawler anyway.

Also note that currently only "Disallow"-directives of robots.txt-files will be interpreted.
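
For example, obeying robots.txt-directives in combination with a custom useragent-identification could look like this (the identification-string is just a placeholder):

$crawler->setUserAgentString("MySiteIndexer");
$crawler->obeyRobotsTxt(true);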

bool addReceiveContentType (string expression)

Adds a perl-compatible regular expression (PCRE) to the list of rules that decide which pages or files - regarding their content-type - should be received.

IMPORTANT: By default, if no expression was added to the list, the crawler receives every content.

This method returns TRUE if a valid preg-pattern is given as argument and was successfully added, otherwise it returns FALSE.

Example:
$crawler->addReceiveContentType("/text\/html/")

This rule lets the crawler completely receive the content/source of pages with the content-type "text/html". Other pages or files with different content-types (e.g. "image/gif") won't be received (if this is the only rule added to the list).

Note: To reduce the traffic the crawler causes, you should only add content-types of pages/files you really want to receive. You should at least add the content-type "text/html" to this list, otherwise the crawler can't find any links.

bool setTmpFile (string path_to_tmpfile)

(since version 0.7)

Sets the temporary file to use when content of found pages/files should be streamed directly into a tmp-file (see also addReceiveToTmpFileMatch()).
By default, a temporary file with a unique filename will be created and used in the path your script is running from.

If the given file could be created, this function returns TRUE, otherwise it returns FALSE.
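
Example (the path is just illustrative):

$crawler->setTmpFile("/tmp/phpcrawl_content.tmp");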

bool addReceiveToMemoryMatch (string expression)

(since version 0.7)

Adds an expression to the list of rules that decide which content of found pages or files should be streamed directly into local memory.
If the content-type of a found page/file matches one of these expressions and the content was received, the content is accessible directly through the array-element $page_data["source"] in the overridable method handlePageData().

IMPORTANT: By default, all files that should be received will be streamed into memory (for compatibility reasons)! As soon as an expression is added, this list takes effect.
The settings made here don't affect the link-finding-results in any way.

This method returns TRUE if a valid preg-pattern is given as argument and was successfully added, otherwise it returns FALSE.

Examples:

$crawler->addReceiveToMemoryMatch ("/.*/");
This is the default setting, everything will be streamed to memory.

$crawler->addReceiveToMemoryMatch ("/text\/html/");
Only files with content-type "text/html" will be streamed to memory.

$crawler->addReceiveToMemoryMatch("/^((?!image).)*$/");
Everything except images (e.g. "image/gif") will be streamed to memory.

$crawler->addReceiveToMemoryMatch ("/(?!)/");
Nothing will be streamed to memory.

If you configure the crawler to receive big files (addReceiveContentType(), setContentSizeLimit() etc.), you should make sure that these files will be streamed to a tmp-file, NOT into memory.

bool addReceiveToTmpFileMatch (string expression)

(since version 0.7)

Adds an expression to the list of rules that decide which content of found pages or files should be streamed into a temporary file.
If the content-type of a found page/file matches one of these expressions and the content was received, the content will be stored in the temporary file $page_data["content_tmp_file"].
You can set this temporary file manually with the function setTmpFile().

This method returns TRUE if a valid preg-pattern is given as argument and was successfully added, otherwise it returns FALSE.

The settings made here don't affect the link-finding-results in any way.

Examples:

$crawler->addReceiveToTmpFileMatch("/.*/");
Everything will be streamed to the tmp-file.

$crawler->addReceiveToTmpFileMatch("/image/");
Files with content-type "image/jpeg" or "image/gif", e.g., will be streamed to the tmp-file.

If you configure the crawler to receive big files (addReceiveContentType(), setContentSizeLimit() etc.), you should make sure that these files will be streamed to a tmp-file, NOT into memory.

bool setPageLimit (int limit, [bool count_mode=true])

Sets the limit of pages/files the crawler should crawl. If the limit is reached, the crawler stops the crawling-process. The default-value is 0 (no limit).

The count-mode decides which pages/files should be counted.
TRUE means that only those pages/files will be counted that the crawler actually received (see also the addReceiveContentType()-method).
FALSE means that ALL followed pages/files will be counted, even if their content wasn't received.
The default-value for count_mode is TRUE.
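
For example, to stop the process after 500 followed URLs, no matter if their content was received or not (the number is just illustrative):

$crawler->setPageLimit(500, false);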

bool setTrafficLimit (int bytes, [bool complete_requested_files=true])

Sets the limit of bytes the crawler should crawl and receive (all in all). If the limit is reached, the crawler stops the crawling-process.
The default-value is 0 (no limit).

The flag complete_requested_files decides if already requested files and pages should be received completely, even if the traffic-limit is reached meanwhile.
If this is set to TRUE, the crawler finishes receiving these requested files and then stops crawling.
If this is set to FALSE, the process will stop exactly when the traffic-limit is reached, even if a requested file or page wasn't received completely.

Note: Crawling a complete, huge website can cause A LOT OF TRAFFIC, especially if the crawler follows and receives all kinds of data (binary files, pictures etc., see the addReceiveContentType()- and addNonFollowMatch()-methods).
So it is recommended that you set a traffic-limit!
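
For example, a traffic-limit of 10 MB could be set like this (already requested files will still be received completely, since the second parameter defaults to TRUE):

$crawler->setTrafficLimit(10 * 1024 * 1024);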

bool setContentSizeLimit (int bytes)

Sets the content-size-limit per page/file in bytes for content the crawler should receive. If the crawler is receiving the content of a page or file (see addReceiveContentType()) and the content-size-limit is reached, the crawler stops receiving content from this page or file.
The default-value is 0 (no limit).
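
For example, to stop receiving the content of any single page or file after 500 KB (the value is just illustrative):

$crawler->setContentSizeLimit(500 * 1024);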

bool setAggressiveLinkExtraction (bool mode)

(since version 0.7)

Enables or disables aggressive link-extraction.
If this is set to FALSE, the crawler tries to find links only inside html-tags (< and >).
If this is set to TRUE, the crawler tries to find links everywhere in an html-page (script-parts, content-text etc.).
The default value is TRUE.

Note: If aggressive link-extraction is enabled, it can happen that the crawler finds links that are not meant as links, and it can also happen that it finds links in script-parts of pages that can't be rebuilt correctly, since there is no javascript-parser/interpreter implemented. (E.g. javascript-code like document.location.href= a_var + ".html").

Disabling aggressive link-extraction also results in better crawling-performance.
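
For example, to restrict link-extraction to html-tags only (and speed up the process):

$crawler->setAggressiveLinkExtraction(false);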

bool addLinkExtractionTags (string tag1, [string tag2, [string ...]])

(since version 0.7)

Adds html-tags to the list of tags from which links should be extracted (case-insensitive).
By default the crawler extracts links from the following html-tags: href, src, url, location, codebase, background, data, profile, action and open.
As soon as a tag is added to the list manually, the default list will be overwritten completely.

Example:
$crawler->addLinkExtractionTags("href", "src");

This setting lets the crawler extract links (only) from "href" and "src"-tags.

Note: Reducing the number of tags in this list will improve the crawling-performance.

bool setConnectionTimeout (double timeout)

Sets the timeout in seconds for the request of a page or file (connection to the server).
The default-value is 10 seconds.

Note: Currently this timeout doesn't take effect in some environments.

bool setStreamTimeout (double timeout)

Sets the timeout in seconds for reading data (content) of a page or file. If the connection to a server was established but the server doesn't send any more data without closing the connection, the crawler will wait for the time given in timeout and then close the connection.

The default-value is 2 seconds.
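
For example, both timeouts could be adjusted like this (the values are just illustrative):

$crawler->setConnectionTimeout(15);
$crawler->setStreamTimeout(5);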

bool setCookieHandling (bool mode)

Enables or disables cookie-handling.
If cookie-handling is set to TRUE, the crawler will handle all cookies sent by the webserver just like a common browser does (or at least almost).
The default-value is TRUE.

Note: It is strongly recommended that you enable cookie-handling. Otherwise it can happen that certain single pages will be crawled, followed and received over and over again (a kind of endless loop). This can happen, e.g., if the webserver sends a session-cookie and puts the session-ID at the end of some or all links. Without cookie-handling, the server then generates a new session-ID and sends a new session-cookie on every request of the crawler, because the crawler didn't store and doesn't "send back" the session-data. A trap!

bool addBasicAuthentication (string expr, string username, string passwd)

(since version 0.7)

Adds an authentication (username and password) to the list of basic authentications that will be sent with requests for certain pages and files (password-protected content). The expression (PCRE) specifies for which URLs the given authentication should be sent.

Example:
$crawler->addBasicAuthentication("#http://www.foo.com/protected_path/#", "myusername", "mypasswd");

This lets the crawler send the authentication "myusername/mypasswd" with every request for content located in the path "protected_path" on the host "www.foo.com".

bool disableExtendedLinkInfo (bool mode)

(since version 0.7)

Disables the storage of extended link-information for found links, such as the html-linkcodes, the linktexts etc. Disabling this caching reduces the memory-usage of the crawler by more than 50% in some cases, but the extended information will not be passed to the overridable user-method handlePageData() anymore. So if you don't need this information, you should set this option to TRUE.
The default value is FALSE.

bool setUserAgentString (string user_agent)

(since version 0.7)

Sets the "User-Agent" identification-string in the header that will be send with HTTP-requests.
The default-value is "PHPCrawl".

Note:
It is REALLY recommended to identify yourself and/or your application with this setting, e.g.
"XYZSiteIndexer (myemail@address.net)", so that webmasters and administrators have a chance to track who is spidering their websites and who is getting content from their servers.
Please stay fair!
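
Following the recommendation above, the identification could be set like this (the string is just illustrative):

$crawler->setUserAgentString("XYZSiteIndexer (myemail@address.net)");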