Extracted function comments Mon Aug 22 00:40:04 2005

=item AdminVersion

=cut

=item Append

=cut

=item Assert

Usage:

    ##&Assert( conditional expression );

Assert is a useful debugging tool. Its one argument is a conditional that should be true in every possible case, as long as you've written your code correctly. If the argument turns out to be false at runtime, then Assert will print an error message in very large, bold letters.

Often used to audit function input and output values.

Possibly these Assert calls should be stripped or disabled in public releases.

=cut

=item Authenticate

=cut

=item BuildIndex

Usage:

    &BuildIndex();

BuildIndex completely rebuilds the index for a local realm. Because the webpages in local realms are readily accessible, this function tends to process huge data sets quickly. It is self-restartable through a meta-refresh; state information is stored in the $start_pos parameter and working data is stored either in the database or the index_file.working_copy file.

For file-based indexes, all new data is written to index_file.working_copy. When the process is finished, possibly after several browser requests, the original index_file is deleted and index_file.working_copy is renamed over the top of it. Thus, users are able to perform searches on the intact index_file while the BuildIndex process is in progress. In addition, it is possible to safely abandon the BuildIndex process.

For SQL-based indexes, we don't have that concept of a temporary storage area. Instead, each record is updated as the webpage is encountered. At the end of the BuildIndex process, if we get there, we delete all records whose lastindex time is older than "start_time". The only records older than "start_time" are those that were not detected by GetFilesByDirEx, or that were excluded for other reasons.

This is an interactive function; errors and other status messages are shown to the user by printing HTML.
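The file-based working-copy scheme described above can be sketched roughly as follows. This is only an illustration of the pattern, not the BuildIndex code itself: the helper name rebuild_with_working_copy, its arguments, and the one-record-per-line layout are all assumptions made for the example. It returns an error string (empty on success), mirroring the $err convention used elsewhere in these comments.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the working-copy pattern.
# $index_file is the live index; $records is an array reference of
# already-formatted record strings (an assumed layout, one per line).
sub rebuild_with_working_copy {
    my ($index_file, $records) = @_;
    my $working_copy = "$index_file.working_copy";

    # All new data goes to the working copy; the live index is not
    # touched, so searches can keep reading it in the meantime.
    open(my $out, '>', $working_copy)
        or return "cannot open $working_copy: $!";
    foreach my $record (@$records) {
        print {$out} "$record\n"
            or return "cannot write $working_copy: $!";
    }
    close($out) or return "cannot close $working_copy: $!";

    # Only once the working copy is complete: delete the original
    # and rename the working copy over the top of it.
    if (-e $index_file) {
        unlink($index_file) or return "cannot delete $index_file: $!";
    }
    rename($working_copy, $index_file)
        or return "cannot rename $working_copy: $!";

    return '';
}
```

Because the live file is replaced only in the final step, abandoning the process part-way leaves at most a stale working copy behind, which is the property the description relies on.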
=cut

=item Cancel

=cut

=item Capitalize

Usage:

    my $cap_string = &Capitalize($string);

Capitalizes English-language strings.

=cut

=item CheckEmail

Usage:

    my $err = &CheckEmail( $address );
    if ($err) {
        print "Error: $err.\n";
    }

Checks whether the argument is a valid email address:

    address is not blank
    contains text @ text
    text following @ is a valid hostname (can be resolved)

Based on Ian Dobson's CheckEmail function.

=cut

=item Close

=cut

=item CompressStrip

Process the HTML text and various subfields like Title and Description.

=cut

=item Crawler_new

Usage:

    my %response = $crawler->webrequest(
        'page' => 'http://www.xav.com/scripts/',
        'limit' => 'http://www.xav.com/',
    );
    if ($response{'err'}) {
        print "Error: $response{'err'}\n";
        exit;
    }
    print "The HTML text of this web page is:\n\n";
    print $response{'text'};

=cut

=item DeleteFromPending

Usage:

    my ($err, $delcount) = &DeleteFromPending( $realm, \@urls );

=cut

=item FD_Rules_new

Initializes the object that manages system settings.

=cut

=item FlockEx

Usage:

    if (&FlockEx( $p_filehandle, 8 )) {
        # okay
    }

Abstraction layer to protect non-flock systems.

=cut

=item FormatDateTime

=cut

=item FormatNumber

Usage:

    my $num_str = &FormatNumber( $expression, $decimal_places, $include_leading_digit, $use_parens_for_negative, $group_digits, $euro_style );

Arguments

    $expression
        Required. Expression to be formatted.

    $decimal_places
        Optional. Numeric value indicating how many places to the right of the decimal are displayed. Note: truncates $expression to $decimal_places, does not round.

    $include_leading_digit
        Optional. Boolean that indicates whether or not a leading zero is displayed for fractional values.

    $use_parens_for_negative
        Optional. Boolean that indicates whether or not to place negative values within parentheses. Style is used for outbound formatting only; inbound parsing always uses "-" for negative values (Perl's internal format).

    $group_digits
        Optional. Boolean that indicates whether or not numbers are grouped using the comma.

    $euro_style
        Optional. If 1, then "." separates thousands and "," separates decimal, for example "800.234,24" instead of "800,234.24". Style is used for outbound formatting only; inbound parsing always uses "." for dec (Perl's internal format).

Prototyped to match Microsoft's FormatNumber function for vbscript/jscript, with the limitation of not knowing about default settings. Microsoft specification at http://msdn.microsoft.com/scripting/vbscript/doc/vsfctFormatNumber.htm or from http://msdn.microsoft.com/scripting/.
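To make the truncation and $euro_style rules concrete, here is a minimal sketch. It is not the FormatNumber implementation: format_number_sketch is a hypothetical name, it takes only four of the six arguments (the leading-digit and parentheses options are left out for brevity), and the string-based truncation is just one way to avoid rounding.

```perl
use strict;
use warnings;

# Hypothetical, simplified stand-in for illustration only.
sub format_number_sketch {
    my ($expression, $decimal_places, $group_digits, $euro_style) = @_;
    $decimal_places = 0 unless defined $decimal_places;

    # Non-numeric input is treated as 0.
    $expression = 0 unless defined $expression
        && $expression =~ /^-?(?:\d+\.?\d*|\.\d+)$/;

    # Inbound parsing uses "-" for negatives and "." for the decimal.
    my $negative = ($expression =~ s/^-//) ? 1 : 0;
    my ($int, $frac) = split /\./, $expression, 2;
    $int  = 0  if !defined $int || $int eq '';
    $frac = '' unless defined $frac;

    # Truncate (never round), padding with zeros to the requested width.
    $frac = substr($frac . ('0' x $decimal_places), 0, $decimal_places);

    # Outbound separators depend on the style flag.
    my ($thousands, $decimal) = $euro_style ? ('.', ',') : (',', '.');
    if ($group_digits) {
        1 while $int =~ s/^(\d+)(\d{3})/$1$thousands$2/;
    }

    my $out = $int;
    $out .= $decimal . $frac if $decimal_places > 0;
    $out = "-$out" if $negative;
    return $out;
}
```

Under these assumptions, format_number_sketch('800234.249', 2, 1, 1) yields "800.234,24" and the same call with $euro_style set to 0 yields "800,234.24".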
Error handling: if $expression is not numeric, it is treated as 0.

=cut

=item GetCrawlList

Usage:

    my @list = ();
    my $count = 0;
    my $age = $::FORM{'StartTime'};
    if ($::FORM{'DaysPast'}) {
        $age -= (86400 * $::FORM{'DaysPast'});
    }
    my $err = &GetCrawlList( $realm, $age, $max_list_size, \@list, \$count );

Retrieves a @list of all web pages in the '$realm' realm that are older than $age.

$count is the size that @list would be if no limits were imposed. @list will actually contain between 0 and $max_list_size elements. The max_list_size option is available to save memory.

=cut

=item GetFiles_new

Used to enumerate all files and folders in a certain directory. Designed to use very little memory.

Files are always returned in alphabetic order, which allows certain optimizations to be made.

Usage:

    my $fr = &fdse_filter_rules_new();
    my $gf = &GetFiles_new();
    $err = $gf->create_file_list(
        'base_dir' => $base_dir,
        'base_url' => $base_url,
        'fr' => \$fr,
        'tempfile' => "$file.temp",
        'no_older_than' => $num_seconds,
    );
    my $count = $gf->{'count'};
    $gf->resume_file_position( $start_pos );
    while (1) {
        my ($lastmodt, $size, $fullfile, $basefile, $url) = $gf->get_next_file();
    }
    $gf->quit(); # kills temp file

no_older_than is the number of seconds for the maximum tolerable age of the cache file. If the file exists and is older than this, then a new file will be created.

=cut

=item LoadRules

Usage:

    $err = &LoadRules();

Wrapper around the FD_Rules object and its own loadrules() method. Adds additional processing. Writes directly to the global %::Rules hash. Writes some derived data to %::const as well.

=cut

=item LockFile_get_read_access

Gets read access to the file. Handles the "create_if_needed" logic. Tries to restore a stale "working_copy" file if no copy of the original file exists.

=cut

=item LockFile_new

This package provides an object-oriented approach to file I/O, with support for file locking and standardized error handling.
Usage:

    my ($err, $obj, $p_rhandle, $p_whandle) = ();
    Err: {
        $obj = &LockFile_new(
            'create_if_needed' => 1,
        );
        ($err, $p_rhandle) = $obj->Read( $file );
        next Err if ($err);
        while ($_ = readline($$p_rhandle)) {
            print $_;
        }
        $err = $obj->Close();
        next Err if ($err);
        last Err;
    }
    continue {
        print "Error: $err.\n";
    }

=cut

=item Merge

=cut

=item ParseRobotFile

Usage:

    my @forbidden_paths = &ParseRobotFile( $RobotText, $my_user_agent );

Accepts the text of a robots.txt file, and the string name of the current HTTP user-agent. Parses through the file and returns an array of all forbidden paths that apply to the current user-agent.

=cut

=item PrintOrderedHash

Usage:

    my $err = &PrintOrderedHash( \%hash, $by_value, $ascii_sort, $ascending, $date_map );

=cut

=item PrintTemplate

Usage:

    &PrintTemplate( $b_return_as_string, 'tips.html', 'german', \%replace_values, \%visited, \%cache );

See "admin_help.html" for extensive documentation on this function, its limitations, its failure scenarios, etc.

=cut

=item RawTranslate

Usage:

    my $lc_ai_string = &RawTranslate($string);

Returns a lowercase, accent-stripped version of its input. Replaces HTML-encoded characters with their ASCII equivalents.

This function is called mainly by &CompressStrip; also by &LoadRules when preparing the code for ignore words.

See http://www.utoronto.ca/webdocs/HTMLdocs/NewHTML/iso_table.html

=cut

=item Read

=cut

=item ReadFile

Usage:

    my ($err, $text) = &ReadFile($file);
    if ($err) {
        print "Error: $err";
    }
    else {
        print "File '$file' contains:";
        print "$text";
    }

Easy-to-call file-reading function.

Calls the super-robust LockFile object under the hood, which is a relatively expensive call. This is done for operations which read data from the file system into memory, and then save data back to the file system. For these operations, we cannot afford to have a single failed read operation cause permanent data loss. An example of a read failure would be "file locked for writing by another process".

=cut

=item ReadFileL

Usage:

    ($err, $text) = &ReadFileL( $filename );

Returns the text of the given file, or an error. Uses direct disk I/O rather than the more expensive LockFile package.

=cut

=item ReadInput

Reads CGI form input, or command-line parameters. Initializes %$p_FORM and assigns values.

Usage:

    &ReadInput();

Abstracts the source of the commands (can be query string, standard input, or command-line parameters).

Automatically updates global hash %::FORM.

=cut

=item ReadWrite

=cut

=item Resume

=cut

=item SaveLinksToFileEx

Usage:

    my $err = &SaveLinksToFileEx(
        $p_realm_data,
        $ref_crawler_results,
        $ref_spidered_links,
        $ref_links_new,
        $ref_links_visited_fresh,
        $ref_links_visited_old,
        $ref_links_error,
    );
    if ($err) {
        print "Error: $err.\n";
    }

Saves all links from this crawl session to the pending pages file (search.pending.txt).

File format is:

    URL &ue(realm) number

where number is one of:

    0 => waiting to be indexed
    2 => encountered problems during index
    2+ => epoch time of the index operation

=cut

=item SearchIndexFile

Usage:

    &SearchIndexFile( $index_file, $search_code, \$pages_searched, \@HITS );

Searches the given index file. Uses by-reference return values for the total pages searched and the array of hits.

=cut

=item SearchRunTime

Usage:

    &SearchRunTime( $realm, $DocSearch, \$pages_searched, \@HITS );

=cut

=item SelectAdEx

Usage:

    my @Ads = &SelectAdEx();

Returns the text for up to 4 ads. If keywords are present in $::private{'search_term_patterns'} then the ads will be keyword-based.

=cut

=item SendMailEx

Specification

Lightweight, portable Perl library for sending mail in a reliable fashion. Designed for the occasional message, not for being a massive 24x7 mailer.

Requirements:

    absolutely zero dependencies; no external Perl modules, etc.
    clean: use strict, -w, -W, -T, prototypes ok
    callable as a single standalone function, not a package.
    use byref hash to optionally preserve state between calls
    must be able to send mail w/ raw sockets for those hosts without command-line sendmail (NT)
    must be able to send mail w/ command-line sendmail for those hosts without sockets privileges on port 25 (free webhosts)
    allow caller to specify buffered/unbuffered I/O (sysread vs read, syswrite vs print)
    must be very safe with user data - try really hard not to lose messages (retry, option to save to disk on socket failure, etc.)
    able to send mail multiple ways - sockets, |sendmail, or save-to-file
    must comply with "run 4ever" goal - don't overflow file system with saved messages, etc.
    allow verbose/debug mode which traces all socket traffic
    when possible, should auto-detect necessary SMTP servers - currently uses `nslookup`
    use extracted strings array for error messages.
    allow caller to import a translated set.
    do not write to STDOUT; do your work and return error status; let calling code deal with the user

Internal Structure:

Network Client Cache - %nc_cache - $p_nc_cache

hash (or reference to) with:

    values:
        V:loaded = 1 or undef depending on whether these values have been queried:
            $$p_nc_cache{'V:PF_INET'} = PF_INET();
            $$p_nc_cache{'V:SOCK_STREAM'} = SOCK_STREAM();
            $$p_nc_cache{'V:PROTO'} = scalar getprotobyname('tcp');

    hostnames: (all hostnames converted to lowercase)
        H:foo.bar.com => 4-byte IP address or undef()

Usage:

    my $message = <<"EOM";
    Hi there Bob!
    How has life been treating you?
    Regards,
    Joe
    EOM

    my ($err, $trace) = &SendMailEx(
        'to' => 'user@host.com',
        'to name' => 'Bob User', # *
        'from' => 'me@host.com',
        'from name' => 'Sally User', # *
        'subject' => 'Hi Sally', # *
        'message' => $message,
        'host' => 'mail.foo.com', # *
        'port' => 25, # *
        'saveto' => 'e:/saved_msgs',
        'max_saved_messages' => 1000,
        'handler_order' => '12345',
        'always_save' => 1,
    );
    # * optional field

    if ($err) {
        print "Error: $err.\n";
    }
    else {
        print "Success: sent mail okay.\n";
    }
    print "Here is the trace:\n\n";
    print "