InitLinkPatterns sets up several variables used in analysing text for link patterns.
Returns: nothing useful; the routine is called only for its side effects on global variables.
First, we set up $FS, $FS1, $FS2, and $FS3. The field separator is an unusual character: \xb3, which is a superscript 3 (³) in Latin-1. It was presumably chosen because it is extremely rare in normal text; the code comments state that it is not allowed in user data.
$FS is \xb3, and the others are just \xb3 with '1', '2', or '3' appended.
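As a rough illustration of how such separators can be used, the sketch below joins a record with $FS1, nests a second list inside it with $FS2, and splits both back apart. The record layout and field values are invented for this example and are not taken from the wiki script.

```perl
use strict;
use warnings;

# Same values as in InitLinkPatterns.
my $FS  = "\xb3";
my $FS1 = $FS . "1";
my $FS2 = $FS . "2";

# Hypothetical record layout, for illustration only:
# $FS1 separates the outer fields, $FS2 separates a nested list.
my @inner  = ("summary", "first draft", "minor", "0");
my @outer  = ("HomePage", "3", join($FS2, @inner));
my $record = join($FS1, @outer);

# Splitting reverses the join, as long as no field contains \xb3.
# The limit of -1 keeps trailing empty fields.
my @fields = split(/$FS1/, $record, -1);
my @nested = split(/$FS2/, $fields[2], -1);

print scalar(@fields), " outer fields, ", scalar(@nested), " nested fields\n";
```

Because each level uses a different suffix, a nested list can be stored inside an outer field without any escaping, which is only safe while \xb3 is kept out of user data.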
This routine constructs $LinkPattern and $FreeLinkPattern, depending on options such as $NonEnglish, $UseSubpage, $SimpleLinks, and $FreeLinks. The details are moderately confusing, but the overall idea is simple: we are just building up regular expressions from smaller pieces.
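To make the construction easier to follow, here is a minimal sketch that rebuilds the main link pattern outside the script. The helper name build_link_pattern and its option keys are hypothetical; the real routine reads global settings instead, and the $NonEnglish branch is left out here for brevity.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the wiki script: builds the link pattern
# from explicit options instead of the script's global settings.
sub build_link_pattern {
    my (%opt) = @_;
    my $upper = "[A-Z]";
    my $lower = "[a-z]";
    my $any   = $opt{simple_links} ? "[A-Za-z]" : "[A-Za-z_0-9]";

    # Main pattern: lowercase between uppercase, then anything.
    my $lpA = $upper . "+" . $lower . "+" . $upper . $any . "*";
    # Subpage pattern: uppercase, lowercase, then anything.
    my $lpB = $upper . "+" . $lower . "+" . $any . "*";

    my $pattern = $opt{use_subpage}
        ? "((?:(?:$lpA)?\\/$lpB)|$lpA)"
        : "($lpA)";
    return $pattern . '(?:"")?';    # optional "" delimiter
}

my $plain    = build_link_pattern();
my $with_sub = build_link_pattern(use_subpage => 1);

for my $name ("HomePage", "Homepage", "HomePage/Notes", "/Notes") {
    my $p = ($name =~ /^$plain$/)    ? "yes" : "no";
    my $s = ($name =~ /^$with_sub$/) ? "yes" : "no";
    printf "%-16s plain=%-3s subpage=%s\n", $name, $p, $s;
}
```

"Homepage" is rejected in both modes because the main pattern requires a second uppercase letter after a run of lowercase ones. The two subpage forms are accepted only when the subpage option is on.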
We also set up some other patterns here, such as $UrlPattern and $ISBNPattern.
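The following sketch shows what these extra patterns pick out of a line of text. The pattern strings are copied from the routine, except that the protocol list is shortened; the sample text and the final digit clean-up step are assumptions made for the example.

```perl
use strict;
use warnings;

my $FS     = "\xb3";
my $qDelim = '(?:"")?';

# Shortened protocol list, for illustration only.
my $UrlProtocols = "http|https|ftp|mailto";
my $UrlPattern   = "((?:(?:$UrlProtocols):[^\\]\\s\"<>$FS]+)$qDelim)";
my $RFCPattern   = "RFC\\s?(\\d+)";
my $ISBNPattern  = "(ISBN|ASIN):?([0-9- xX]{10,})";

my $text = 'See <http://example.com/a?b=1> and RFC 2616, '
         . 'or ISBN 0-596-00027-8 for details.';

my ($url)        = $text =~ /$UrlPattern/;
my ($rfc)        = $text =~ /$RFCPattern/;
my ($kind, $num) = $text =~ /$ISBNPattern/;

# The ISBN character class allows spaces and hyphens, so the raw capture
# can include them; this clean-up step is an assumption, not wiki code.
(my $digits = $num) =~ s/[- ]//g;

print "url=$url\nrfc=$rfc\n$kind=$digits\n";
```

Note that the URL stops at the closing angle bracket, which is one of the delimiters excluded by the character class.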
Note: in the current code, the local variables are capitalized.
sub InitLinkPatterns {
  my ($upperLetter, $lowerLetter, $anyLetter, $lpA, $lpB, $qDelim);
  # Field separators are used in the URL-style patterns below.
  $FS  = "\xb3";      # The FS character is a superscript "3"
  $FS1 = $FS . "1";   # The FS values are used to separate fields
  $FS2 = $FS . "2";   # in stored hashtables and other data structures.
  $FS3 = $FS . "3";   # The FS character is not allowed in user data.
  $upperLetter = "[A-Z";
  $lowerLetter = "[a-z";
  $anyLetter   = "[A-Za-z";
  if ($NonEnglish) {
    $upperLetter .= "\xc0-\xde";
    $lowerLetter .= "\xdf-\xff";
    $anyLetter   .= "\xc0-\xff";
  }
  if (!$SimpleLinks) {
    $anyLetter .= "_0-9";
  }
  $upperLetter .= "]";
  $lowerLetter .= "]";
  $anyLetter   .= "]";
  # Main link pattern: lowercase between uppercase, then anything
  $lpA = $upperLetter . "+" . $lowerLetter . "+" . $upperLetter
         . $anyLetter . "*";
  # Optional subpage link pattern: uppercase, lowercase, then anything
  $lpB = $upperLetter . "+" . $lowerLetter . "+" . $anyLetter . "*";
  if ($UseSubpage) {
    # Loose pattern: If subpage is used, subpage may be simple name
    # $LinkPattern = "((?:(?:$lpA)?(?:\\/$lpB)+)|$lpA)";
    $LinkPattern = "((?:(?:$lpA)?\\/$lpB)|$lpA)";
    # Strict pattern: both sides must be the main LinkPattern
    # $LinkPattern = "((?:(?:$lpA)?\\/)?$lpA)";
  } else {
    $LinkPattern = "($lpA)";
  }
  $qDelim = '(?:"")?';     # Optional quote delimiter (not in output)
  $LinkPattern .= $qDelim;
  # Inter-site convention: sites must start with uppercase letter
  # (Uppercase letter avoids confusion with URLs)
  $InterSitePattern = $upperLetter . $anyLetter . "+";
  $InterLinkPattern = "((?:$InterSitePattern:[^\\]\\s\"<>$FS]+)$qDelim)";
  if ($FreeLinks) {
    # Note: the - character must be first in $anyLetter definition
    if ($NonEnglish) {
      $anyLetter = "[-,.()' _0-9A-Za-z\xc0-\xff]";
    } else {
      $anyLetter = "[-,.()' _0-9A-Za-z]";
    }
  }
  $FreeLinkPattern = "($anyLetter+)";
  if ($UseSubpage) {
    # $FreeLinkPattern = "((?:(?:$anyLetter+)?(?:\\/$anyLetter+)+)|$anyLetter+)";
    $FreeLinkPattern = "((?:(?:$anyLetter+)?\\/)?$anyLetter+)";
  }
  $FreeLinkPattern .= $qDelim;
  # Url-style links are delimited by one of:
  #   1.  Whitespace                           (kept in output)
  #   2.  Left or right angle-bracket (< or >) (kept in output)
  #   3.  Right square-bracket (])             (kept in output)
  #   4.  A single double-quote (")            (kept in output)
  #   5.  A $FS (field separator) character    (kept in output)
  #   6.  A double double-quote ("")           (removed from output)
  $UrlProtocols = "http|https|ftp|afs|news|nntp|mid|cid|mailto|wais|"
                  . "prospero|telnet|gopher";
  $UrlProtocols .= '|file'  if $NetworkFile;
  $UrlPattern = "((?:(?:$UrlProtocols):[^\\]\\s\"<>$FS]+)$qDelim)";
  $ImageExtensions = "(gif|jpg|png|bmp|jpeg)";
  $RFCPattern = "RFC\\s?(\\d+)";
  $ISBNPattern = "(ISBN|ASIN):?([0-9- xX]{10,})";
}
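The optional "" delimiter is easy to miss, so here is a small sketch of its effect. It lets a link be followed directly by ordinary letters, as in WikiWord""s. The pattern strings follow the routine (with subpages off and a shortened protocol list); the sample text and the explicit stripping of the trailing "" from the URL capture are assumptions for this example, not necessarily how the wiki script handles it.

```perl
use strict;
use warnings;

my $FS     = "\xb3";
my $qDelim = '(?:"")?';

# Patterns as the routine would build them with subpages off
# (protocol list shortened for illustration).
my $LinkPattern = '([A-Z]+[a-z]+[A-Z][A-Za-z_0-9]*)' . $qDelim;
my $UrlPattern  = "((?:(?:http|https|ftp):[^\\]\\s\"<>$FS]+)$qDelim)";

my $text = 'Read WikiWord""s at http://example.com/page""s today.';

# For page links the delimiter sits outside the capture group, so
# replacing the whole match with $1 drops the "" automatically.
(my $linked = $text) =~ s/$LinkPattern/<$1>/;

# For URLs the delimiter is inside the capture group, so the caller
# has to strip it; this step is an assumption made for the sketch.
my ($url) = $text =~ /$UrlPattern/;
(my $clean = $url) =~ s/""$//;

print "$linked\n$url\n$clean\n";
```

In both cases the letter "s" that follows the delimiter stays in the text as plain output rather than becoming part of the link.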