WikiPatches/RobotsNoFollow


Added a modification of version D in 1.0.2 --MarkusLude
Add this? http://www.google.com/googleblog/2005/01/preventing-comment-spam.html
That really belongs as a separate patch. See /RelNoFollow.
We want to prevent search engine bots from exploring old revisions, search results, and other "helper" pages. Basically, we only want the current version of each article page indexed. This is a WikiSpam-deterring measure.

With this patch, all actual wiki article pages get the INDEX and NOFOLLOW robots meta tags. All other pages get NOINDEX, NOFOLLOW.

Now, in order to get new pages added to the search engines, we need one page that does have the FOLLOW tag. This will be RecentChanges. This page must be indexed by the bots, too, so that they can return to it the next time. Therefore, even though it makes no sense to have the RecentChanges content itself indexed, we need the URL in the bot database. This is why the page also gets the INDEX tag.

If your wiki hasn't yet been fully crawled by the search engine, you might want to manually submit the RecentChanges URL that shows the last 1000 days (UseMod:action=rc&days=1000).
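
For example, if your script were installed at http://example.com/cgi-bin/wiki.pl (a made-up address; substitute your own), the URL to submit would be:

  http://example.com/cgi-bin/wiki.pl?action=rc&days=1000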

Note this patch is rather heavy handed, and could prevent your wiki from being indexed properly by search engines (you won't get such good rankings). To achieve the desired WikiSpam deterrence while allowing search engines to crawl your article pages as normal, apply RobotsNoFollowC below, or the newer /RobotsMetaTag.

For the RobotsNoFollow patch only two changes are needed: the main one is the new code inserted into GetHtmlHeader, and the other is to pass $id to it from GetHeader. Note that for all non-wiki pages, $id is ''.

sub GetHeader {
  [...]
  $result .= &GetHtmlHeader("$SiteName: $title",$id);

sub GetHtmlHeader {
  my ($title,$id) = @_;
  [...]
  # robot FOLLOW tag for RecentChanges only
  # robot INDEX tag for wiki pages only
  # Note that we need to allow INDEX for RecentChanges, else new pages
  # will never be added
  if (($id eq $RCName) || (T($RCName) eq $id) || (T($id) eq $RCName)) {
    $html .= qq(<META NAME="robots" CONTENT="INDEX,FOLLOW">\n);
  } elsif ($id eq '') {
    $html .= qq(<META NAME="robots" CONTENT="NOINDEX,NOFOLLOW">\n);
  } else {
    $html .= qq(<META NAME="robots" CONTENT="INDEX,NOFOLLOW">\n);
  }
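
Just to illustrate the three branches, the generated <head> will then contain one of the following, depending on the page being served:

  <META NAME="robots" CONTENT="INDEX,FOLLOW">      <!-- RecentChanges only -->
  <META NAME="robots" CONTENT="NOINDEX,NOFOLLOW">  <!-- non-wiki pages ($id is '') -->
  <META NAME="robots" CONTENT="INDEX,NOFOLLOW">    <!-- ordinary article pages -->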

Wouldn't it be better to index LocalWiki:action=index instead of RecentChanges? -- StefanTrcek


RobotsNoFollowB

The following is a patch against Version 1.0 that also sets INDEX,FOLLOW for the action=index page, making it a little easier for search engines to find all your pages. The code is taken more or less directly from OddMuse.

Search engines are still not allowed to crawl between article pages, or to follow legitimate external links on your article pages, so this patch is still a little heavy handed. To achieve the desired WikiSpam deterrence while allowing search engines to crawl your article pages as normal, apply RobotsNoFollowC below, or the newer /RobotsMetaTag.

*** wiki.pl	2004-05-25 18:26:38.000000000 +0200
--- larpwiki-robots-nofollow.pl	2004-05-25 18:28:26.000000000 +0200
***************
*** 26,31 ****
--- 26,33 ----
  #    59 Temple Place, Suite 330
  #    Boston, MA 02111-1307 USA
  
+ # applied patch: larpwiki-robots-nofollow.diff
+ 
  package UseModWiki;
  use strict;
  local $| = 1;  # Do not buffer output (localized for mod_perl)
***************
*** 1291,1297 ****
    if ($FreeLinks) {
      $title =~ s/_/ /g;   # Display as spaces
    }
!   $result .= &GetHtmlHeader("$SiteName: $title");
    return $result  if ($embed);
  
    $result .= '<div class=wikiheader>';
--- 1293,1299 ----
    if ($FreeLinks) {
      $title =~ s/_/ /g;   # Display as spaces
    }
!   $result .= &GetHtmlHeader("$SiteName: $title", $id);
    return $result  if ($embed);
  
    $result .= '<div class=wikiheader>';
***************
*** 1342,1348 ****
  }
  
  sub GetHtmlHeader {
!   my ($title) = @_;
    my ($dtd, $html, $bodyExtra, $stylesheet);
  
    $html = '';
--- 1344,1350 ----
  }
  
  sub GetHtmlHeader {
!   my ($title, $id) = @_;
    my ($dtd, $html, $bodyExtra, $stylesheet);
  
    $html = '';
***************
*** 1367,1372 ****
--- 1369,1388 ----
    if ($stylesheet ne '') {
      $html .= qq(<LINK REL="stylesheet" HREF="$stylesheet">\n);
    }
+   # INDEX,NOFOLLOW tag for wiki pages only so that the robot doesn't index
+   # history pages.  INDEX,FOLLOW tag for RecentChanges and the index of all
+   # pages.  We need the INDEX here so that the spider comes back to these
+   # pages, since links from ordinary pages to RecentChanges or the index will
+   # not be followed.
+   if (($id eq $RCName) or (T($RCName) eq $id) or (T($id) eq $RCName)
+       or (lc (GetParam('action', '')) eq 'index')) {
+     $html .= '<meta name="robots" content="INDEX,FOLLOW">';
+   } elsif ($id eq '') {
+     $html .= '<meta name="robots" content="NOINDEX,NOFOLLOW">';
+   } else {
+     $html .= '<meta name="robots" content="INDEX,NOFOLLOW">';
+   }
+   #finish
    $html .= $UserHeader;
    $bodyExtra = '';
    if ($UserBody ne '') {


RobotsNoFollowC

An alternative to the patches above: this one needs no tricks to let search engine robots index the current version of all wiki pages. It only sets the NOINDEX,NOFOLLOW meta tag on diff pages and history pages.

This patch is for UseMod v1.0, by TomScanlan.

If you want to use the feature, set $MetaNoIndexHist = 1; set it to 0 to disable it.
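
For example, once the patch is applied, you would enable it in your config file (if you use one) or directly in wiki.pl, next to the other option variables. A hypothetical excerpt:

  # enable the RobotsNoFollowC behaviour
  $MetaNoIndexHist = 1;   # 1 = NOINDEX,NOFOLLOW on old-revision, diff and history pages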


--- wiki.pl     (revision 16)
+++ wiki.pl     (working copy)
@@ -53,7 +53,7 @@
   @IsbnNames @IsbnPre @IsbnPost $EmailFile $FavIcon $RssDays $UserHeader
   $UserBody $StartUID $ParseParas $AuthorFooter $UseUpload $AllUpload
   $UploadDir $UploadUrl $LimitFileUrl $MaintTrimRc $SearchButton
-  $EditNameLink $UseMetaWiki @ImageSites $BracketImg );
+  $EditNameLink $UseMetaWiki @ImageSites $BracketImg $MetaNoIndexHist);
 # Note: $NotifyDefault is kept because it was a config variable in 0.90
 # Other global variables:
 use vars qw(%Page %Section %Text %InterSite %SaveUrl %SaveNumUrl
@@ -137,6 +137,7 @@
 $LogoLeft     = 0;      # 1 = logo on left,       0 = logo on right
 $RecentTop    = 1;      # 1 = recent on top,      0 = recent on bottom
 $UseDiffLog   = 1;      # 1 = save diffs to log,  0 = do not save diffs
+$MetaNoIndexHist  = 0;      # 1 = Disallow robots indexing old pages, 0 = Allow robots to index old pages
 $KeepMajor    = 1;      # 1 = keep major rev,     0 = expire all revisions
 $KeepAuthor   = 1;      # 1 = keep author rev,    0 = expire all revisions
 $ShowEdits    = 0;      # 1 = show minor edits,   0 = hide edits by default
@@ -1343,8 +1344,12 @@

 sub GetHtmlHeader {
   my ($title) = @_;
-  my ($dtd, $html, $bodyExtra, $stylesheet);
+  my ($dtd, $html, $bodyExtra, $stylesheet, $action, $revision, $diff);

+  $action = lc(&GetParam('action', ''));
+  $revision = lc(&GetParam('revision', ''));
+  $diff = lc(&GetParam('diff', ''));
+
   $html = '';
   $dtd = '-//IETF//DTD HTML//EN';
   $html = qq(<!DOCTYPE HTML PUBLIC "$dtd">\n);
@@ -1358,6 +1363,17 @@
       $keywords =~ s/([a-z])([A-Z])/$1, $2/g;
       $html .= "<META NAME='KEYWORDS' CONTENT='$keywords'/>\n" if $keywords;
   }
+
+  # if we don't want robots indexing our history pages
+  if ($MetaNoIndexHist) {
+       if (($action eq "browse" && $revision ne '') || # looking at an old revision
+               ($action eq "browse" && $diff ne '') ||  # looking at a diff
+               ($action eq "history" ) ) {              # looking at a history page
+
+               $html .= "<META NAME='robots' CONTENT='noindex,nofollow'/>";
+       }
+  }
+
   if ($SiteBase ne "") {
     $html .= qq(<BASE HREF="$SiteBase">\n);
   }



RobotsNoFollowD

I tried out the latter patch ("C") and realized that it lets too much be indexed; in particular, it lets the edit page be indexed, which means spammers could point search engines at wiki.pl?action=edit&revision=10 and still get their junk indexed. Nor does it prevent any other "action=" pages from being indexed. I also thought the other patches were too complex. So I came up with my own patch, which only allows regular pages, RecentChanges and the index page to be indexed; everything else gets "noindex,nofollow". Also, I made this non-optional behavior. So anyway, here it is (just add this into GetHtmlHeader where the previous patch goes).


  # we don't want robots indexing our history or other admin pages
  {
      my $action = lc(&GetParam('action', ''));

      if ($action eq "" ||                            # regular page browse
          $action eq "rc" ||                          # recent changes
          $action eq "index")                         # page list
      {
          $html .= "<META NAME='robots' CONTENT='index,follow'/>\n";
      }
      else
      {
          $html .= "<META NAME='robots' CONTENT='noindex,nofollow'/>\n";
      }
  }

I know this could probably be reduced to two or three lines of code, but I figured that could be done after we agree on what should and shouldn't be indexed. -- Trent

This is probably the most elegant solution, congratulations. One suggestion though -- since robots DO index and follow by default (it's what they are for), the first part of the condition is needless. Replace "if" with "unless" and take out the part up to "else", and it will be even cleaner. -- UngarPeter
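
For reference, the simplified form UngarPeter describes might look roughly like this (a sketch only; the exact code that went into 1.0.2 may differ):

  # we don't want robots indexing our history or other admin pages;
  # plain browsing, RecentChanges and the page index keep the robots'
  # default index,follow behaviour
  {
      my $action = lc(&GetParam('action', ''));

      unless ($action eq '' || $action eq 'rc' || $action eq 'index') {
          $html .= "<META NAME='robots' CONTENT='noindex,nofollow'/>\n";
      }
  }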

I added this for 1.0.2 in the simplified form. Should it be an option? For now I simply added it without one. -- MarkusLude


