[Home]WikiPatches/BetterSearchOutput


Better search output

Many of my users have pointed out that the page titles returned by a search often do not give enough information to tell whether a page contains what they are looking for. So I've implemented a Google-like search results page that shows snippets of each page containing the search text. The code is here for your amusement.

First, add the following to sub DoSearch():

Q: Where exactly in that sub should it be added? The following snippet is fine, but where does it go in the existing code of sub DoSearch?

A: You can safely insert it before the line &PrintPageList(&SearchTitleAndBody($string)); in sub DoSearch, and then comment out that original line. -- UrbanSheep

  if ( $XSearchDisp ) { # managed by config file (?)
    &PrintSearchResults($string,&SearchTitleAndBody($string)) ;
  } else {
    &PrintPageList(&SearchTitleAndBody($string));
  }

And here is sub PrintSearchResults(): [updated 4/17 with a better algorithm and fixes for JM; see below for details]

sub PrintSearchResults {
  my ( $searchstring, @results ) = @_ ;  #  inputs
  my ( $output ) ;

  my ( $name ) ;
  my ( $pageText ) ;
  my ( $t, $j, $jsnippet, $start, $end ) ;
  my ( $snippetlen, $maxsnippets ) = ( 100, 4 ) ; #  these seem nice.

  print "\n<h2>", ($#results + 1), " pages found:</h2>";

  foreach $name (@results) {
    #  get the page, filter it, remove all tags (since we're presenting in
    #  plaintext, not HTML, a la google(tm)).
    &OpenPage($name);
    &OpenDefaultText();
    $pageText = &QuoteHtml($Text{'text'});
    $pageText =~ s/$FS//g;  # Remove separators (paranoia)
    $pageText =~ s/[\s]+/ /g;  #  Shrink whitespace
    $pageText =~ s/([-_=\\*\\.]){10,}/$1$1$1$1$1/g ; # e.g. shrink "----------"
    foreach $t (@HtmlPairs, "pre", "nowiki", "code" ) {
      $pageText =~ s/\<$t(\s[^<>]+?)?\>(.*?)\<\/$t\>/$2/gis;
    }
    foreach $t (@HtmlSingle) {
      $pageText =~ s/\<$t(\s[^<>]+?)?\>//gi;
    }

    #  entry header
    $output = "\n\n" ;
    $output .= "... "  if ($name =~ m|/|);
    $output .= "<font size=+1>" . &GetPageLink($name) . "</font><br>\n" ;

    #  show a snippet from the top of the document
    $j = index( $pageText, " ", $snippetlen ) ;  #  end on word boundary
    $j = length( $pageText )  if ( $j == -1 ) ;  #  short page: take it all
    $output .= substr( $pageText, 0, $j ) . " <b>...</b> " ;
    $pageText = substr( $pageText, $j ) ;  #  to avoid rematching

    #  search for occurrences of searchstring
    $jsnippet = 0 ;
    while ( $jsnippet < $maxsnippets
           &&  $pageText =~ m/($searchstring)/i ) {  #  captures match as $1
      $jsnippet++ ;  #  paranoid about looping
      if ( ($j = index( $pageText, $1 )) > -1 ) {  #  get index of match
        #  get substr containing (start of) match, ending on word boundaries
        $start = index( $pageText, " ", $j-($snippetlen/2) ) ;
        $start = 0  if ( $start == -1 ) ;
        $end = index( $pageText, " ", $j+($snippetlen/2) ) ;
        $end = length( $pageText )  if ( $end == -1 ) ;
        $t = substr( $pageText, $start, $end-$start ) ;
        #  highlight occurrences and tack on to output stream.
        $t =~ s/($searchstring)/<b>$1<\/b>/gi ;
        $output .= $t . " <b>...</b> " ;
        #  truncate text to avoid rematching the same string.
        $pageText = substr( $pageText, $end ) ;
      }
    }

    #  entry trailer
    $output .= "<br><i><font size=-1>"
        . int((length($Text{'text'})/1024)+1) . "K - last updated "
        . &TimeToText($Section{ts}) . " by "
        . &GetAuthorLink( $Section{'host'}, $Section{'username'},
                         $Section{'id'} )
        . "</font></i><br><br>" ;

    print $output ;

  }
}

Then add initialisation of $XSearchDisp into the config section under Major options:

$XSearchDisp = 1;       # 1 = extra text output on search, 0 = normal search output

Then add the $XSearchOutput variable itself to the list under Configuration/constant variables.

I think $XSearchOutput here should be $XSearchDisp, since I couldn't find $XSearchOutput anywhere else and this change fixed the problems I was having :) -- JohnVano
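
For anyone puzzled by why the variable has to be listed twice: under use strict, a global that is assigned in the config section but never declared triggers a "Global symbol ... requires explicit package name" compile error. The following is only a minimal standalone sketch of the idea. It assumes the list under Configuration/constant variables is a use vars qw(...) declaration, and SearchMode is a made-up stand-in for the branch added to DoSearch.

```perl
#!/usr/bin/perl
# Minimal sketch: under "use strict", a package global must be declared
# (here with "use vars") before it can be assigned or read.
use strict;
use warnings;

# Assumption: this mirrors the declaration list that wiki.pl keeps under
# "Configuration/constant variables"; only the new name is shown.
use vars qw($XSearchDisp);

$XSearchDisp = 1;   # 1 = extra text output on search, 0 = normal search output

# Hypothetical stand-in for the if/else branch added to DoSearch.
sub SearchMode {
  return $XSearchDisp ? "snippets" : "titles";
}

print SearchMode(), "\n";
```

Removing the use vars line from this sketch should reproduce the compile error.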

Comments, critiques, etc. are welcome. Particularly if anyone can come up with a better way to do the searching (and grabbing of scalar substrings). --MikeDalessio

Whoops, for me --

[Mon Apr 16 17:58:07 2001] wiki.pl: Global symbol "$XSearchDisp" requires explicit package name at /home/ebooklib.com/cgi-bin/wiki.pl line 3016.

[Mon Apr 16 17:58:07 2001] wiki.pl: Execution of /home/ebooklib.com/cgi-bin/wiki.pl aborted due to compilation errors.

I put $XSearchDisp = 1; in config. -- JerryMuelver

Got it running by using if(1).... Very interesting. I typically have a menu line or Category link at the top of pages (see http://allmyfaqs.com/cgi-bin/wiki.pl?Submit_just_once or http://allmyfaqs.com/cgi-bin/wiki.pl?HomePage), but that's a controllable style change. Maybe the search function could return the first non-link line for the initial info fetch? The context phrase function is very useful in my search-faq setting. -- JerryMuelver

WOW! Look what Mike hath wrought -- go to http://allmyfaqs.com/cgi-bin/wiki.pl?HTML_FAQs and run searches for hide.*source or maybe back.*button and check the results! I detuned the font sizes (h2 > h3, no font size +1) to get a denser page, but Mike did all the hard work. Wonderful! -- JerryMuelver

BUG -- If the search word/phrase is in the title of the page, the font enhancement breaks the URL -- Include_one_file_in_another becomes <b>Include</b>_one_file_in_another which gives a 404. See http://allmyfaqs.com/cgi-bin/wiki.pl?HTML_FAQs and search for "include", scroll to "Include one file in another". -- JerryMuelver

Fix -- I removed the initial $output assignment --

    $output = &GetPageLink($name) . "<br>\n";

and tacked onto the highlighting line --

    $output =~ s/($searchstring)/<b>$1<\/b>/gi ;
    $output = &GetPageLink($name) . "<br>\n" . $output;

All is well in the world again! -- JerryMuelver
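
The point of the fix is ordering: highlight the snippet text first, and only then prepend the page link, so the substitution never touches the URL. Here is a small standalone sketch of that ordering. PageLink and FormatEntry are hypothetical names (PageLink merely imitates what GetPageLink might return), not code from wiki.pl.

```perl
use strict;
use warnings;

# Hypothetical stand-in for GetPageLink: returns an anchor for a page id.
sub PageLink {
  my ($id) = @_;
  (my $label = $id) =~ s/_/ /g;
  return "<a href=\"wiki.pl?$id\">$label</a>";
}

# Highlight matches in the snippet only, then prepend the untouched link.
sub FormatEntry {
  my ($id, $snippet, $searchstring) = @_;
  $snippet =~ s/($searchstring)/<b>$1<\/b>/gi;
  return PageLink($id) . "<br>\n" . $snippet;
}

print FormatEntry("Include_one_file_in_another",
                  "How to include a file", "include"), "\n";
```

Because the link is concatenated after the substitution has run, a search term that also occurs in the page title leaves the href intact.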

Blush ... Glad you like it. Thanks much for the bugfix (I must have slipped up when my caffeine levels fell below normal ;). I'm still not completely happy with how the searching is being done (i.e., via the index() function), but I can't think of any other way to do it so that we can subsequently grab a substring. My momma told me that, in perl, There's More Than One Way To Do It, so somebody must have a better idea. Bueller? Bueller? --MikeDalessio

Improved algorithm - New code above supports regexps correctly, using a one-two combo of m// and index(). It also addresses JM's highlighting bug. Now I'm happy with it. --MikeDalessio

Me, too! Thanks, Mike! -- JerryMuelver
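
On the "better way to grab substrings" question, one possible alternative to the m// plus index() combination is to read the match offsets straight from Perl's built-in @- and @+ arrays, which avoids the second lookup and stays correct when the pattern is a regexp. This is only a sketch under my own assumptions: Snippets is a hypothetical helper working on plain text, not a drop-in replacement for the loop in PrintSearchResults.

```perl
use strict;
use warnings;

# Return up to $max snippets of roughly $len characters around each match.
sub Snippets {
  my ($text, $searchstring, $len, $max) = @_;
  my @found;
  my $from = 0;                       # only match at or after this offset
  while (@found < $max) {
    pos($text) = $from;
    last unless $text =~ m/$searchstring/gi;
    my ($mstart, $mend) = ($-[0], $+[0]);   # offsets of the whole match
    last if $mend == $mstart;               # ignore empty matches

    # Widen to the nearest spaces so the snippet ends on word boundaries.
    my $left  = $mstart - int($len / 2);
    my $start = $left > 0 ? rindex($text, " ", $left) + 1 : 0;
    $start = $from if $start < $from;       # do not overlap the last snippet
    my $end = index($text, " ", $mend + int($len / 2));
    $end = length($text) if $end == -1;

    push @found, substr($text, $start, $end - $start);
    $from = $end;                           # continue after this snippet
  }
  return @found;
}

my $text = "the quick brown fox jumps over the lazy dog and the quick cat";
print "$_ ...\n" for Snippets($text, "qu.ck", 10, 4);
```

Instead of truncating the page text after each snippet, the sketch moves a start offset forward with pos(), so the original string stays available, for example for reporting the page size.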

Small fix: To avoid html errors one should change

$output .= "<font size=+1>" . &GetPageLink($name) . "</font><br>\n" ;
to
$output .= "<font size=\"+1\">" . &GetPageLink($name) . "</font><br>\n" ;
- Richard


Better search output

How about showing the last header before the match?

That could also be nice for the index-page (in this case, showing the first header from the page)...

--HaJoGurt
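
A rough sketch of how the "last header before the match" idea could work, for what it's worth. It assumes headings are written on their own line as "== Title ==", and that it runs on the raw page text before whitespace is collapsed. LastHeadingBefore is a hypothetical helper, not part of the patch.

```perl
use strict;
use warnings;

# Return the text of the last heading that starts before offset $pos,
# or undef if there is none. Assumes headings sit on their own line as
# "== Title ==", with the same run of one to six "=" on each side.
sub LastHeadingBefore {
  my ($rawText, $pos) = @_;
  my $heading;
  while ($rawText =~ m/^(={1,6})[ \t]+(.+?)[ \t]+\1[ \t]*$/mg) {
    last if $-[0] >= $pos;   # this heading starts at or after the match
    $heading = $2;
  }
  return $heading;
}

my $page = "== Intro ==\nsome text\n== Setup ==\ninstall the patch here\n";
my $heading = LastHeadingBefore($page, index($page, "patch"));
print defined $heading ? $heading : "(no heading)", "\n";
```

For the index page, the same loop could simply stop at the first heading it finds.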


I applied this patch and extended it. You can choose between the original and the Google-like output at runtime.

Implementation:

/PatchForBetterSearchOutput

--StefanTrcek

/SearchWithOperators has the code for the OddMuse search which allows "and" and "or" operators, and which adapts the patch on this page for appropriate highlighting. -- AlexSchroeder


KnutK: This patch does not support the wiki-translation-feature :-(

BUT: Here is a version that does (I use the existing T(..) and Ts(..) functions).

I also changed how the search string is highlighted: style='background: #FFFFCC'

sub PrintSearchResults {
  my ( $searchstring, @results ) = @_ ;  #  inputs
  my ( $output ) ;

  my ( $name ) ;
  my ( $pageText ) ;
  my ( $t, $j, $jsnippet, $start, $end ) ;
  my ( $snippetlen, $maxsnippets ) = ( 100, 4 ) ; #  these seem nice.

#  print "\n<h2>", ($#results + 1), " ", T('pages found:'), "</h2>";
   print "\n<h2>", Ts('%s pages found:', ($#results + 1)), "</h2>";

  foreach $name (@results) {
    #  get the page, filter it, remove all tags (since we're presenting in
    #  plaintext, not HTML, a la google(tm)).
    &OpenPage($name);
    &OpenDefaultText();
    $pageText = &QuoteHtml($Text{'text'});
    $pageText =~ s/$FS//g;  # Remove separators (paranoia)
    $pageText =~ s/[\s]+/ /g;  #  Shrink whitespace
    $pageText =~ s/([-_=\\*\\.]){10,}/$1$1$1$1$1/g ; # e.g. shrink "----------"
    foreach $t (@HtmlPairs, "pre", "nowiki", "code" ) {
      $pageText =~ s/\<$t(\s[^<>]+?)?\>(.*?)\<\/$t\>/$2/gis;
    }
    foreach $t (@HtmlSingle) {
      $pageText =~ s/\<$t(\s[^<>]+?)?\>//gi;
    }

    #  entry header
    $output = "\n\n" ;
    $output .= "... "  if ($name =~ m|/|);
    $output .= "<b>" . &GetPageLink($name) . "</b><br>\n" ;

    #  show a snippet from the top of the document
    $j = index( $pageText, " ", $snippetlen ) ;  #  end on word boundary
    $j = length( $pageText )  if ( $j == -1 ) ;  #  short page: take it all
    $output .= substr( $pageText, 0, $j ) . " <b>...</b> " ;
    $pageText = substr( $pageText, $j ) ;  #  to avoid rematching

    #  search for occurrences of searchstring
    $jsnippet = 0 ;
    while ( $jsnippet < $maxsnippets
           &&  $pageText =~ m/($searchstring)/i ) {  #  captures match as $1
      $jsnippet++ ;  #  paranoid about looping
      if ( ($j = index( $pageText, $1 )) > -1 ) {  #  get index of match
        #  get substr containing (start of) match, ending on word boundaries
        $start = index( $pageText, " ", $j-($snippetlen/2) ) ;
        $start = 0  if ( $start == -1 ) ;
        $end = index( $pageText, " ", $j+($snippetlen/2) ) ;
        $end = length( $pageText )  if ( $end == -1 ) ;
        $t = substr( $pageText, $start, $end-$start ) ;
        #  highlight occurrences and tack on to output stream.
        $t =~ s/($searchstring)/<b style='background: #FFFFCC'>$1<\/b>/gi ;
        $output .= $t . " <b>...</b> " ;
        #  truncate text to avoid rematching the same string.
        $pageText = substr( $pageText, $end ) ;
      }
    }

    #  entry trailer
    $output .= "<br><i><font color=gray>"
        . int((length($Text{'text'})/1024)+1) . "K - " . T('Last edited') . " "
        . &TimeToText($Section{ts}) . " " . T('by') . " "
        . &GetAuthorLink( $Section{'host'}, $Section{'username'},
                         $Section{'id'} )
        . "</font></i><br><br>" ;

    print $output ;

  }
}
--KnutK
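
To illustrate why the Ts('%s pages found:', ...) form is preferable to gluing a number onto T('pages found:'): the translated string decides where the number goes. The sketch below uses simplified stand-ins for T() and Ts(). The %Translate table and its German entries are illustrative assumptions, not taken from a real translation file.

```perl
use strict;
use warnings;

# Assumption: the translation feature loads a lookup table like this one;
# the entries are made up for the example.
my %Translate = (
  '%s pages found:' => '%s Seiten gefunden:',
  'by'              => 'von',
);

# Simplified stand-in for T(): return the translation, or the original
# text when no non-empty translation exists.
sub T {
  my ($text) = @_;
  return (defined $Translate{$text} && $Translate{$text} ne '')
       ? $Translate{$text} : $text;
}

# Simplified stand-in for Ts(): translate, then fill in the first %s.
sub Ts {
  my ($text, $string) = @_;
  $text = T($text);
  $text =~ s/%s/$string/;
  return $text;
}

print Ts('%s pages found:', 3), "\n";
print T('Last edited'), "\n";   # untranslated strings pass through
```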

Last edited February 3, 2005 12:11 am by user-10cmeae.cable.mindspring.com (diff)