Free Online Courses for Software Developers - MrBool
How to create your own search engine with PHP and MySQL

Search engines have become a useful tool in today's internet world. In this article we will look at their basics, their uses and the tools needed to build one.

Search engines have become a useful tool in today's internet world. They are helpful for developers, testers, managers and other internet users, because they let us find information on any topic of our choice. In this article we will first cover the basics of search engines and then see how to develop our own using PHP and MySQL. Our goal is not to replace giants such as Google or Yahoo, but to make a good attempt at a search engine of our own.

What is a search engine?

A search engine is a web-based tool which allows internet users to find information on the internet. The most commonly used search engines are Google, Yahoo!, MSN, Bing and Ask. Search engines are programs that search documents for specified keywords and return a list of the documents in which those keywords were found. A search engine is usually a collection of cooperating programs; however, the term ‘search engine’ is often used to describe whole systems such as Google, Bing and Yahoo! Search. Search engines generally use automated software applications, known as robots or spiders, which move across the Web and follow links from page to page and site to site. The information collected by these crawlers is used to create a searchable index of the Web.

How do search engines work?

Every search engine relies on several complex mathematical formulae to generate its search results. The results obtained for a specific query are then displayed on the SERP, or Search Engine Results Page. Search engine algorithms extract the key elements of a web page, including the page title, the content and the keyword density, and then compute a ranking that determines where each result is placed. Each search engine has its own unique algorithm, so a top ranking on Yahoo does not guarantee a similar ranking on Google, and vice versa. The search algorithms used by search engines are

  • very closely guarded secrets, and
  • constantly undergoing modification and revision.

Thus the criteria for optimizing a site cannot be established once and for all; they have to be checked through continuous observation and repeated attempts.

It is generally said that the quick fixes hyped by less reputable SEO (Search Engine Optimization) firms work for only a limited period, until the developers of the search engine become wise to the tactics and change their algorithm. Sites using these tricks are normally labeled as spam by the search engines and, as a consequence, their rankings go down.

Search engines only read the text on web pages and use the underlying HTML structure to determine relevance. Images, photos and dynamic Flash animations mean little to search engines; it is the actual text content that matters to them. Building a Flash site that is friendly to search engines is a challenging task for Flash developers. As a result, Flash sites tend not to rank as high as sites developed with well coded HTML and CSS (Cascading Style Sheets, a mechanism for adding styles to web pages above and beyond regular HTML). If the terms we want to be found by do not appear in the text of our website, it will be very difficult for the website to achieve a high placement on the Search Engine Results Pages.

Web Search Engines

When we talk about search engines we normally mean Web search engines. A typical Web search engine starts by sending out a spider to fetch as many documents as possible. Another program, called the indexer, then reads these documents and creates an index based on the tokens found in each document. Each search engine uses its own proprietary algorithm to create its indices so that, ideally, only meaningful results are returned for each query.
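The indexing step described above can be sketched in a few lines of PHP. This is only a minimal illustration, not how any real search engine builds its index: the function names `tokenizeText` and `buildInvertedIndex` and the sample documents are made up for this example.

```php
<?php
// Hypothetical helper: split text into lowercase word tokens.
function tokenizeText(string $text): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Build a map of token => list of ids of the documents containing that token.
function buildInvertedIndex(array $documents): array
{
    $index = [];
    foreach ($documents as $docId => $text) {
        // array_unique() makes sure each document is listed once per token.
        foreach (array_unique(tokenizeText($text)) as $token) {
            $index[$token][] = $docId;
        }
    }
    return $index;
}

// Sample data (invented for the example).
$documents = [
    1 => 'Air pollution in large cities',
    2 => 'Water pollution and clean water',
    3 => 'Fresh air and clean water',
];

$index = buildInvertedIndex($documents);
print_r($index['pollution']); // lists documents 1 and 2
```

Answering a query then becomes a lookup in `$index` instead of a scan over every document.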

Since almost every website owner relies on search engines to send traffic to their website, and an entire industry has grown around optimizing Web content to improve its placement in search results, we should acquire adequate knowledge of search engine optimization (SEO).

Characteristics of Search Engines

  • Unedited – Search engines are unedited. Anyone can publish content that ends up being indexed.
    • As a result, some search engines have inherent quality issues;
    • Search engines can mark websites as spam if they use tricks in order to get to the top of the SERP.
  • Search engines offer an array of information types, e.g. phone books, brochures, catalogs, dissertations, news reports and weather, all in one place!
  • Search engines are able to cater to the needs of different kinds of users.

Tips to be followed while Using Search Engines

To be good web searchers, we must make sure that what we type into the search engine describes exactly what we are looking for. Search operators help with this, and they fall into two broad categories –

  • Boolean operators and
  • Non-Boolean operators

An operator is a word or symbol that we type to give the search engine directions about what to search for. Using these operators we can either narrow or broaden our search, which helps us locate the web sites that may be useful to us.

Each search engine supports its own set of operators, so we need to check which ones the engine we are using understands.

Boolean Searching Techniques:

  • AND - The search results must contain both words. It is often typed in UPPER CASE, although this is not mandatory. AND reduces the number of matching web sites, thus narrowing in on specific topics. e.g. air AND water AND pollution will search for the web sites which contain all three keywords - air, water and pollution.
  • OR - The search results may contain either of the terms entered. This increases the number of web sites that match, thus broadening our search. e.g. air OR pollution OR water will look for the web sites that contain air, pollution or water.
  • NOT - Any result containing the second keyword is excluded from the search results. Using NOT decreases the number of sites listed, thus narrowing our search. e.g. if we enter air NOT pollution, the results will list the web sites that contain air but not pollution.

We should be very careful while using NOT, because a web site that mentions pollution even once will be excluded from the search results, which could lead to the exclusion of some important web sites. Similarly, we should be careful while using OR, as it may return a huge number of sites that we then have to sort through.
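One way to picture the three Boolean operators is as set operations on lists of matching documents. The sketch below assumes a toy index (the document ids are invented) and uses hypothetical helper names `searchAnd`, `searchOr` and `searchNot`; real search engines implement this far more efficiently.

```php
<?php
// Toy index (assumed data): token => ids of the documents that contain it.
$index = [
    'air'       => [1, 2, 4],
    'water'     => [2, 3, 4],
    'pollution' => [1, 2, 3],
];

// AND: keep only documents present in both lists (narrows the search).
function searchAnd(array $a, array $b): array
{
    return array_values(array_intersect($a, $b));
}

// OR: documents present in either list, without duplicates (broadens the search).
function searchOr(array $a, array $b): array
{
    $merged = array_values(array_unique(array_merge($a, $b)));
    sort($merged);
    return $merged;
}

// NOT: documents in the first list that are absent from the second (narrows the search).
function searchNot(array $a, array $b): array
{
    return array_values(array_diff($a, $b));
}

// air AND water AND pollution
$all = searchAnd(searchAnd($index['air'], $index['water']), $index['pollution']);
// air OR pollution
$any = searchOr($index['air'], $index['pollution']);
// air NOT pollution
$without = searchNot($index['air'], $index['pollution']);
```

With this sample data only document 2 contains all three words, while document 4 is the only one that mentions air without mentioning pollution.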

Non-Boolean Searching Techniques:

  • + This works in the same way as AND: it makes the token mandatory in the search results. The + symbol should be placed directly in front of the search token, without any space in between. e.g. +air +water +pollution will look for all the web sites that contain these three tokens - air, water and pollution.
  • – This works in the same way as NOT: it excludes the token that follows the symbol -. e.g. air -pollution will list all the web sites that contain air but not pollution.
  • “ ” These symbols are placed around the tokens to indicate that the search engine should look for the exact phrase. e.g. “air water pollution” will look for that exact phrase. This will make your search very specific.

As with the Boolean operators, we must be careful while using –, as it can eliminate web sites that merely mention the term we do not want but are not really about that term. For example, air pollution -water will eliminate all the web sites that are about air pollution but mention water pollution as well.
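To make the +, - and quote operators concrete, here is a small sketch of how a query string could be split into its parts. The function name `parseQuery` and the four result groups are assumptions made for this example, and it only recognises the plain ASCII characters `+`, `-` and `"`, not the typographic variants.

```php
<?php
// Split a query into exact phrases, required (+), excluded (-) and plain terms.
function parseQuery(string $query): array
{
    $parsed = ['phrases' => [], 'required' => [], 'excluded' => [], 'optional' => []];

    // Pull out quoted phrases first so that their words are not split apart.
    $rest = preg_replace_callback('/"([^"]+)"/', function ($match) use (&$parsed) {
        $parsed['phrases'][] = strtolower(trim($match[1]));
        return ' ';
    }, $query);

    foreach (preg_split('/\s+/', trim($rest), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        $word = strtolower(ltrim($token, '+-'));
        if ($word === '') {
            continue; // a lone "+" or "-" carries no term
        }
        if ($token[0] === '+') {
            $parsed['required'][] = $word;
        } elseif ($token[0] === '-') {
            $parsed['excluded'][] = $word;
        } else {
            $parsed['optional'][] = $word;
        }
    }
    return $parsed;
}

$parsed = parseQuery('+air -water "clean energy" pollution');
print_r($parsed);
```

Each group can then be handled separately, for example by requiring every `required` term and dropping any document that contains an `excluded` one.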

Creating our own search engines

Tired of using generic search engines such as Google and Yahoo every time we want to look something up? A good idea would be to try to build our own search engine using open source technologies like PHP and MySQL. Let me make it clear that our goal is not to throw giants such as Google, Yahoo and Bing out of the market, but to make a good attempt at a search engine of our own.

In this approach we will learn to build our own search engine, so that visitors to our website can search it through an HTML search form with a standard button. We will use the PHP language and the MySQL database to implement this feature, so readers are expected to have a good knowledge of the basic concepts of both before they start implementing. In this document we will use the most basic code and avoid complex SQL queries; we assume that the basics of Structured Query Language (SQL) are known to you. Shortly we will write our first HTML code, which creates the search form and button that every visitor will use to enter a query.

Database Part:

Since MySQL is one of the prerequisites of our approach, our first step is to get a MySQL database up and running. To connect to MySQL we can use any of the free UI-based tools, e.g. Squirrel, HeidiSQL or DBVisualiser, or the MySQL admin console. Once connected, let us run the following SQL, which will create a table called SEARCH_ENGINE.

Listing 1: An SQL statement which will create a table –

CREATE TABLE SEARCH_ENGINE (
       `id` INT(11) NOT NULL AUTO_INCREMENT,
       `title` VARCHAR(255) NOT NULL,
       `description` TEXT NOT NULL,
       `url` VARCHAR(255) NOT NULL,
       `keywords` TEXT NOT NULL,
       PRIMARY KEY (`id`));

The above query creates the table that stores the details of each page. The columns title, description, url and keywords are the ones that search.php reads later in this article.

Creating the Form:

Now that the database is ready, let us make the form which visitors will use to perform their search. Let us call this file 'index.php'; it is a simple search form with a button. Here we use GET instead of POST, so the submitted search terms are visible in the address bar.

Listing 2: Our index.php file –

 
<html>
       <head>
             <title>My search engine</title>
       </head>
       <body>
             <form action='search.php' method='GET'>
                    <center>
                           <h1>My Search Engine</h1>
                           <input type='text' size='90' name='search'>
                           <br>
                           <br>
                           <input type='submit' name='submit' value='Search source code'>
                           <select name='limit'>
                                  <option>10</option>
                                  <option>20</option>
                                  <option>50</option>
                           </select>
                    </center>
             </form>
       </body>
</html>

Our form is now complete and ready to be used. End users will use it to enter a query, and the options 10, 20 and 50 are meant to let them restrict the number of results shown. Note that search.php, as listed in this article, does not read that value.
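If we wanted search.php to honour the result count chosen in the form, the value would have to be validated first. The sketch below assumes the count arrives under a hypothetical query-string parameter called `limit`; the function name `resultLimit` is also made up for this example.

```php
<?php
// Return a safe result limit taken from the query-string parameters.
// Anything that is not one of the allowed values falls back to the default.
function resultLimit(array $params, array $allowed = [10, 20, 50], int $default = 10): int
{
    $value = isset($params['limit']) ? (int) $params['limit'] : $default;
    return in_array($value, $allowed, true) ? $value : $default;
}

// Typical use inside search.php (assumed): $limit = resultLimit($_GET);
// The integer could then be appended to the query as " LIMIT $limit".
```

Because the function only ever returns one of the whitelisted integers, the value is safe to place into an SQL LIMIT clause.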

Processing the Query:

Let us create a new file, 'search.php', which is the page where the search results will be shown. This file is divided into the following sections -

· Let us connect to the database first:

Listing 3: DB connection

 
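       // Note: the mysql_* extension was removed in PHP 7; on current PHP versions use mysqli or PDO instead.
       mysql_connect ( "localhost", "USER_NAME", "PASSWORD" ) ; 
       mysql_select_db ( "DB_NAME" );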

· Form the query - Once we are connected to the DB, we form the query from the tokens that the end user has entered. Each token becomes one LIKE condition, and the conditions are joined with AND. This is shown below -

Listing 4: Construct the query along with the tokens users have entered –

 
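       // Split the search string into individual terms, skipping repeated spaces.
       $search_exploded = preg_split ( '/\s+/', trim ( $search ), -1, PREG_SPLIT_NO_EMPTY );
       $construct = "";   // must be initialised once, before the loop
       $x = 0; 
       foreach( $search_exploded as $search_each ) {
             $x++;
             // Escape each term before placing it inside the SQL string.
             $search_each = mysql_real_escape_string ( $search_each );
             if( $x == 1 )
                    $construct .= "keywords LIKE '%$search_each%' ";
             else
                    $construct .= "AND keywords LIKE '%$search_each%' ";
       }
       $construct = " SELECT * FROM SEARCH_ENGINE WHERE $construct ";
       // Note: the mysql_* functions were removed in PHP 7; use mysqli or PDO on current versions.
       $run = mysql_query( $construct ); 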
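On current PHP versions the same query is usually built with placeholders and run as a prepared statement, so that user input never becomes part of the SQL text. The following is only a sketch under that assumption: `buildSearchQuery` is a hypothetical helper, and the PDO calls in the closing comment assume that a connection object `$pdo` already exists.

```php
<?php
// Build the SELECT statement with one "?" placeholder per search term.
// Returns [sql, params]; sql is null when the search string holds no terms.
function buildSearchQuery(string $search): array
{
    $terms = preg_split('/\s+/', trim($search), -1, PREG_SPLIT_NO_EMPTY);
    $conditions = [];
    $params = [];
    foreach ($terms as $term) {
        $conditions[] = 'keywords LIKE ?';
        // Escape LIKE wildcards typed by the user, then wrap the term in % for "contains".
        $params[] = '%' . addcslashes($term, '%_\\') . '%';
    }
    if ($conditions === []) {
        return [null, []];
    }
    $sql = 'SELECT * FROM SEARCH_ENGINE WHERE ' . implode(' AND ', $conditions);
    return [$sql, $params];
}

[$sql, $params] = buildSearchQuery('air  pollution');

// With a PDO connection in $pdo (assumed), the query could then be run as:
//   $stmt = $pdo->prepare($sql);
//   $stmt->execute($params);
//   $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
```

Joining the conditions with `implode()` also removes the need for the `$x` counter that decides where an AND belongs.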

· Our next job is to fetch the results from the database and present them to the user. If the search doesn't yield any results, we should show an appropriate message to the user, as shown below -

Listing 4 (continued): Fetch the result and present it to the user –

 
       $foundnum = mysql_num_rows( $run );
       // Escape the user's input before echoing it back into the page.
       $safe_search = htmlspecialchars( $search, ENT_QUOTES, 'UTF-8' );
       if ( $foundnum == 0 )
             echo "Sorry, there are no matching results for <b> $safe_search </b>.
             <br>
             <br> 1. Try more general words. For example, if you want to search for 'how to create a website', use general keywords like 'create' 'website'.
             <br> 2. Try different words with a similar meaning.
             <br> 3. Please check your spelling."; 
       else {
             echo "$foundnum results found!<p>";
             while ( $runrows = mysql_fetch_assoc( $run ) ) {
                    $title = $runrows['title'];
                    $desc = $runrows['description'];
                    $url = $runrows['url'];
                    echo "<a href='$url'> <b> $title </b> </a> <br> $desc <br> <a href='$url'> $url </a> <p>";
             }
       } 
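Values read from the database end up inside HTML, so it is safer to escape them before printing. The sketch below shows one way to do that with `htmlspecialchars()`; the function name `renderResult` is an assumption for this example, and the row is expected to carry the `title`, `description` and `url` keys used in the listing above.

```php
<?php
// Render one result row as HTML, escaping every value taken from the database.
function renderResult(array $row): string
{
    $title = htmlspecialchars($row['title'], ENT_QUOTES, 'UTF-8');
    $desc  = htmlspecialchars($row['description'], ENT_QUOTES, 'UTF-8');
    $url   = htmlspecialchars($row['url'], ENT_QUOTES, 'UTF-8');

    return "<a href='$url'><b>$title</b></a><br>$desc<br><a href='$url'>$url</a><p>";
}

// Example row (invented data) containing characters that must not reach the page unescaped.
$html = renderResult([
    'title'       => '<script>x</script>',
    'description' => 'Tom & Jerry',
    'url'         => "http://example.com/?q='a'",
]);
echo $html;
```

`ENT_QUOTES` matters here because the links use single-quoted attributes, so a stray quote in a URL would otherwise end the attribute early.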

Now our search engine is ready to be used. The code explained above in parts is listed in full below -

Listing 5: The complete search.php file –

<?php
       // Read the submitted values; isset() avoids notices when the page is opened directly.
       $button = isset( $_GET['submit'] ) ? $_GET['submit'] : '';
       $search = isset( $_GET['search'] ) ? trim( $_GET['search'] ) : '';
 
       if( !$button )
             echo "You didn't submit a keyword";
       else {
             if( strlen( $search ) <= 1 )
                    echo "Search term too short";
             else {
                    // Escape the user's input before echoing it back into the page.
                    $safe_search = htmlspecialchars( $search, ENT_QUOTES, 'UTF-8' );
                    echo "You searched for <b> $safe_search </b> <hr size='1'> <br> ";

                    // Note: the mysql_* functions were removed in PHP 7; use mysqli or PDO on current versions.
                    mysql_connect( "localhost", "USERNAME", "PASSWORD" ); 
                    mysql_select_db( "DBNAME" );
 
                    // Build one LIKE condition per search term, joined with AND.
                    $search_exploded = preg_split( '/\s+/', $search, -1, PREG_SPLIT_NO_EMPTY );
                    $construct = "";   // must be initialised once, before the loop
                    $x = 0; 
                    foreach( $search_exploded as $search_each ) {
                           $x++;
                           $search_each = mysql_real_escape_string( $search_each );
                           if( $x == 1 )
                                  $construct .= "keywords LIKE '%$search_each%' ";
                           else
                                  $construct .= "AND keywords LIKE '%$search_each%' ";
                    }
 
                    $construct = " SELECT * FROM SEARCH_ENGINE WHERE $construct ";
                    $run = mysql_query( $construct );
 
                    $foundnum = mysql_num_rows( $run );
 
                    if ( $foundnum == 0 )
                           echo "Sorry, there are no matching results for <b> $safe_search </b>. <br> <br> 1. Try more general words. For example, if you want to search for 'how to create a website', use general keywords like 'create' 'website'. <br> 2. Try different words with a similar meaning. <br> 3. Please check your spelling."; 
                    else {
                           echo "$foundnum results found!<p>";
 
                           while( $runrows = mysql_fetch_assoc( $run ) ) {
                                  $title = $runrows['title'];
                                  $desc = $runrows['description'];
                                  $url = $runrows['url'];
 
                                  echo "<a href='$url'> <b> $title </b> </a> <br> $desc <br> <a href='$url'> $url </a> <p>";
                           }
                    }
             }
       }
?>

Search Engine architecture

Before going into further details, let us talk about what our goals should be while developing a search engine. Listed below is a brief set of goals we should focus on -

  • The web crawler, indexer and document storage should be capable of handling a large volume of documents, perhaps 1 million or even more.
  • We should follow test driven development, which helps to enforce good design and modular code.
  • We should be able to support various strategies for things like the index, the document store and the search.

A typical search engine consists of a few parts -

  • A crawler, which is used to pull in external documents.
  • An index, which stores the documents in inverted form, that is, as a mapping from each term to the documents that contain it, and
  • A document store to keep the documents.

THE CRAWLER

In order to crawl, we first need a list of URLs. There are a few generic ways to obtain one, as listed below -

  • The most common is to seed the crawler with a list of pages that contain lots of links. Our next job is to crawl them and harvest further links as we go down the list.
  • Another approach is to download a ready-made list of URLs and then use that list.

Since we only want the website addresses from the downloaded list, let us write a simple parser to extract the appropriate data. It is quite straightforward, as shown below -

Listing 6: The parser –

       $file_handle = fopen( "Quantcast-Top-Million.txt", "r" );
 
       while ( ( $line = fgets( $file_handle ) ) !== false ) {
             if( preg_match( '/^\d+/', $line ) ) { # the line starts with one or more digits (the rank)
                    $tmp = explode( "\t", $line );
                    $rank = trim( $tmp[0] );
                    $url = trim( $tmp[1] );
                    if( $url != 'Hidden profile' ) { # "Hidden profile" appears sometimes; ignore those lines
                           # Assumed completion: the original listing is cut off after "echo $"
                           echo $url . "\n";
                    }
             }
       }
       fclose( $file_handle );

DOWNLOADING

Downloading the data is going to take some time, so we should be prepared for a long wait. We can write a very basic crawler in PHP simply by calling file_get_contents with a URL. Let us have a look at the following code -

Listing 7: The crawler –

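        $file_handle = fopen("urllist.txt", "r");
         while (($line = fgets($file_handle)) !== false) {
                 $url = trim($line);
                 if ($url === '') {
                         continue; // skip blank lines
                 }
                 $content = @file_get_contents($url);
                 if ($content === false) {
                         continue; // skip URLs that could not be downloaded
                 }
                 // Keep the URL with the content so we know where the document came from.
                 $document = array($url,$content);
                 $serialized = serialize($document);
                 $fp = fopen('./documents/'.md5($url), 'w');
                 fwrite($fp, $serialized);
                 fclose($fp);
         }
         fclose($file_handle);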

The above code is essentially a single threaded crawler. It simply loops over every URL in the file, pulls down the content and then saves it to disk. The only thing to note here is that it stores the URL together with the content in one document, since we might need the URL for ranking purposes and it is also helpful to keep track of where the document came from. We should keep in mind that we may run into file system limits when trying to store lots of documents in one folder.
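A common way around the one-folder limit mentioned above is to spread the files over sub-folders named after the first characters of the hash. The sketch below is one possible layout, not part of the original crawler; the function names `documentPath` and `storeDocument` are assumptions for this example.

```php
<?php
// Map a URL to a storage path spread over up to 256 sub-folders (00 to ff).
function documentPath(string $url, string $baseDir = './documents'): string
{
    $hash = md5($url);
    return $baseDir . '/' . substr($hash, 0, 2) . '/' . $hash;
}

// Store the URL and its content in the same serialized layout the crawler uses.
function storeDocument(string $url, string $content, string $baseDir = './documents'): string
{
    $path = documentPath($url, $baseDir);
    $dir = dirname($path);
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true); // create the sub-folder on first use
    }
    file_put_contents($path, serialize([$url, $content]));
    return $path;
}

// Inside the crawler loop this would replace the fopen/fwrite/fclose calls, e.g.:
//   storeDocument($url, $content);
```

With two hexadecimal characters per folder name, a million documents end up as roughly four thousand files per folder.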

THE INDEX

The reason I mentioned test driven development earlier is that I prefer a bottom-up approach. The index we are going to create should have a few very simple responsibilities, as listed below -

  • It needs to store its contents to disk and retrieve them.
  • It needs to be able to clear itself when we decide to regenerate things.
  • It should validate the documents that it is storing.

With these tasks defined, let us put the following interface in place -

Listing 8: The interface –

        interface iindex {
                 public function storeDocuments($name,array $documents);
                 public function getDocuments($name);
                 public function clearIndex();
                 public function validateDocument(array $document);
         }
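To show how this interface might be filled in, here is a small file-backed sketch. The class name `FileIndex`, the one-file-per-name layout and the validation rule (a document is a non-empty array of integers, such as a document id and a count) are all assumptions made for this example; the article does not specify them. The interface is repeated only so that the snippet runs on its own.

```php
<?php
interface iindex {
    public function storeDocuments($name, array $documents);
    public function getDocuments($name);
    public function clearIndex();
    public function validateDocument(array $document);
}

// Hypothetical implementation: one serialized file per index entry name.
class FileIndex implements iindex
{
    private $dir;

    public function __construct(string $dir)
    {
        $this->dir = rtrim($dir, '/');
        if (!is_dir($this->dir)) {
            mkdir($this->dir, 0777, true);
        }
    }

    private function path($name): string
    {
        return $this->dir . '/' . md5($name) . '.idx';
    }

    public function storeDocuments($name, array $documents)
    {
        foreach ($documents as $document) {
            if (!$this->validateDocument($document)) {
                throw new InvalidArgumentException('Invalid document for ' . $name);
            }
        }
        return file_put_contents($this->path($name), serialize($documents)) !== false;
    }

    public function getDocuments($name)
    {
        $path = $this->path($name);
        return is_file($path) ? unserialize(file_get_contents($path)) : [];
    }

    public function clearIndex()
    {
        foreach (glob($this->dir . '/*.idx') ?: [] as $file) {
            unlink($file);
        }
    }

    // Assumed rule: a document is a non-empty array of integers.
    public function validateDocument(array $document)
    {
        if ($document === []) {
            return false;
        }
        foreach ($document as $value) {
            if (!is_int($value)) {
                return false;
            }
        }
        return true;
    }
}
```

Because the rest of the code only depends on `iindex`, this class could later be swapped for a database-backed or in-memory index without touching the indexer or the search.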

THE DOCUMENT STORE

The document store is a somewhat odd component, because the things we index are often already stored somewhere else. The most obvious case is when the documents already live in some database.

THE INDEXER

The next step in our approach to build our search is to create the indexer. An indexer takes a document, breaks it apart and feeds it into the index, and also possibly to the document store depending upon our implementation.
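As a rough illustration of "breaking a document apart", the sketch below turns one crawled document into a list of terms with their occurrence counts. It assumes the document has the `array($url, $content)` layout written by the crawler; the function name `indexDocument` is made up, and the tag stripping is deliberately naive.

```php
<?php
// Turn one crawled document into term => number of occurrences.
function indexDocument(array $document): array
{
    [$url, $content] = $document; // same layout the crawler stores

    // Replace tags with spaces so that words in neighbouring elements stay separate.
    $text = strtolower(preg_replace('/<[^>]+>/', ' ', $content));
    $words = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $counts = array_count_values($words);
    arsort($counts); // most frequent terms first
    return $counts;
}

$counts = indexDocument(['http://example.com/', '<h1>Air</h1><p>Clean air and water</p>']);
print_r($counts);
```

Each term and its count could then be handed to the index, keyed by the term.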

INDEXING

Now that we have the ability to store and index some documents, let us go through the steps needed to put the indexing in place -

  • The first thing to do is to set the time limit to unlimited (in PHP, set_time_limit(0)), since the indexing process might take longer than expected.
  • Our next step is to define where the index and the documents are going to live, in order to avoid errors later on.

SEARCHING

Searching requires a relatively simple approach. In fact we only require a single method as shown below -

Listing 9: The search interface –

        interface isearch {
                 public function dosearch($searchterms);
         }

Of course, the actual implementation is not that easy. It is rather more complex than it appears.
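To give a feel for what sits behind that single method, here is a deliberately simple sketch that ranks documents by how often the search terms occur in them. The class name `SimpleSearch` and the in-memory index layout (term => document id => count) are assumptions for this example, and the scoring is far cruder than anything a real search engine uses. The interface is repeated only so that the snippet runs on its own.

```php
<?php
interface isearch {
    public function dosearch($searchterms);
}

// Hypothetical searcher over an in-memory index: term => [document id => count].
class SimpleSearch implements isearch
{
    private $index;

    public function __construct(array $index)
    {
        $this->index = $index;
    }

    // Returns the ids of documents matching at least one term, best score first.
    public function dosearch($searchterms)
    {
        $terms = preg_split('/[^a-z0-9]+/', strtolower($searchterms), -1, PREG_SPLIT_NO_EMPTY);
        $scores = [];
        foreach (array_unique($terms) as $term) {
            foreach ($this->index[$term] ?? [] as $docId => $count) {
                $scores[$docId] = ($scores[$docId] ?? 0) + $count;
            }
        }
        arsort($scores); // highest score first
        return array_keys($scores);
    }
}

// Sample index (invented data).
$search = new SimpleSearch([
    'air'       => [1 => 2, 3 => 1],
    'pollution' => [1 => 1, 2 => 4],
]);
$results = $search->dosearch('air pollution');
```

Here document 2 comes first because "pollution" occurs four times in it, followed by document 1 and then document 3.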

Conclusion

In this document I have tried to cover the different areas of search engines and their features. I have also discussed how to create our own search engine with the help of PHP and MySQL. Let us conclude this article with the following points -

  • A search engine is a powerful and useful tool in today's Internet world.
  • A search engine is based on several complex mathematical formulae which are used to generate the search results.
  • The results obtained for a specific query are then displayed on the SERP, or Search Engine Results Page.

I hope that by now you have a good idea of how a search engine works and how to build a simple one with PHP and MySQL, and that you have enjoyed reading this document.



Website: www.techalpine.com Have 16 years of experience as a technical architect and software consultant in enterprise application and product development. Have interest in new technology and innovation area along with technical...
