How to create a Web Crawler and storing data using Java

In this article we will see how to build a program that crawls a webpage and uses the result to obtain statistics about the page.


Introduction:

The internet has become a basic necessity, and life without it is now hard to imagine. With its help, a person can find a huge amount of information on almost any topic, usually through a search engine: the user enters a keyword, or sometimes a longer string, in the search field and receives the related information. The links to different web pages appear as a ranked list generated by the necessary processing in the system. This ranking is possible because of the indexing performed inside the system, which allows it to show relevant results containing exactly the information the user asked for. The user then clicks a relevant link in the ranked list and navigates through the corresponding webpage. Similarly, it is sometimes necessary to extract the text of a webpage with a parser, and many HTML parsers are available for obtaining the page content as plain text. Once the tags have been removed from a webpage, some further processing is still needed to index the words, so that we can learn which words appear in the page and what data it contains.

Description:

Indexing tells us which words occur in a webpage and how important they are, which in turn helps in identifying the topic-specific information. Web crawling is useful because the information related to a webpage can be saved and processed, including the links the page points to. The user benefits from this kind of information about related links and gets a picture of the whole topic. Crawling also makes it possible to build documents quickly from the large amount of qualitative information available on the World Wide Web (WWW), and the web crawler maintains the information gathered from the internet.

Overview:

This application indexes the words found in a webpage and stores them in a table in a database. The words are counted and the counts are stored as well, so that the table can later be queried to find how many times specific words appeared in the webpage, which helps in understanding its topic. After the webpage has been crawled successfully, stop words and unnecessary characters are removed from its text and are not stored in the table; a small sketch of the stop-word step follows.
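The listings later in the article focus on character cleanup and word counting, so the stop-word step itself is not shown there. As an illustration only, a minimal sketch of stop-word filtering might look like this; the class name, the short hard-coded word list, and the keep method are assumptions made for the example, not part of the original application:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopWordFilter {

    // A tiny illustrative stop-word list; a real application would load a fuller one.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "and", "or", "of", "to", "in", "is", "it", "for", "on"));

    // Returns true when the token should be kept (non-empty and not a stop word).
    public static boolean keep(String token) {
        return !token.isEmpty() && !STOP_WORDS.contains(token.toLowerCase());
    }

    public static void main(String[] args) {
        String[] tokens = { "the", "crawler", "stores", "the", "words" };
        for (String t : tokens) {
            if (keep(t)) {
                System.out.println(t); // prints: crawler, stores, words
            }
        }
    }
}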

The jsoup library for Java is used to download the webpage as HTML, parse that HTML, and finally obtain the text of the full webpage. The library makes data extraction and manipulation very convenient and works well with real-world web pages. The extracted text appears in a text area on the front end of the Java application.
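For readers who want to try the library outside the GUI application, a minimal, self-contained sketch of the jsoup calls used later (Jsoup.connect, get, html, text) could look like the following; the example URL, the user-agent string, and the timeout value are placeholders added only for illustration:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PageTextFetcher {
    public static void main(String[] args) throws Exception {
        String url = "https://example.com/"; // placeholder URL
        // Fetch and parse the page; userAgent and timeout are optional extras.
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (crawler demo)")
                .timeout(10_000)
                .get();
        System.out.println(doc.html()); // full HTML, as shown in the first text area
        System.out.println(doc.text()); // plain text with the tags stripped
    }
}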

Screenshots:

The following screenshots show the application and the data present in the database before and after the operations are performed, that is, the changes applied to the back-end database from the front end. The interfaces are shown one by one.


Figure 1: Application and Data present in the Database

This is the front end of the application; each button performs the operation corresponding to the required functionality. The webpage is first fetched from the internet, and then the processing is applied to the text after the HTML tags have been removed from the page.


Figure 2: Front-end Page

This interface shows the functionality performed when the respective button is pressed to get the desired results.


Figure 3: Functionality Performed

This is the output from the database system after the data has been inserted into the Words_Count_and_Webpage_URL table.


Figure 4: Database System
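The article does not show the definition of the Words_Count_and_Webpage_URL table. Judging from the INSERT statement in Listing 2, which writes the page URL, a word, and its count, a plausible Derby schema might be created as sketched below; the column names and types are assumptions and the real table may differ:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateWordsTable {
    public static void main(String[] args) throws Exception {
        // Connection details follow Listing 2; adjust them for your own Derby instance.
        String connstr = "jdbc:derby://localhost:1527/Management";
        try (Connection co = DriverManager.getConnection(connstr, "use", "system");
             Statement st = co.createStatement()) {
            // Assumed column names and types, inferred from the insert in Listing 2.
            st.executeUpdate(
                "CREATE TABLE USE.WORDS_COUNT_AND_WEBPAGE_URL ("
                + "WEBPAGE_URL VARCHAR(500), "
                + "WORD VARCHAR(200), "
                + "WORD_COUNT INT)");
        }
    }
}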

Implementation & Code:

Below we discuss the logic behind the operations described above.

Listing 1: Get HTML of Entered Webpage URL Button Code:

private void jButton2ActionPerformed(java.awt.event.ActionEvent evt) {                                         
        // Read the URL the user typed into the text field
        String webpage_link = jTextField1.getText();
        try {
            // Fetch the page with jsoup; d holds the parsed document
            Document d = Jsoup.connect(webpage_link).get();
            // Show the page HTML in the first text area
            jTextArea1.append(d.toString());
        } catch (Exception e) {
            // Report the failure instead of silently discarding it
            e.printStackTrace();
        }
    }

In the code above we use the jsoup library to connect to the webpage, which returns the HTML of the provided page. jsoup makes the extraction process really easy.

Listing 2: Get Webpage Text Without HTML Tags and Store Data in Database Button Code:

private void jButton3ActionPerformed(java.awt.event.ActionEvent evt) {                                         
        // Take the HTML fetched by the previous button and strip the tags
        String gethtml = jTextArea1.getText();
        String get_form_of_text = Jsoup.parse(gethtml).text();
        jTextArea2.append(get_form_of_text);
        try {        
            String form_of_words = jTextArea2.getText();                
            // Characters that would otherwise stick to the words
            String[] stripChars = { ":", ";", ",", ".", "-", "_", "^", "~", "(", ")", "[", "]", "'", 
                                    "?", "|", ">", "<", "!", "\"", "{", "}", "/", "*", "&", "+", "$", "@", "%", "`", "#", "=" };
            for (String removingtt : stripChars) {
                form_of_words = form_of_words.replace(removingtt, ""); // remove unnecessary characters
            }            
            // Split the cleaned text into words
            String[] dataarr = form_of_words.split(" ");
            // Count how many times each word appears
            HashMap<String, Integer> freq = new HashMap<String, Integer>();            
            for (String ww : dataarr) {
                Integer number1 = freq.get(ww);
                if (number1 != null) {
                    freq.put(ww, number1 + 1);
                } else {
                    freq.put(ww, 1);
                }
            }
            // Copy the distinct words and their counts into parallel arrays
            // and show them in the third text area
            String[] arr_words = new String[freq.size()];
            int[] arr_count_words = new int[freq.size()];
            int rr = 0;
            for (Entry<String, Integer> ent : freq.entrySet()) {                
                arr_words[rr] = ent.getKey();
                arr_count_words[rr] = ent.getValue();
                jTextArea3.append(arr_words[rr] + "\t" + arr_count_words[rr] + "\n");
                rr++;
            }                        
            // Store one row per distinct word in the Derby database
            String webp = jTextField1.getText();
            Class.forName("org.apache.derby.jdbc.ClientDriver");
            String connstr = "jdbc:derby://localhost:1527/Management";            
            Connection co = DriverManager.getConnection(connstr, "use", "system");
            Statement st = co.createStatement();
            for (int k = 0; k < arr_words.length; k++) {    
                String in = "insert into USE.WORDS_COUNT_AND_WEBPAGE_URL values ('" + webp + "', '" + arr_words[k] + "', " + arr_count_words[k] + ")";
                st.executeUpdate(in);
            }
            st.close();
            co.close();
            jLabel6.setVisible(true);
        } catch (Exception ex) {
            // Report the failure instead of silently discarding it
            ex.printStackTrace();
        }
    }     

First of all we take the output of the first listing and use the jsoup library to remove the HTML tags from the obtained source, so that we are left with plain text.

We then remove some unnecessary characters that may have been introduced while converting from HTML to text, and we split the text on the space delimiter. This helps us obtain the word count. We also store the words in an array for further processing.
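For comparison, the same split-and-count step can be written more compactly with HashMap.merge (Java 8 and later); this is an alternative sketch, not the article's original code:

import java.util.HashMap;
import java.util.Map;

public class WordCounter {
    // Counts how often each word appears in the given plain text.
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty()) {
                freq.merge(word, 1, Integer::sum); // increment, or start at 1
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(count("web crawler counts web words"));
        // e.g. {web=2, crawler=1, counts=1, words=1} (order may vary)
    }
}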

Now we compute the frequency of each word stored in the array by using iteration and counter variables, and then we store this information in the database.
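Listing 2 builds the INSERT statement by concatenating strings, which is fragile (a word containing a quote would break the SQL). A safer JDBC alternative is a PreparedStatement; the sketch below reuses the connection details and the column order implied by Listing 2, while the class and method names are invented for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

public class WordStore {
    // Inserts one row per distinct word: (page URL, word, count).
    public static void save(String url, Map<String, Integer> freq) throws Exception {
        String connstr = "jdbc:derby://localhost:1527/Management";
        String sql = "insert into USE.WORDS_COUNT_AND_WEBPAGE_URL values (?, ?, ?)";
        try (Connection co = DriverManager.getConnection(connstr, "use", "system");
             PreparedStatement ps = co.prepareStatement(sql)) {
            for (Map.Entry<String, Integer> e : freq.entrySet()) {
                ps.setString(1, url);         // webpage URL
                ps.setString(2, e.getKey());  // word
                ps.setInt(3, e.getValue());   // how many times it appeared
                ps.executeUpdate();
            }
        }
    }
}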

So, we have learned how to crawl web pages using the jsoup library and how to further process the obtained data according to our needs. If you have any questions, please let me know.

Hope you liked it. See you next time.



My main area of specialization is Java and J2EE. I have worked on many international projects such as recorders, websites, crawlers, etc. I am also an Oracle Certified Java Professional as well as DB2 certified.
