Bilkent University

Department of Computer Engineering

 

Senior Project

Computer Science Resource Portal

 

By

 

Süleyman ÇETİNTAŞ

Akif BOYNUEĞRİ

Hakan YILMAZ

Ozan Özcan DOLU

 

Supervisor:

Özgür ULUSOY

 

Final Report

May 5, 2006

 

This report is submitted to the Department of Computer Engineering of Bilkent University in partial fulfillment of the requirements of the Senior Projects course CS492

 

 

 

Content

 

1      Introduction. 2

2      Document Collecting. 3

2.1       Rule-based Focused Crawler Overview: 3

2.2       Extensions for Rule Based Focused Crawler 5

2.2.1        Bug Fixing. 5

2.2.2        Sub Document Classification. 6

3      Processing Document Collection. 11

3.1       Classification of Web Pages. 11

3.2       Processing Data for Basic Keyword Searching. 13

3.3       Processing Data for Advanced Search with Information Extraction. 15

3.3.1        Information Extraction with SRV.. 16

3.3.2        Processing SRV Output 20

4      Querying the Portal 22

4.1       Keyword-based Querying. 22

4.2       Advanced Querying. 22

5      Conclusion. 23

6      Acknowledgement 24

7      References. 24

8      Users Manual 25

8.1       Simple Search: 25

8.2       Advanced Search. 26

 

 

1         Introduction

 

Computer Science Resource Portal aims to provide students, researchers and instructors to access the most up-to-date educational resources from the internet. CSRP offers facilities to conduct advanced search and simple keyword based search on computer science related web pages. 

 

Computer Science Resource Portal consists of three main stages. These stages are; document collecting, processing the document collection, searching and querying of the portal. 

 

·        In the document collecting phase, the entire domain relevant web pages are gathered from the Web. The task of gathering all and only relevant pages of interest (i.e. computer science resources) from only the relevant regions of the Web is accomplished by focused crawling which is introduced by Chakrabarti [1]. However Chakrabarti’s crawler possesses the inability to model tunneling, so a new crawler, namely rule based focused crawler that is recently proposed [3], is used in our system. We have debugged the crawler that we possess for a long & intensive period for some errors that will be mentioned later. And we have improved the focused crawling approach by focusing sub-document parts that will be explained later.

 

·        In the document processing phase, the documents that we gathered in the document collection phase have been prepared in order to be used in the querying phase. For simple search, we have built a document vector and an inverted index out of the documents. For structured search, we have classified the documents into four categories, namely course home pages, department home pages, personal home pages and research pages. A stochastic real valued unit learning algorithm (SRV) [10] has been used in each of these classification categories to extract valuable information [metadata]. Extracted information has been inserted into database tables created respected to classification categories.

 

·        In querying phase, users are able make simple search and advanced search on downloaded documents via a user friendly interface. Simple querying is performed by Full Search (FS) algorithm. Advanced search is performed on the structured data that is stored in database by using database facilities.

 

2         Document Collecting

 

For gathering documents from the Web we used Rule-based Focused crawler that is formerly implemented despite the fact that it could not meet the criteria required by our system. So, we propose sub-document classification feature as an extension for rule-based focused crawler and fix some bugs in the program. When we got the crawler, it was downloading 500-600 pages in average. After fixing the bugs in the code, now we achieved to improve it so that it downloads up to 30000-40000 pages. Fixing bugs for the crawler has been a quite challenging task, because the crawler is a big multithreaded program and debugging the crawler has taken much more time than it had been expected.

 

As our crawler provides online data that has recently been downloaded from the internet, we use another source of offline web documents as our second document collection. A large data-set downloaded from Google Education (‘.edu’) Domain that contains approximately 800.000 documents.  

2.1       Rule-based Focused Crawler Overview:

 

Architecture of the crawler that we used in our document collecting phase is seen on Figure 2.

 

 

Figure 2- Flowchart of Crawler [11]

 

 

 

Current Functionalities of Crawler:

 

1.      Offline-Online Crawling: The crawler downloads web pages from the internet via following the links that are extracted from formerly downloaded pages. The downloading process begins with downloading some seed web-pages and following the links that are extracted from them.

In addition to that, already downloaded web pages are used for crawling simulation which is called ‘offline crawling’. Offline crawling option is used especially for debugging and testing purposes.

 

2.      Multiple Crawling Modes: Crawler also supports multiple crawling modes. These are:

  1. Hard Focus Crawling: Crawler is trained for one topic and it decides whether a page is off-topic or on-topic according to the score that the classifier assigns to the page. Depending on the score given, it discards or saves the page.
  1. Soft Focus Crawling: Crawler is trained on multiple topics and its classifier decides the score of downloaded page. After extracting the links from this page, this score is assigned to the extracted links. Links that have the highest priority is placed first in a priority queue that holds all the extracted links and are downloaded before the ones with lower priorities.
  1. Rule Based Focus Crawling: Links that are extracted from the downloaded pages have a priority determined by some rules that are already defined. Idea is similar with focus crawling but tunneling is also supported in rule based focus crawling which makes it more powerful.

 

3.      Sub-Document Classification: This option is used in soft focus crawling option and rule based focus crawling option. First web page is partitioned into logical sub parts. Instead of assigning same classifier score to whole page; by this option, these sub-parts are classified and links from these sub-parts are put into priority queue with the scores of their respected sub-parts. So the focus is not on the whole page, but on the sub-parts of a page with this approach.

 

4.      Politeness: Some web servers automatically disallow IP’s those send lots of http request at a certain period of time. In order to prevent crawler to disturb web servers that causes server to ban the crawler’s IP, politeness is used. Politeness guarantees that a second request sent to a web server will be after a certain amount of time of the first request that is sent to the same server.

 

5.      IP Restriction: By using IP restriction, only the pages that are in a given IP range are downloaded. This will allow crawler to download pages from one host.

 

6.      Resuming Crawling: If crawler stops for any reason, like some problems on server, program is enabled to continue crawling by maintaining last state.

2.2       Extensions for Rule Based Focused Crawler

2.2.1       Bug Fixing

The crawler that we used in our system had some bugs initially. To be able to use it in our system, we had to make a long and intensive debugging. The following is a short summary of the bugs that we encountered and how they are fixed.

 

Database sizes: We reduced the high space requirement of the databases that hold the links that are extracted from the downloaded pages. By reducing the size of database nodes for each URL, size of the databases holding these URLs is reduced by the same factor. Also, instead of allocating fixed space for each database node dynamic memory allocation can be done but this is implemented yet.

 

Links in Frame Tags: Previous version of program was not extracting the links in the “frame” tag. We modified the HTML parser in the program to catch these links too. Some index pages only contain frames, so classification results of these empty pages would be low. Thus the links in a page that contains frames are put into priority queue with its grandparent’s score to avoid this situation.

 

Excessive Header Size: If header size of requested URL is more than header size limit, program used to terminate itself with segmentation fault signal. We have avoided this situation that is frequently encountered by putting a threshold value on the size of header.

This change remarkably increased the number of pages that are downloaded at a time.

 

Program stops crawling: After conducting some intensive tests on the crawler, we observed that crawler stops to download pages after it downloads a number of pages.  We had initially thought that the threads that are responsible for the crawling task fall into a deadlock situation or stuck an in infinite loop. However after some tests, we have discovered that this situation occurs as a result of the blocking “read” system call. This minor error will be got rid of in a short period of time. And we think that this is the only error that is left in the crawler, as out of the last 9 run sessions we could only observe the same error. Therefore the debugging process of the crawler is nearly complete.

2.2.2       Sub Document Classification

We have improved the focusing capability of the rule based focused crawler that we posses, by introducing a sub-document classification algorithm and embedding it into the classification phase of the crawler. Subdocument classification procedure is explained in this section.

 

WWW has many web pages which are of many types. In our approach only HTML based web pages are taken into consideration. The tags that are used to divide the web pages into subparts are introduced in Table 1 and explained in the following sections.

 

Exp.

HTML Tag

Tag Explanation

 

 

 

 

 

Group of Tags for Lists

 

 

 

<UL>

Unordered List:

Used as <UL> <LI>* </UL>.

 

<OL>

Ordered List:

Used as <OL> <LI>* </OL>.

 

<DIR>

Directory List:

Used as <DIR> <LI>* </DIR>.

 

<DL><DT><DD>

Definition List [<DL>],

Definition Term [<DT>],

Definition Definition [<DD>]:

Used as <DL>(<DT><DD>)*</DL>

<MENU>

Menu item list:

Used as: <MENU>(<LI>)*</MENU>

 

 

Tag for Table

 

 

<TABLE>

Tables:

Used as

<TABLE><TBODY>

(<TR><TD></TD></TR>)*

</TBODY></TABLE>

Paragraph Tag

 

 

 

<P>

Paragraph:

Used as: <P></P>

 

Table 1: HTML tags that are utilized in dividing web pages into subparts

 

            Utilization of List Tags:

            The group of tags for lists is a feasibly appropriate group to be utilized for dividing a page into contextually coherent subparts. Since a ‘list’ generally lists ‘related items’, it is normal to think of all the items in a list to be contextually coherent. Therefore in our system, all the lists are taken as a subgroup of a page.

 

Utilization of Paragraph Tag:

            Paragraph tag is also used for the identification of contextually coherent subparts of web page. Since a paragraph consists of a set of related sentences that are usually of the same topic, it’s a feasible idea to take them as a subgroup of the page.

            Although utilization of the paragraph tag seems to be perfect in theory, it’s not in practice. The HTML code generation programs tend to put paragraph tag for many purposes out of the way it’s thought here. The usage of a threshold for the numbers of words in the ‘paragraph’ works to get rid of all of these usages of paragraph tag. Although the value of the threshold may be specified experimentally, it’s feasible to have it as 500. When this threshold value is too small, then very small paragraphs can be taken as a subgroup and classification of these tiny groups is mostly unsuccessful. In the case of a big threshold, the paragraphs that should be taken as subgroups, i.e. the paragraphs whose number of words is enough for a correct classification but is below the threshold, are not identified in a web page, so the utilization becomes low.

 

Utilization of Table Tag:

Another tag that is appropriate to be used in the work of dividing a page into contextually coherent subparts is the table tag. A table normally contains related items in its rows and columns.

However tables are also used for formatting purposes, just to keep unrelated parts of pages together. In these cases since the items that are hold in the rows and columns of the tables are unrelated, they should not be taken in the same subgroup. Therefore it’s crucial to differentiate between these different usages.

To differentiate the different usages of tables, it’s a good way to take every table as a whole web page and try to divide it in this way. In this way even though a table is used only for formatting purposes, since it’s taken as web page and parsed in this way, there will be no problem.

 

 

The Algorithm to Divide a Web Page into Contextually Coherent Subparts

 

// content array holds the web page context

 for each character ch in content

                // process <UL> tag

if ch is the beginning of <UL> | <ul> character series

                then create a new subgroup and catch the content until </UL> | </ul> is seen

 

                // process <OL> tag

else if ch is the beginning of <OL> | <ol> character series

                then create a new subgroup and catch the content until </OL> | </ol> is seen

 

// process <DIR> tag

else if ch is the beginning of <DIR> | <dir> character series

                then create a new subgroup and catch the content until </DIR> | </dir> is seen

 

// process <DL> tag

else if ch is the beginning of <DL> | <dl> character series

                then create a new subgroup and catch the content until </DL> | </dl> is seen

 

// process <MENU> tag

else if ch is the beginning of <MENU> | <menu> character series

                then create a new subgroup and catch the content until </MENU> | </menu> is seen

 

// process <P> tag

else if ch is the beginning of <P> | <p> character series

                then         if number of characters betw (<P>&</P> | <p>&</p>) > MIN_P_TAG_THRESHOLD

then create a new subgroup and catch the content until </P> | </p> is seen

               

 

                // process <TABLE> tag

                // also seek for all other tags in the text of the table

if ch is the beginning of <TABLE> | <table> character series

then   

for each ch  that is not the end of </TABLE> | </table> character series

               

                                // ALSO SEEK FOR OTHER SUBPARTS IN THE TABLES

//  to handle the usage of TABLES FOR FORMATTING PURPOSES

                                               

// process <UL> tag

if ch is the beginning of <UL> | <ul> character series

                                hen create a new subgroup and catch the content until </UL> | </ul> is seen

 

                                                // process <OL> tag

else if ch is the beginning of <OL> | <ol> character series

                                                then create a new subgroup and catch the content until </OL> | </ol> is seen

 

// process <DIR> tag

else if ch is the beginning of <DIR> | <dir> character series

                                                then create a new subgroup and catch the content until </DIR> | </dir> is seen

 

// process <DL> tag

else if ch is the beginning of <DL> | <dl> character series

                                                then create a new subgroup and catch the content until </DL> | </dl> is seen

 

// process <MENU> tag

else if ch is the beginning of <MENU> | <menu> character series

then create a new subgroup and catch the content until </MENU> | </menu> is seen

 

// process <P> tag

else if ch is the beginning of <P> | <p> character series

then         if number of characters betw (<P>&</P> | <p>&</p>) > MIN_P_TAG_THRESHOLD

then create a new subgroup and catch the content until </P> | </p> is seen

 

// the rest of the context of table that is not caught as a subgroup forms the

//   ‘the_rest_of_the_table’ subgroup

else if ‘the_rest_of_the_table’ subgroup is not created

                                                   then create ‘the_rest_of_the_table’ subgroup

                        put ch into ‘the_rest_of_the_table’ subgroup

                 

 

// end of <TABLE> tag processing

 

               

// the rest of the context of table that is not caught as a subgroup forms the

//   ‘the_rest_of_the_page’ subgroup

else if ‘the_rest_of_the_page’ subgroup is not created

        then create ‘the_rest_of_the_page’ subgroup

        put ch into ‘the_rest_of_the_page’ subgroup

 

 

Table 2: The Algorithm to divide a web page into contextually coherent web pages

 

Figure 3: Subpart Division of the homepage of an academician

This example has 6 subparts. 5 of them are shown with red-rectangles and the other 1 subpart is the rest of the whole page without these 5 subparts.