
Bilkent University
Department of Computer Engineering
Senior Project
Computer Science Resource Portal
By
Süleyman
ÇETİNTAŞ
Akif BOYNUEĞRİ
Hakan YILMAZ
Ozan Özcan DOLU
Supervisor:
Özgür ULUSOY
Final Report
This
report is submitted to the Department of Computer Engineering of Bilkent
University in partial fulfillment of the requirements of the Senior Projects
course CS492
Content
2.1 Rule-based Focused Crawler Overview:
2.2 Extensions for Rule Based Focused Crawler
2.2.2 Sub Document Classification
3 Processing Document Collection
3.1 Classification of Web Pages
3.2 Processing Data for Basic Keyword Searching
3.3 Processing Data for Advanced Search with Information
Extraction
3.3.1 Information Extraction with SRV
Computer Science Resource Portal aims to provide students, researchers and instructors to access the most up-to-date educational resources from the internet. CSRP offers facilities to conduct advanced search and simple keyword based search on computer science related web pages.
Computer Science Resource Portal consists of three main stages. These stages are; document collecting, processing the document collection, searching and querying of the portal.
· In the document collecting phase, the entire domain relevant web pages are gathered from the Web. The task of gathering all and only relevant pages of interest (i.e. computer science resources) from only the relevant regions of the Web is accomplished by focused crawling which is introduced by Chakrabarti [1]. However Chakrabarti’s crawler possesses the inability to model tunneling, so a new crawler, namely rule based focused crawler that is recently proposed [3], is used in our system. We have debugged the crawler that we possess for a long & intensive period for some errors that will be mentioned later. And we have improved the focused crawling approach by focusing sub-document parts that will be explained later.
· In the document processing phase, the documents that we gathered in the document collection phase have been prepared in order to be used in the querying phase. For simple search, we have built a document vector and an inverted index out of the documents. For structured search, we have classified the documents into four categories, namely course home pages, department home pages, personal home pages and research pages. A stochastic real valued unit learning algorithm (SRV) [10] has been used in each of these classification categories to extract valuable information [metadata]. Extracted information has been inserted into database tables created respected to classification categories.
· In querying phase, users are able make simple search and advanced search on downloaded documents via a user friendly interface. Simple querying is performed by Full Search (FS) algorithm. Advanced search is performed on the structured data that is stored in database by using database facilities.
For gathering documents from the Web we used Rule-based Focused
crawler that is formerly implemented despite the fact that it could not meet
the criteria required by our system. So, we propose sub-document classification
feature as an extension for rule-based focused crawler and fix some bugs in the
program. When we got the crawler, it was downloading 500-600 pages in average.
After fixing the bugs in the code, now we achieved to improve it so that it
downloads up to 30000-40000 pages. Fixing bugs for the crawler has been a quite
challenging task, because the crawler is a big multithreaded program and
debugging the crawler has taken much more time than it had been expected.
As our crawler provides online data
that has recently been downloaded from the internet, we use another source of offline
web documents as our second document collection. A large data-set downloaded
from Google Education (‘.edu’) Domain that contains approximately 800.000
documents.
Architecture of the crawler that we used in our document collecting phase is seen on Figure 2.

Figure 2- Flowchart of Crawler [11]
Current Functionalities of Crawler:
1. Offline-Online Crawling: The crawler downloads
web pages from the internet via following the links that are extracted from formerly
downloaded pages. The downloading process begins with downloading some seed
web-pages and following the links that are extracted from them.
In addition to that, already downloaded web
pages are used for crawling simulation which is called ‘offline crawling’.
Offline crawling option is used especially for debugging and testing purposes.
2. Multiple Crawling Modes: Crawler also
supports multiple crawling modes. These are:
3. Sub-Document Classification: This
option is used in soft focus crawling option and rule based focus crawling
option. First web page is partitioned into logical sub parts. Instead of
assigning same classifier score to whole page; by this option, these sub-parts
are classified and links from these sub-parts are put into priority queue with
the scores of their respected sub-parts. So the focus is not on the whole page,
but on the sub-parts of a page with this approach.
4. Politeness: Some web servers automatically
disallow IP’s those send lots of http request at a certain period of time. In
order to prevent crawler to disturb web servers that causes server to ban the
crawler’s IP, politeness is used. Politeness guarantees that a second request
sent to a web server will be after a certain amount of time of the first request
that is sent to the same server.
5. IP Restriction: By using IP
restriction, only the pages that are in a given IP range are downloaded. This
will allow crawler to download pages from one host.
6. Resuming Crawling: If crawler stops for
any reason, like some problems on server, program is enabled to continue
crawling by maintaining last state.
The crawler that we used in our system had some bugs initially. To be able to use it in our system, we had to make a long and intensive debugging. The following is a short summary of the bugs that we encountered and how they are fixed.
Database sizes: We reduced the high space requirement of the databases that hold the links that are extracted from the downloaded pages. By reducing the size of database nodes for each URL, size of the databases holding these URLs is reduced by the same factor. Also, instead of allocating fixed space for each database node dynamic memory allocation can be done but this is implemented yet.
Links in Frame Tags: Previous version of program was not extracting the links in the “frame” tag. We modified the HTML parser in the program to catch these links too. Some index pages only contain frames, so classification results of these empty pages would be low. Thus the links in a page that contains frames are put into priority queue with its grandparent’s score to avoid this situation.
Excessive Header Size: If header size of requested URL is more than header size limit, program used to terminate itself with segmentation fault signal. We have avoided this situation that is frequently encountered by putting a threshold value on the size of header.
This change remarkably increased the number of pages that are downloaded at a time.
Program stops crawling: After conducting some intensive tests on the crawler, we observed that crawler stops to download pages after it downloads a number of pages. We had initially thought that the threads that are responsible for the crawling task fall into a deadlock situation or stuck an in infinite loop. However after some tests, we have discovered that this situation occurs as a result of the blocking “read” system call. This minor error will be got rid of in a short period of time. And we think that this is the only error that is left in the crawler, as out of the last 9 run sessions we could only observe the same error. Therefore the debugging process of the crawler is nearly complete.
We have improved the focusing capability of the rule based focused crawler that we posses, by introducing a sub-document classification algorithm and embedding it into the classification phase of the crawler. Subdocument classification procedure is explained in this section.
WWW has many web pages which are of many types. In our approach only HTML based web pages are taken into consideration. The tags that are used to divide the web pages into subparts are introduced in Table 1 and explained in the following sections.
|
Exp. |
HTML Tag |
Tag Explanation |
|
Group of Tags for Lists |
<UL> |
Unordered List: Used as <UL> <LI>* </UL>. |
|
<OL> |
Ordered List: Used as <OL> <LI>* </OL>. |
|
|
<DIR> |
Directory List: Used as <DIR> <LI>* </DIR>. |
|
|
<DL><DT><DD> |
Definition List [<DL>], Definition Term [<DT>], Definition Definition [<DD>]: Used as <DL>(<DT><DD>)*</DL> |
|
|
<MENU> |
Menu item list: Used as: <MENU>(<LI>)*</MENU> |
|
|
Tag for Table |
<TABLE> |
Tables: Used as <TABLE><TBODY> (<TR><TD></TD></TR>)* </TBODY></TABLE> |
|
Paragraph Tag |
<P> |
Paragraph: Used as: <P></P> |
Table 1: HTML tags that are utilized in dividing web pages into
subparts
The group of tags for lists is a feasibly appropriate group to be utilized for dividing a page into contextually coherent subparts. Since a ‘list’ generally lists ‘related items’, it is normal to think of all the items in a list to be contextually coherent. Therefore in our system, all the lists are taken as a subgroup of a page.
Paragraph tag is also used for the identification of contextually coherent subparts of web page. Since a paragraph consists of a set of related sentences that are usually of the same topic, it’s a feasible idea to take them as a subgroup of the page.
Although utilization of the paragraph tag seems to be perfect in theory, it’s not in practice. The HTML code generation programs tend to put paragraph tag for many purposes out of the way it’s thought here. The usage of a threshold for the numbers of words in the ‘paragraph’ works to get rid of all of these usages of paragraph tag. Although the value of the threshold may be specified experimentally, it’s feasible to have it as 500. When this threshold value is too small, then very small paragraphs can be taken as a subgroup and classification of these tiny groups is mostly unsuccessful. In the case of a big threshold, the paragraphs that should be taken as subgroups, i.e. the paragraphs whose number of words is enough for a correct classification but is below the threshold, are not identified in a web page, so the utilization becomes low.
Utilization of Table Tag:
Another tag that is appropriate to be used in the work of dividing a page into contextually coherent subparts is the table tag. A table normally contains related items in its rows and columns.
However tables are also used for formatting purposes, just to keep unrelated parts of pages together. In these cases since the items that are hold in the rows and columns of the tables are unrelated, they should not be taken in the same subgroup. Therefore it’s crucial to differentiate between these different usages.
To differentiate the different usages of tables, it’s a good way to take every table as a whole web page and try to divide it in this way. In this way even though a table is used only for formatting purposes, since it’s taken as web page and parsed in this way, there will be no problem.
The Algorithm to
Divide a Web Page into Contextually Coherent Subparts
|
//
content array holds the web page context for each character ch in content // process <UL> tag if ch is the beginning of <UL> |
<ul> character series then create a new subgroup and
catch the content until </UL> | </ul> is seen // process <OL> tag else if ch is the beginning of <OL> |
<ol> character series then create a new subgroup and
catch the content until </OL> | </ol> is seen // process <DIR> tag else if ch is the beginning of <DIR> |
<dir> character series then create a new subgroup and
catch the content until </DIR> | </dir> is seen // process <DL> tag else if ch is the beginning of <DL> |
<dl> character series then create a new subgroup and
catch the content until </DL> | </dl> is seen // process <MENU> tag else if ch is the beginning of <MENU> |
<menu> character series then create a new subgroup and
catch the content until </MENU> | </menu> is seen // process <P> tag else if ch is the beginning of <P> |
<p> character series then if number of characters betw (<P>&</P> |
<p>&</p>) > MIN_P_TAG_THRESHOLD then create a new
subgroup and catch the content until </P> | </p> is seen // process
<TABLE> tag // also seek for all other
tags in the text of the table if ch is the beginning of <TABLE> |
<table> character series then
for each ch that is not the end of </TABLE> |
</table> character series //
ALSO SEEK FOR OTHER SUBPARTS IN THE TABLES // to handle the usage of TABLES FOR
FORMATTING PURPOSES // process <UL>
tag if ch is the beginning
of <UL> | <ul> character series hen
create a new subgroup and catch the content until </UL> | </ul>
is seen //
process <OL> tag else if ch is the
beginning of <OL> | <ol> character series then
create a new subgroup and catch the content until </OL> | </ol>
is seen // process <DIR>
tag else if ch is the
beginning of <DIR> | <dir> character series then
create a new subgroup and catch the content until </DIR> | </dir>
is seen // process <DL>
tag else if ch is the
beginning of <DL> | <dl> character series then
create a new subgroup and catch the content until </DL> | </dl>
is seen // process
<MENU> tag else if ch is the
beginning of <MENU> | <menu> character series then create a new subgroup and catch the
content until </MENU> | </menu> is seen // process <P>
tag else if ch is the
beginning of <P> | <p> character series then if number of characters betw
(<P>&</P> | <p>&</p>) >
MIN_P_TAG_THRESHOLD then create a new
subgroup and catch the content until </P> | </p> is seen // the rest of the
context of table that is not caught as a subgroup forms the // ‘the_rest_of_the_table’ subgroup else if
‘the_rest_of_the_table’ subgroup is not created
then create ‘the_rest_of_the_table’ subgroup put ch into ‘the_rest_of_the_table’
subgroup
// end of <TABLE> tag processing // the rest of the context of table that is
not caught as a subgroup forms the //
‘the_rest_of_the_page’ subgroup else if ‘the_rest_of_the_page’ subgroup is
not created
then create ‘the_rest_of_the_page’ subgroup
put ch into ‘the_rest_of_the_page’ subgroup |
Table 2: The Algorithm to divide a web page into contextually
coherent web pages

Figure 3: Subpart Division of the
homepage of an academician
This example has 6 subparts. 5 of them are shown with
red-rectangles and the other 1 subpart is the rest of the whole page without
these 5 subparts.