BILKENT UNIVERSITY

DEPARTMENT OF COMPUTER ENGINEERING

CS 491

Senior Project Specifications Report

Members

Akif BOYNUEGRI [20102125]

Hakan YILMAZ [20102300]

Ozan Ozcan DOLU [20102388]

Suleyman CETINTAS [20201015]

Computer Science Portal

Abstract:

 

            The World Wide Web is growing very rapidly and now contains billions of pages. In such a large collection, it has become hard to reach, extract, process and use the information that belongs to a particular domain. It is therefore important to be able to reach the relevant regions of the web and to apply specialized processing techniques to the domain-specific data acquired from them. Our system is a resource portal for one such domain, computer science, where a student can access educational resources, a researcher can find links related to his/her research interests, and an instructor can keep up with the most up-to-date educational resources from numerous universities.

 

            To provide these capabilities, our system will employ a technique called “rule-based focused crawling”. With this technique, only the relevant regions of the WWW are visited. Ignoring the non-relevant regions saves both hardware resources and time during the crawling and information extraction stages. Once the domain-specific data has been acquired, information extraction methods will be applied to it so that the system can offer browsing/keyword-searching and querying utilities. Overall, our system is composed of 3 stages:

 

1.      Building the Document Collection:

In this stage, all domain-relevant web pages will be gathered from only the relevant regions of the WWW by means of a technique called “focused crawling”. The particular variant employed in the system is “rule-based focused crawling”, which was introduced by Prof. Ozgur Ulusoy. After the document collection has been gathered and stored, the information extraction process, explained in the next section, will be applied to process the data.
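
As a rough illustration of the focused-crawling idea, the following Python sketch performs a best-first crawl over a small in-memory link graph. It is not the rule-based technique itself, whose rules are not specified in this report. The toy pages in TOY_WEB, the KEYWORDS vocabulary, the relevance function and the threshold value are all assumptions made purely for illustration.

```python
import heapq

# Hypothetical toy "web": URL -> (page text, outgoing links). Not real data.
TOY_WEB = {
    "seed": ("computer science department courses", ["algo", "sports"]),
    "algo": ("algorithms lecture notes data structures", ["db"]),
    "db": ("database systems course computer science", []),
    "sports": ("football results and transfer news", ["gossip"]),
    "gossip": ("celebrity news", []),
}

# Assumed domain vocabulary, used as a stand-in for a real relevance model.
KEYWORDS = {"computer", "science", "algorithms", "database",
            "course", "courses", "lecture"}


def relevance(text):
    """Fraction of the words on a page that belong to the domain vocabulary."""
    words = text.split()
    if not words:
        return 0.0
    return sum(w in KEYWORDS for w in words) / len(words)


def focused_crawl(seed, threshold=0.3):
    """Best-first crawl that expands links only from pages judged relevant."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    collected = []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB[url]
        score = relevance(text)
        if score < threshold:
            continue  # irrelevant page: neither stored nor expanded
        collected.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return collected
```

In this toy graph the crawler stores the three computer science pages and never fetches "gossip", because the only page linking to it is judged irrelevant. This is the kind of saving in time and resources that the focused approach aims for.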

 

2.      Information Extraction:

In this stage, all the data acquired and stored by the rule-based focused crawler is processed and made ready for keyword searching and querying. First, the stored pages are parsed and the valuable information (metadata) is extracted. This metadata is then prepared for keyword searching and querying by appropriate methods, which will be specified in later steps of the design process.
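
The parse-then-index flow of this stage can be sketched as follows. The report does not yet fix which metadata fields are extracted or how they are indexed, so the choice of the page title and meta tags, and the names MetadataParser, extract_metadata and build_index, are assumptions for illustration. Only Python's standard html.parser module is used.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name, content = attr.get("name"), attr.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Parse one stored page and return its metadata as a dictionary."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), **parser.meta}


def build_index(pages):
    """Map each lower-cased title word to the set of page ids containing it."""
    index = defaultdict(set)
    for page_id, html in pages.items():
        for word in extract_metadata(html)["title"].lower().split():
            index[word].add(page_id)
    return index
```

An inverted index of this kind is one common way to prepare extracted text for keyword searching. A real system would likely index more fields than the title.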

 

3.      Searching and Querying the Portal:

Once pages about computer science have been retrieved and prepared, the system will offer users three different ways of access:

• Browsing/Keyword-Searching:

Users will be able to perform keyword searches in the manner of conventional Web search engines.

• Querying:

Users will have flexible querying options through a user-friendly graphical user interface. Results will be presented in ranked order, since queries will rarely have exact answers.

• Web Services:

The system will include web services that web clients can request, so that the system acts as the server for these services. This facility will enhance the usability of our system.
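
The ranked presentation of query results mentioned under Querying can be illustrated with a minimal scoring sketch. The report does not specify a ranking function, so the score used here (the fraction of query terms that appear in a document) and the function name rank are assumptions chosen only to show partial matches being ordered rather than discarded.

```python
def rank(query, documents):
    """Return (doc_id, score) pairs, best match first.

    The score is the fraction of distinct query terms found in the document.
    Documents that match no term are dropped.
    """
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        score = len(terms & words) / len(terms) if terms else 0.0
        if score > 0:
            scored.append((doc_id, score))
    # Sort by descending score, breaking ties by document id for stability.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored
```

With this scoring, a document that contains every query term is listed before one that contains only some of them, so a user still receives useful results when no document answers the query exactly.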