BILKENT UNIVERSITY
DEPARTMENT OF COMPUTER ENGINEERING
CS 491
Senior Project Specifications Report
Members
Akif BOYNUEGRI [20102125]
Hakan YILMAZ [20102300]
Ozan Ozcan DOLU [20102388]
Suleyman CETINTAS [20201015]
Abstract:
The World Wide Web has grown rapidly and now contains billions of pages. In such a large collection, it has become hard to reach, extract, process and use the information of a particular domain. It is therefore important to be able to reach the relevant regions of the web and to apply special processing techniques to the acquired, domain-specific data. Our system is a resource portal for one such domain, namely ‘computer science’, where a student can access educational resources, a researcher can find links related to his/her research interests, and an instructor can keep up with the most up-to-date educational resources from numerous universities.
To provide these capabilities, our system will employ a technique called “rule-based focused crawling”. With this technique, only the relevant regions of the WWW are dealt with. Ignoring the non-relevant regions saves hardware resources and time during the crawling and information extraction stages. After this domain-specific data is acquired, information extraction methods will be applied to it so that the system can offer ‘browsing/keyword-searching’ and ‘querying’ utilities. Overall, our system is composed of 3 stages:
1. Building the Document Collection:
In this stage, all domain-relevant web pages will be gathered from only the relevant regions of the WWW by means of “focused crawling”. The specific focused crawling technique employed in the system is “rule-based focused crawling”, which was introduced by Prof. Ozgur Ulusoy. After the document collection has been gathered and stored, the information extraction process, explained in the next section, will be applied to process the data effectively.
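As a rough illustration of the general idea of focused crawling, the sketch below crawls a small in-memory link graph and follows links only from pages it judges relevant. It is not the rule-based technique named above: the relevance score here is a simple share of domain terms, and `TOY_WEB`, `DOMAIN_TERMS`, `relevance`, `focused_crawl` and the threshold value are all hypothetical names and values chosen for this example.

```python
import heapq

# Hypothetical in-memory "web": page id -> (page text, outgoing links).
# A real crawler would fetch pages over HTTP instead.
TOY_WEB = {
    "cs-home": ("computer science algorithms course", ["recipes", "ds-notes"]),
    "recipes": ("cooking recipes and travel", ["more-recipes"]),
    "ds-notes": ("data structures and algorithms lecture notes", ["os-syllabus"]),
    "more-recipes": ("more cooking", []),
    "os-syllabus": ("operating systems computer science syllabus", []),
}

# Assumed vocabulary that stands in for the 'computer science' domain.
DOMAIN_TERMS = {"computer", "science", "algorithms", "data",
                "structures", "operating", "systems"}


def relevance(text):
    """Fraction of the words in the page that are domain terms."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in DOMAIN_TERMS for word in words) / len(words)


def focused_crawl(seeds, web, threshold=0.3, max_pages=100):
    """Collect relevant pages, expanding links only from relevant pages."""
    # heapq is a min-heap, so priorities are stored negated:
    # links found on more relevant pages are visited first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    collected = []
    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue
        text, links = web[url]
        score = relevance(text)
        if score < threshold:
            continue  # irrelevant page: neither stored nor expanded
        collected.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return collected


pages = focused_crawl(["cs-home"], TOY_WEB)
```

In this toy graph the crawler never visits "more-recipes", because the only page linking to it is judged irrelevant. That is the saving in crawling time and storage that the stage description refers to.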
2. Information Extraction:
In this stage, all the data acquired and stored by the rule-based focused crawler is processed and made ready for keyword searching and querying. First, the data in the stored pages is parsed and the valuable information [metadata] is extracted. This metadata is then prepared for keyword searching and querying by appropriate methods, which will be explained in later steps of the design process.
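The extraction methods themselves are left to later design steps, so the following is only a minimal sketch of what parsing a stored page for metadata could look like. It assumes that the metadata of interest is the page title and the named `<meta>` tags, and it uses Python's standard `html.parser` module. `MetadataParser` and `extract_metadata` are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects the <title> text and named <meta> tags of one page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content:
                self.metadata[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            # Title text may arrive in several chunks; join them.
            self.metadata["title"] = self.metadata.get("title", "") + data.strip()


def extract_metadata(html):
    """Return a dict of metadata found in the given HTML string."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.metadata


sample_page = (
    '<html><head><title>CS 101</title>'
    '<meta name="keywords" content="algorithms, data structures">'
    '</head><body>Welcome</body></html>'
)
metadata = extract_metadata(sample_page)
```

The resulting dictionary can then be stored next to the page and used as input for the keyword searching and querying stage.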
3. Searching and Querying the Portal:
Once pages about computer science have been retrieved and prepared, users will be able to access the system in three different ways:
• Browsing/Keyword-Searching:
Users will be able to perform keyword searches by means of the usual Web searching methods.
• Querying:
Users will have flexible querying options through a user-friendly graphical user interface. Results will be presented in ranked order, since queries will rarely have exact answers.
• Web Services:
The system will include web services that web clients can request, so the system will act as the server for these services. This facility will enhance the usability of our system.
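To make the idea of ranked results for keyword searches more concrete, the sketch below builds a small inverted index and orders documents by how often the query terms occur in them. It is only one possible scoring scheme, assumed for illustration. `build_index`, `search` and the sample documents are hypothetical and are not part of the system design.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Map each word to a Counter of {doc_id: occurrences}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word][doc_id] += 1
    return index


def search(index, query):
    """Return (doc_id, score) pairs, best match first.

    The score is the total number of occurrences of the query terms,
    so partial matches are still returned, only ranked lower.
    """
    scores = Counter()
    for term in query.lower().split():
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Sort by descending score, then by document id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical document collection.
docs = {
    "d1": "sorting algorithms and graph algorithms",
    "d2": "operating systems lecture",
    "d3": "algorithms for operating systems",
}
index = build_index(docs)
results = search(index, "algorithms systems")
```

Here no document is an exact answer to the query, yet all three are returned in ranked order, which matches the behaviour described for the querying interface.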