BILKENT UNIVERSITY
DEPARTMENT OF COMPUTER ENGINEERING
CS 491
Senior Project Analysis Report
Members
Akif BOYNUEGRI [20102125]
Hakan YILMAZ [20102300]
Ozan Ozcan DOLU [20102388]
Suleyman CETINTAS [20201015]
1-
Introduction....................................................................................................................4
2- Current System...............................................................................................................4
2.1) Current Research.....................................................................................................4
2.1.1) Building the Document Collection...................................................................4
2.1.1.1) Two Main Crawler Types..........................................................................4
2.1.1.1.1) General Purpose Crawlers..................................................................4
2.1.1.1.2) Focused Crawlers...............................................................................4
2.1.1.1.2.1) Hard Focused Crawling Approach..............................................5
2.1.1.1.2.2) Soft Focused Crawling Approach...............................................5
2.1.1.2) The Problems A Large Scale Crawler Encounter and Their Solutions.....5
2.1.1.2.1) Problems A Large Scale Crawler Encounters.....................................5
2.1.1.2.1.1) Denial of Service on Web and DNS Hosts..................................5
2.1.1.2.1.2) Ethics – robots.txt........................................................................5
2.1.1.2.1.3) Spider Traps – Pages with Dynamic Content..............................5
2.1.1.2.1.4) Revisiting Same URL..................................................................6
2.1.1.2.1.5) Downloading Same Pages...........................................................6
2.1.1.2.1.6) Pause - Resume Crawler..............................................................6
2.1.1.2.2) Solutions for the Problems A Large Scale Crawler Encounters.........6
2.1.1.2.2.1) Avoiding DOS ............................................................................6
2.1.1.2.2.2) Ethical Considerations.................................................................6
2.1.1.2.2.3) Handling Dynamically Created Pages.........................................6
2.1.1.2.2.4) Avoiding Revisiting......................................................................7
2.1.1.2.2.5) Resuming Crawler........................................................................7
2.1.1.3) Crawlers of Interest....................................................................................7
2.1.1.3.1) Baseline Focused Crawler of Chakrabarti’s……………....................7
2.1.1.3.2) Rule-Based Focused Crawler.............................................................7
2.1.1.4) Comparison of Crawlers of Interest..........................................................8
2.1.1.5) Additions to Current Crawling Systems....................................................9
2.1.1.5.1) Extensions to Rule-Based Focused Crawler.......................................9
2.1.1.5.2) Revisiting the DOS problem.............................................................10
2.1.1.5.2.1) 2-Queue System [Current System] ...........................................10
2.1.1.5.2.2) Using Multi-Queues Instead of Two Queues……………........12
2.1.2) Preparation of Data for Querying…………………………….……………..12
2.1.2.1) Basic Keyword Search with Information Retrieval…………….…....…12
2.1.2.1.1) Inverted File Index Structure…………….…………………….......13
2.1.2.2) Advanced Search with Information Extraction…………….………......13
2.1.2.2.1) Classification of Document Collection with Rainbow Library........13
2.1.2.2.2) Information Extraction on the Classified Document Sets……........15
2.1.2.2.2.1) What is Information Extraction? ………………………..........15
2.1.3) Searching and Querying the Portal……………….……………………........16
2.1.3.1) Browsing / keyword searching: ……………………………...………..16
2.1.3.2) Querying………………………………………………………………..16
2.1.3.3) Web Services:
………………………………………………...………..16
3- Proposed System……………………………………………………………….......17
3.1) Overview…………………………………………...…………………………..17
3.2) Functional Requirements…………………………………………...………….17
3.3) Non-Functional Requirements…………………………………………...…….17
3.4) Pseudo Requirements…………………………………………...……………..18
3.5) Systems Model…………………………………………...…………………....18
3.5.1) Scenarios…………………………………………...…………………..…18
3.5.2) Use Case Model…………………………………...……………………...19
3.5.3) User Interface and Screen Mock-Ups…………...…………………...…...20
3.5.3.1) Simple Search..…………...………………………………………….20
3.5.3.2) Advanced Search..…………...……………………………………....20
3.5.3.2.1) Personal Home Pages.....……………………………………..…21
3.5.3.2.2) Course Pages……….......…………………………………….…22
3.5.3.2.3) Research Group Pages....…………………………………….….23
3.5.3.2.4) Department Page…….....…………………………………….…24
3.5.3.3) Result Page………….…….....……………………………………....25
4 References…….....…………………………………………………..………….….26
1) Introduction
In today's world, the World Wide Web that possesses a tremendous growth trend, has come to have billions of pages. In such a big collection of information, it has become expectably hard to reach, extract, process and utilize the desired information of a particular domain. Therefore, it has become very significant to have the ability of reaching relevant regions of the web and employing special processing techniques on the acquired, domain-specific data. Our system is a resource portal of such a particular domain, namely ‘computer science’ where a student can access educational resources, a researcher can find links related with his/her research interests, an instructor can get in touch with the most up-to-date educational resources from numerous universities.
1) Current System
The project is composed of 3 main stages, namely “building the document collection”, “preparation of data for querying” and “searching and querying the portal”. Firstly, the current researches of interests about these three parts will be mentioned about, and then some extensions to some of these researches will be proposed.
2.1) Current Research
In this part, the researches about all three parts of our project are explained under 3 main titles.
2.1.1) Building the
Document Collection:
In this stage, the entire domain relevant web pages will be gathered from the web. While doing this, only the relevant regions of the web will be dealt with and the ignorance of the non-relevant regions of the web will be a big gain in terms of hardware resources and in terms of time during both crawling and information retrieval & extraction processes of the project.
The task of gathering all and only relevant pages of interest [i.e. computer science resources] from only the relevant regions of the web is accomplished by “focused crawling”, introduced by Chakrabarti. However Chakrabarti’s crawler possesses the inability to model tunneling, so a new crawler, namely “rule based focused crawler” introduced by Prof. Ozgur Ulusoy, will be used in our system. Now, let’s introduce some types of crawlers and the one that will be used in our system:
2.1.1.1) Two Main Crawler Types:
2.1.1.1.1) General Purpose Crawlers:
These kinds of crawlers use the breadth-first strategy during the crawling phase. They have no limitation on extracting URLs from the downloaded pages and no limitation in following these URLs. So with the simply straight forward approach, they follow all the links extracted from a downloaded page using the breadth-first strategy. However, since in our project only the relevant regions of the web will be dealt-with, a general-purpose crawler is not applicable to accomplish this task.
2.1.1.1.2) Focused Crawlers:
In order to deal with only some regions of the web, a crawler should have some limitations on whether to follow some kind of URLs or not. Focused crawlers follow this idea. They have some rules on following the URLs extracted from the downloaded pages, and they decide on following a URL or not, depending on their rules. Let’s introduce two kinds of approaches used in focused crawlers:
2.1.1.1.2.1)
Hard Focused Crawling Approach
In this approach, if a downloaded page is off-topic –i.e. if the rules of the focused crawler imply that this page is out of interest-, then all the links extracted from this page will not be followed, and crawling will be stopped at his page.
2.1.1.1.2.2)
Soft Focused Crawling Approach
Soft Focused Crawlers doesn’t stop crawling even if a page is out of interest, instead they assign the relevance value of the page to every URL extracted from it, and send these URLs to the priority queue –the structure first of which is the next URL to be followed-. So if the relevance value of a page is very low, then the links extracted from that page are assigned very small relevance values and so are sent to the backwards of the priority queue.
2.1.1.2) The Problems A Large Scale Crawler Encounters and Their Solutions
2.1.1.2.1)
Problems A Large Scale Crawler Encounters
2.1.1.2.1.1) Denial of Service on Web and DNS Hosts
Web hosts usually have no objection to crawling since its beneficial fro them as well to be listed in search pages. However they do not want their resources to be consumed by crawlers. A crawler should not overload a single web server but alternate between millions of servers in internet or its activity is perceived as a DOS (Denial of Service) attack and the host complains or bans the crawler.
DNS servers face the same kind of problems. They are already very busy servers since every internet user requests the resolution of an URL before accessing it from these servers. So we do not want to send very frequent resolution requests to these servers.
2.1.1.2.1.2) Ethics – robots.txt
Content providers or web masters may not want crawlers to index their web sites or a part of it for different reasons. There exist two mechanisms to provide a means for requesting the crawler not to index some pages.
a- A specially formatted robots.txt is stored in the web host to indicate which parts of the web site are not to be visited by robots.
b- The Robots Meta tag, a special html meta tag can tell the robot if the page is t be indexed or not or even be analyzed for links.
Of course the robot may or may not follow these rules and that’s why this is an ethical problem. We want our crawler to follow the rules as much as possible
2.1.1.2.1.3) Spider Traps – Pages with
Dynamic Content
Some web pages produce millions of links just to confuse and prevent a crawler from functioning properly. These pages are called spider traps. Also there are pages with dynamic content such as forum pages or CGI scripts. Forum pages may assign a session id each time the page is requested in the URL of the page and the crawler thinks its viewing a different page, With session id, actual content of the page may also change. So a crawler goes into a forum site but never comes out. This must be avoided.
2.1.1.2.1.4) Revisiting Same URL
A crawler should not revisit the same pages over and over. There must be a mechanism to determine if the URL at hand has been visited before or not. Also we do not want to ask for the IP of a host more then once to the DNS servers. So visited URL’s and resolved IP’s of hosts should be stored. However the storage and retrieval mechanism should be very fast since the crawler will deal with millions of pages.
2.1.1.2.1.5) Downloading Same Pages
Some content may be hosted by different hosts under different names or a host might be accessed with more then one name. The crawler should still be able to recognize the page and not download it again. Basically previously downloaded pages must be compared with new pages to prevent duplicate downloading.
2.1.1.2.1.6) Pause - Resume Crawler
For a certain reason we might want to stop the crawler but resume again later. This maybe for performance testing or some other reason in the computer might force you. In this case we want the crawler to save all state information, downloaded URL’s, resolved hosts and all queues so that we can resume crawling later.
2.1.1.2.2) Solutions for the Problems A Large Scale Crawler Encounters
2.1.1.2.2.1)
Avoiding DOS
For each host an entry is created that holds host name, IP of the host and a time stamp. When a new URL is encountered, it’s checked against the pre-resolved hosts. If its IP has been resolved before DNS servers are not accessed but the entry in the database is used. Furthermore if the timestamp associated with this host has not yet expired then the URL is put into a temporary queue to be used later. That way we ensure that no other requests are sent to a web host if a certain time has not passed since last request.
2.1.1.2.2.2)
Ethical Considerations
If the host has provided a robots.txt, it’s downloaded. If the URL at hand is specified to be in a restricted area then the URL is discarded.
2.1.1.2.2.3)
Handling Dynamically Created Pages
First solution that comes to mind is to restrict the number of pages that can be downloaded per host. However that might prevent the crawler to download pages that are actually useful for us. Also still same content might be downloaded under different names.A stricter but better solution is to discard pages with certain keywords in them. For example a forum has keywords like “view topic”, “add reply” in its content or “session_id=” in its URL.
In this project we are going even further and working on only text/html type pages. Since their content is static we’ll avoid most of the problems encountered in dynamically created pages.
2.1.1.2.2.4)
Avoiding Revisiting
We use a BerkelyDB to store each URL and its content we visited so we can check the database with new URL’s to determine whether to download them or not . However since that will be memory consuming and comparisons will be very slow we take 128 but MD5 checksums of both the URL’s and their contents and store only them. Using 128 bit provides a very large number of possible checksums and probability of two pages having the same checksum is very low. Comparing 128 bits is a fast operation and serves our purpose.
2.1.1.2.2.5)
Resuming Crawler
We want to be able to store the queues (URL-DNS-READ) and state information (Resolved DNS’s, Timestamps for Hosts) when we want so we can re-construct the state and resume crawling. It’s already stated that a BerkelyDB is used to store some of these information. However storing all the queues into database and reading from it slows down the crawler considerably. The implemented solution is to use both memory and the disk. While a <id , score > pair is hold for each URL in memory (score for relevance of the URL to topic) the actual records are hold in disk. So whenever wanted the memory structures can be re-constructed from disk so allowing us to resume crawling.
2.1.1.3) Crawlers of Interest
Having introduced two general types of crawlers, the crawlers that will be based on and that will be used will be introduced now.
2.1.1.3.1) Soft Focused Crawler [Baseline
Focused Crawler of Chakrabarti’s]
The crawler introduced by Chakrabarti is a type of soft focused crawler. It has a “learning component” that is trained to assign a relevance value to a page depending on its relevance to the target topic. Then all the URL’s extracted from this page are assigned that relevance value and are inserted to the priority queue with this relevance value. So the main idea under this crawler is the assumption that “if a Web page is classified as relevant to a topic of interest, so are all the links extracted from this page”.
This crawler is initially given a quite big set of web pages, e.g. 260.000 pages from Yahoo. Then using “the extended Naïve Bayes classifier Rainbow”, this crawler is trained to recognize the taxonomy of an example page. Once the classifier constructs its internal model, then any page given to this model can easily be determined to be an on-topic –a page of interest- or an off-topic page.
However this kind of crawler has the inability to recognize the on topic pages, hidden behind one layer of off-topic pages. This inability of Chakrabarti’s crawler is overcome with the “rule based focused crawling approach”.
2.1.1.3.2) Rule-Based
Focused Crawler
This crawler, proposed by Prof. Ozgur Ulusoy, has the ability to support tunneling, and so is preferred in our system. As mentioned earlier, tunneling is the ability to see online-pages behind some offline-pages. This is actually by the means of a set of rules. The crawler is given a set of documents for each class two times [the first given set of documents are called the train-0 set, and the second one train-1 set], and the learner part finds out which of the classes in the train-0 set refers to the classes in the train-1 set. By doing so, for each class the learner part generate rules of the form Ti->Tj (%X), meaning a page o class Ti can refer to a page of class Tj by the probability X. This rule creation process can be seen in Figure 1.

Figure 1: a) The learner part Chakrabarti’s Soft Focused Crawler uses
b) The second time learning of Rule-Based Focused Crawler (train-1 set)
c)
Using train-0 & train-1 set results, generating the rules
2.1.1.4) Comparison of Crawlers of Interest
To see how soft-focused crawlers and rule-based focused crawlers behave in tunneling favorable times, the following example will be helpful.

Figure 2: Class distribution of pages fetched in train-1 set, for each class of train-o set

Figure 3: Interclass rules for the train-0, train-1 sets given in Figure 2.
Assume that the crawlers are seeking for PH, and the seed page of the scenario here is a PH home page that includes 4 URLs pointing to the pages of classes “CH”, “DH”, “PH”, “SP”. And the “CH” page has a hyperlink to a “PH” page.

Figure 4: a) Soft-Focused Crawlers’ Behavior
b) Rule-Based Focused Crawlers’ Behavior
Given the 4 seed URL, the soft-focused crawler will extract all of the 4 URLs in the seed page, insert them into priority queue and assign the relevance value of 1 to each of the seed pages –since it’s also a page of target class-. After extracting URL1, the “CH” link extracted from the URL1, will get a relevance value of less than 1 from the classifier, since it’s not a PH page. Therefore ultimately it will be sent to the end of the priority queue as shown in Figure 4, Part a.
On the other hand, in the rule-based-focused crawler’s case, all the seeds will be assigned the value 0.3 since PH->PH (%30) according to its rules. Then after extracting the URL1, it will assign the value of [CH->PH (%40)] 0.4 to the URL5 which will ultimately put it in the first place of the priority queue, as shown in Figure 4, Part b. And extraction of this will be a reward since the CH page refers to a PH page, a target page.
Using transitivity among the rules is another possibility of supporting tunneling. For instance, from Figure 2, DH->PH (%10) directly. Using transitivity DH->CH (%80) & CH->PH (%40) DH->PH (%10 + %80*%40 = %42). So since soft-focused crawlers won’t be able to see such a path, our rule-based focused crawler’s ability is a big gain, and therefore it will be used in our system with some extensions.
2.1.1.5) Additions to Current Crawling Systems
2.1.1.5.1)
Extensions to Rule-Based Focused Crawler
We plan to test the effectiveness and efficiency of an extension to the rule-based focused crawler in our project. It’s a fact that relevance values of page is to be considered for all the URL’s extracted from that page, namely if PH->CH(%30) then all the links have this possibility from that PH. However in most of the pages, the links on a page are grouped according to some criteria. And if we are able to resolve these groups and make use of the groups’ characteristic Ti->Tj (%X) rule, then the accuracy of our results will be much higher.
For
instance in the ‘Personal Homepage [PH]’ of Prof. Ozgur Ulusoy, all links are
grouped accordingly with each other. So for instance a rule PH->DH (%X) for
the whole page and a rule PH->DH (%Y) for only the “Biography” section of
the PH will be very different from each other.
Actually X is (4/36 =) %11 and Y is 3/9 = %33. Clearly, Y is much bigger
than X with our approach, since the whole page has many links different than
DH, and the DH/all URLs ratio is smaller than our case, in the whole page.



Figure 5: Home Page of Prof. Ozgur Ulusoy
2.1.1.5.2)
Revisiting the DOS problem
2.1.1.5.2.1) 2-Queue System [Current System]
One problem while crawling the web is that many pages would be downloaded from the same host in a short time period. Certainly web hosts do not want their resources to be consumed by crawlers that will prevent actual consumers to access their pages in a fast and efficient manner. So repeatedly requesting pages from the same host in a short period of time is interpreted as a DOS (Denial of Service) attack and the web server might block the crawler from accessing its pages. We do not want our crawler to cause such a problem and certainly we do not want our access to be blocked to web hosts. The solution to this problem is simple. Crawler should not send a request to a web host if more then a certain period of time has not yet passed since its last request to this same server.
Initial approach was to maintain two queues to
prevent DNS like behavior. First queue was the main URLQueue and the second one
was the temporary
queue which stored the URL’s of pages that belonged to busy hosts. Busy host
means that a page has been downloaded from it and the timestamp associated with
this host has not yet expired. A timestamp of 60 seconds have been used in
implementation. The accessing order of the queues is simple. If the last
extracted URL was obtained from the main queue, the next URL will be extracted
from the temporary queue and if it was from the temporary queue next one will
be extracted from the main queue. Flow of URL’s is shown in figure 7.
This two-queue method is a simple but
working solution, however it proved unpractical. There are two main reasons for
that.
1-
Web pages nearly always have more links to the same host then to external
hosts.
·
http://www.bilkent.edu.tr/bilkent/general/index.html
·
http://www.bilkent.edu.tr/bilkent/academic/admission/index.html
·
http://www.bilkent.edu.tr/image/home_page_footer.gif
As an example if the Bilkent main page is downloaded a total
of 20 links in this page has to be put into the temporary queue since www.bilkent.edu.tr’s timestamp has just
been initiated and these 20 links cannot be yet accessed. This fact causes the
temporary queue to enlarge very fast.
2- Of these 20 links there might be
some very good links for the focused approach (i.e. pages with high relevance
values) but since they stay deep in the queue we have to pursue less relevant
pages. This problem showed that the temporary queue has to be a priority queue.
3-
Maintaining a queue of thousands of items becomes problematic because this
queue isn’t a FIFO queue but a priority queue. The temporary queue has to be a
priority queue as the main URLQueue is, since we want to pick the most relevant
pages to our topic first. So every time a new entry is added, the queue has to
be reordered so that the most relevant pages stay in front while less relevant
pages stay in the bottom.
4-
Let us say we have just extracted 10 very promising links
from a page. We put them in to the temporary queue. Since their relevance score
is high, they make their way up to the root of the queue. Next time we access
the temporary queue we have to pick each of these links and put them back into
the bottom of the queue (since timestamp for this host hasn’t yet expired)
until we find a link with an accessible host. Next time we access the temporary
queue we will find the same 10 links on the top of the queue again. These types
of links easily go up to hundreds and thousands forcing us to handle the same
links again and again which is resource and time consuming. The problem is
illustrated in figure 6.

Figure 6:
Problem in Priority 2-Queues
2.1.1.5.2.2)
Using Multi-Queues Instead of Two
Queues
The idea is to have an array of queues. Each element in the array is a pointer to a queue dedicated to a single host. Let us call this array the HostArray. A thread fetches an URL from the main URLQueue. If the host of this URL is busy, the URL is put to the respective queue for this host. Then the thread makes a pass over the host queues and among available hosts picks the one with the highest score of relevance. Then the corresponding record is taken from the database (usage of the database is explained in the next paragraph) and given to the DNSQueue. Then this host is set as busy again. If there is not a queue for a host, a new queue is created. If the HostArray is already full we can delete the queue with least relevance values and create a new queue for the new host. Or a queue can be stored to the database and swapped later if needed.
The items in the queues do not hold all record information but simply a <record_id , score > pair. The actual records are stored in a BerkeleyDB. This way we can create very long queues in memory without size problems since the id-score pair requires little memory. This method also eliminates the need for a temp queue which grows very fast and large and also has other problems as describes in section 2.1.1.5.2.1. Figure 7 shows the flow of URL’s in the multi-queue system.

Figure 7: Flow of queues in a multi-queue.
2.1.2)
Preparation of Data for Querying:
2.1.2.1)
Basic Keyword Search with Information Retrieval
To be able to make simple keyword based search on the “document collection”, an “inverted file based search approach” will be used. This approach requires the usage of a file structure called “inverted file index”. Using this approach, the user will enter a set of terms as the query, and then the system will find the best N documents with relevance values from highest to lowest. The inverted index file structure of the mentioned approach is explained in the following section.
2.1.2.1.1)
Inverted File Index Structure
Inverted index file structure holds the information of “all the terms appearing in all documents, the documents that a term appears in and how many times that word appeared in that document”. According to NIST [National Institute of Standards and Technology], an inverted index holds also the positions of the words in the texts. But in our search approach “inverted file index” will be used and it doesn’t hold the positions of the words in documents.
An inverted
file index structure
term_id <document_id, term_weight> *
1
<3,
5> <6, 7><29, 3>
2 <1, 10><16, 9><62, 1><186, 5>
3 …
.
.
.
N <1, 10><3, 20><6,
1>
The first number if the term_id corresponding to a word, then the <document_id, term_weight> pairs show the information of which documents the word with term_id occurs in with what term_weight frequency.
Actually the use of inverted index file structures base on the fact that for basic keyword based search the number of terms to be searched are quite low. Then instead of searching the terms in all of the documents, usage of inverted index makes it possible to just access the terms and their corresponding <document_id, term_weight> pairs to be searched for and ignore the rest of all the terms. Since the ratio of “the number of terms to be searched for” to “the number of all terms in all documents” is typically very low, searching time will be reduced significantly. Moreover, since the inverted index files are stored on the disk, the amount of memory required by the system also decreases. And inverted index file is usually a binary file. To mention without loss of generality, this technique ensures favoring sequential reading and so decreases the time requirements during the disk accesses.
2.1.2.2) Advanced Search with Information Extraction
2.1.2.2.1) Classification of Document
Collection with Rainbow Library
Rainbow is a text classifier program that is based on Bow library. The working mechanism of Rainbow:
· Indexing documents and words. While applying indexing, words that are used frequently in English can be omitted. It also contains HTML parser and stemming option again for English. HTML parser cleans unnecessary parts like tags in the HTML documents and purifies documents. Stemming is ability to relate words with prefixes and suffixes to their own stem.
· Classification of documents. This step contains two phase. First phase is training of the dataset. Second phase is testing phase. While performing training phase, documents indexed must include all training data. Testing data can be sub-part of these indexed documents.
Rainbow supports four classification methods. These are:
¨ Naïve Bayesian
¨ K-Nearest Neighbors (KNN)
¨ TFIDF (term frequency/index document frequency)
¨ Probabilistic Indexing
Naïve Bayesian: is the classification algorithm that is based on Bayesian algorithm and it is generally used for the inputs with high dimensions. This algorithm is a simple probabilistic classifier and it is based on probability model that assumes strong independence between attributes. In other words, algorithm ignores the interactions between attributes in the same class and this is the reason for why it is called ‘Naive’. This model also called independent feature model. The main idea behind the Naïve Bayesian algorithm is to compute joint probability and category and to estimate category of document. Posterior probability of document for each class is computed according to Bayes rule and document is assigned class with highest probability.
K-Nearest Neighbors: searches for K nearest neighbors in we want to classify, then algorithm classifies the majority class of these nearest neighbors. Although this algorithm works well on real domain, its performance is the problem for KNN. These neighbors are determined according to distance between documents pairs. Distance can be measured by Euclidian distance or cosine similarity.
TFIDF: is the relevance feedback method introduced by Rocchio. The main idea is that the term in the document that has high frequency inside the document and low frequency outside the document is more likely to be relevant to topic of this document. So the term weight is computed according to its term frequency multiplied with inverse of its document frequency.
Probabilistic Indexing: In this model according to difference in the distributional behavior of words, the word is assigned as an index term or not. The main idea is to get likely relevance of documents based on occurrence and co-occurrence of the words in the document. In other words, algorithm indexes the words with the probabilistic term weights based on the relevance of terms to queries given to the system.
Therefore, in our project we are planning to use Rainbow library with Naïve Bayesian method to classify the HTML pages in order to understand category that document belongs to. These categories are, as it mentioned, course page, personal home page, research project page, department page and tutorial page.
2.1.2.2.2) Information Extraction on the
Classified Document Sets
First stage of the project was information retrieval. Web pages are crawled based on certain rules and crawler downloads pages that are of interest to us which are pages that have relevant information to computer science. Once the pages are downloaded all data has to be prepared for searching. We try to implement two kinds of searches. First one is keyword searching where user can request pages that have the inputted keywords on them. An inverted index, as explained previously, is maintained for this purpose. The other kind of search is done on specific database tables that will be constructed to answer SQL type queries. For example a user may want to find course pages related to database design or names and web pages of researchers that have publications about machine learning. To be able to answer such queries we will extract unstructured information from downloaded web pages and store them in a structured database. This is the information extraction stage.
With the rapid growth of World Wide Web the need for extraction of specific information from web pages increased. Information extraction is the process of analyzing natural language and collecting information about specified types of entities, relationships, or events. In this project we will try to extract named entities and construct relations between them. For example to be able to determine the name and e-mail of the instructor in a course web page we will have to mark Prof Dennis as the instructor and dennis@something.com as his e-mail address.
The difficulty in the task is that information is not recorded in a standardized manner meaning the same information can be expressed in different ways.
a-Mike broke
the window. I saw him run way.
b-I saw Mike
run away after breaking the window.
Both sentences convey the same information although they are structured differently. Now in these sentence an information extraction (IE) system should be able to realize these points.
“window” is an object
“Mike” is a name
“broke” is an action
“run way” is an action
“him” corresponds to “Mike”
“Mike” committed “broke” and “run away”
What we are trying to achieve is very similar to this example. A course page might have the desired information but not in a formatted way. For example
b-Instructor: Prof Dennis
e-mail
: dennis@something.com
In this page we want to be able to extract instructor name and his e-mail whether format a or b or even a completely different format is used so that we can store this information in database tables and make it ready for querying.
There are several information extraction algorithms. Some of these are Rapier, BWI (Boosted Wrapped Induction), LP2 and BIEN.
Rapier algorithm uses inductive logic programming techniques to determine extraction rules for specific fields. Rapier algorithm does not distinguish the start or end tag of string, it tries to identify string its entirety. Algorithm starts to search more specific rules to general rules on training data. Then by repeating these rules, it tries to generalize these rules for. Rapier has a feature like tokens, part of speech information and some semantic class information.
BWM algorithm is based on learning wrapped patterns and by boosting these patterns are combined. In order to learn separate model, it identifies start and end tags. To estimate accuracy of pairing the start end tag, histogram of training fragment lengths. BWI features are actual tokens, orthographic generalizations (capitalized, punctuation etc.) and some lexical knowledge.
LP2, in order to identify start and end tags, learns symbolic rules. Like BWI, it identifies start and end tags separately. LP2 features are tokens, orthographic generalizations and shallow linguistic information. In addition there is supplementary learning correction rules.
BIEN algorithm learns model by using dynamic Bayesian network. BIEN has the most sophisticated features compared to other three algorithms. BIEN has feature of token, orthographic generalizations and it uses chunking, a gazetteer and semantic.
We will determine which information extraction algorithm we are going to use in the future stages of our project.
2.1.3)
Searching and Querying the Portal:
There are three ways to access the data in our project. These ways differ through the needs of users.
2.1.3.1) Browsing / keyword searching:
This is the most widely used method
in Internet searches. Most of the search portals [such as Altavista, Google
etc.] use this method as simple search that browses through the data using the
given keyword and presents the result to user using a simple typical GUI.
2.1.3.2)
Querying
In our project one of the most important parts we will implement, that the common search portals do not have, is the data tables which contain extracted data from the web resources by using information extraction techniques. This will allow users to issue sophisticated SQL queries that will help users to get more accurate and faster results. The portal will also rank the results, higher relevance to lower, because of the impossibility of having 100% accurate results all the time and it will do this with respect to similarity functions and the confidence score of the extraction tool which we call relevance.
2.1.3.3) Web Services:
Our project’s last part will be to
provide Web services that can be used by other applications. A useful example
of this will be getting homepage of a person whose his name is given to the
server. It seems to be very useful because this kind of a server will help
other applications that try to get the updated version of the given people’s
homepages. Also using it on course homepages may help the instructors to
automatically reach desired web sites by adding this function to his server
instead of searching them manually.
3) Proposed System
3.1)
Overview
First of all, creating a web server that will be a guide to all people from professional to academic scale is a new and original task to be found on web. All the search portals on web can help people to a point but it is likely that they will give hundreds of unrelated pages as results in addition to a couple of actually desired ones. Our project overcomes this by carefully extracting only CS related sites from the world wide web and by giving users results found around these CS sites.
Our proposed system will
accomplish its task with the use of rule-based focused crawling approach with
an extension in the “building the document collection phase”, information
retrieval method namely “inverted-index-based search” for the basic-keyword
search, information extraction methods for the advanced search.
3.2) Functional Requirements
3.3) Non-Functional Requirements
In order to support this kind of website we need a fast server that will
not make user aware of the hardness of the search. Also the first phase that
contains constructing the database we will need a real super computer to get
adequate results in an acceptable time period. We are promised that we will
have a computer that has dual processor and 4 GB RAM so that will surely do the
work. In software aspect we will use SQL for our database.
3.4) Pseudo Requirements
Although not directly dictated by the project, some new ideas and technologies are thought to be implemented in this project. Here is a list of them:
● In the first [building document collection] phase of the project, rule-based focused crawling approach will be extended. In the extended version instead of pages, some parts of pages will have Ti->Tj (%X) rules and this approach is thought to give more accurate results.
● In the second
phase [information extraction] of the project, the databases holding the
extracted information will hold relevance values for all the attributes of all
tuples, instead of holding just one relevance value for a tuple. In the old
approach, if some attributes of a tuple possess low relevance values, then the
general relevance value of the tuple will be low. However if the user doesn’t
deal with the attributes whose relevance values are low, then although this tuple
is potentially a target tuple for that user and it will be in lower levels of
the result page because of its low tuple relevance value. On the other hand, in
our approach, the tuple’s general relevance value is calculated with the
relevance values of the attributes that are searched for. In this way, the
tuples will be ranked for that specific search, and the mentioned deficiency
won’t be faced.
3.5) Systems Model
3.5.1) Scenarios
Scenario 1 (Simple
Search):
A student visits the portal and types “stack” to the simple search part then he will get the pages those contain word “stack” in relevance order, highest to the top.
Scenario 2 (Course
Page Search):
An instructor giving “Algorithm I” lesson
in
Scenario 3
(Personal Home Page Search):
A graduate student from
Scenario 4
(Research Group Page Search):
A
PhD. Student clicks “Research Group Page” button hoping to find some current
researches on his PhD assertion topic “Information Extraction”. In the second
page he writes “Information Extraction” to “Research Subject” part then he will
find out related web pages.
Scenario 5 (Department
Home Page Search):
A last year high school student eager to
become a computer engineer, currently trying to decide which university he
should join, clicks “department home page” button to find the address of CS
departments in
3.5.2) Use Case Model

Figure
8: Use Case Model for the System
3.5.3) User Interface and Screen Mock-Ups
3.5.3.1)
Simple Search:

Figure 9: Simple Search
By using this page user can perform simple search. It is also possible for user to reach advance search option in this page. These options are search in ‘Course Home Pages’, ’Personal Homepages’, ’Research group Pages’ and ‘Department Pages’.
3.5.3.2)
Advanced Search:
In the advance search option, each domain specific search page contains fields that you can make structured information retrieval. Each search page performs querying on connected table. For example, course home page will perform its retrieval on CoursePages table of database. Also fields in the advance search are representations of columns of related table.
3.5.3.2.1) Personal Home Pages:

Figure 10:
Personal Homepage Search
In this page, user can make search on personal homepages that are related to computer science. There are six search fields that make user to get more structured information. These fields are ‘Person’s Name’, ’Research Interests’, ’Mail Address’, ’Academic Memberships’, ’Phone’ and ‘Courses given’. There is also ‘Exact’ option for each field. If exact option is selected, query is evaluated according to the text written in that field. The default option is likely search for each field. It is also possible to mark more than one field. Exact option is included in every advance search domain so it is not necessary to re-explain it. User can also choose the number of top query results by using drop-down button. There are options from 10 to 50. When user pushes the ‘Advance Search’ button results of this query will be shown.
3.5.3.2.2) Course Pages:

Figure 11:
Course Page Search
By using this page you can perform
search on course homepage domain. There are six search fields which are ‘Course
Name’, ‘Instructor Name’, ’Text Book’, ‘University Name’, ’Instructor’s E-mail’
and ‘Description’.
3.5.3.2.3) Research Group Pages:

Figure 12:
Research-Group Page Search
From this page user can perform search on research group pages. There are five fields for searching, these are ‘Member Names’, ‘Research Subject’, ‘Publications’, ‘Research Interests’ and ‘E-mail Addresses’.