MIME-Version: 1.0 Content-Type: multipart/related; boundary="----=_NextPart_01C6308E.C50842A0" This document is a Single File Web Page, also known as a Web Archive file. If you are seeing this message, your browser or editor doesn't support Web Archive files. Please download a browser that supports Web Archive, such as Microsoft Internet Explorer. ------=_NextPart_01C6308E.C50842A0 Content-Location: file:///C:/B2F4650C/LowLevel.htm Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset="us-ascii"
1.2
Interface Documentation Guidelines.
1.3
Definitions, acronyms and abbreviations.
►
PHASE1: Document Collecting:
►
PHASE2: Processing the Document Collection:
►
PHASE3: Querying the Portal:
►
PHASE1: Document Collecting:
►
PHASE2: Processing the Document Collection:
►
PHASE3: Querying the Portal:
In Crawling part of the system, all the links coming f= rom one site are put together in the queue and it is likely to happen that our server will try to download lots of pages from one internet server in a sma= ll period of time. It will cause problem for the internet server and this may perceived as a brute force attack by that server. In order to prevent to prevent this, our program uses a politeness policy that will prevent crawler from downloading from same server one after another. Crawler will have to w= ait a couple of seconds to enter the same server again. This will slow down our program but it will prevent greater problems.
Information extraction (I= E) has received a lot of attention from computer science world recently as the nee= d to extract useful information from web pages increased. In general the text to extract information can be categorized in three: Free text, structured text= and semi-structured text. Web pages falls in the last category since HTML tags somewhat structures the text on the page. A lot of research has been conduc= ted in the area of IE and many algorithms have been proposed to deal with the t= hree kinds of text. We decided to use SRV algorithm which exploits the HTML tags= and so treats web pages as semi-structured data. Choosing SRV as our algorithm = we lost the flexibility that is a given with free text processors so SRV is li= kely to be unable to find some of the information that a free-text counterpart c= ould have detected. However SRV doesn’t need to pre-process the text and d= eal with semantics or grammatical rules and makes use of HTML tags. So it will = be much faster and less complicated.
Naming Conventions
In Java language:
In C language:
R-FC: Rule= -based Focus Crawling
IE: Information Extraction
IR: Information Retrieval
1) S.Chakrabarti, “Mining the Web Discovering Knowledge from Hypertext Data.” Mor= gan
Kaufman Publishers, 352 pages, 2003.
2)
Can, F., Altingovde,
3) I. S. Altingovde, Ö. Ulusoy, Exploiting Interclass Rules for Focused Crawling, IEEE Intelligent Syste= ms, vol.19, no.6, 2004.
4) Peter Flach and Nicolas= Lachiche. Naive Bayesian classification of structured data., Machine Learning, Vol. 57, Number 3, pages 233-269= , 2004
5) Fuhr, N. Models for retrieval with probabil= istic indexing. Information Processing and Management 25, 1 (= 1989), 55--72.
6) http://www.nist.gov/
7) http://www.cs.cmu.edu/~mccallum/bow/rainbow/
8)http://www.cs.utexas.edu/users/hyukcho/classificationAlgori= thm.html#Classification%20Algorithms
9) http://www.cs.bris.ac.uk/Publications/pub_info.jsp?id=3D2000266
10) Dayne Freitag, “Information Extraction for HTML: Application of a General Machine Learning Approach”, 1998
11) Stephen = Soderland, Learning Information Extraction Rules for Semi-structured and Free Text
12) Dayne Freitag, Machine Learning for Information Extraction = in Informal Domains
This stage includes reaching domain specific web page=
s and
downloading these pages by using rule based focused crawling algorithm. RFC
starts downloading pages with given seed pages and continues to download re=
levant
pages according to their classifier scores. In order to classify the pages =
Rainbow
Library is used. In the implementation stage of priority queue, Berkeley DB
library is used. RFC enables us to reach domain relevant (CS) pages.
In th= is 2nd phase of the project, downloaded pages in the first stage will be processed. This processing will be in two types since there will be two types of queri= es in the CSRP: information retrieval will be utilized for the simple [keyword based search] and information extraction for the advanced [SQL like] search= .
► PHASE2.1: Information Retr=
ieval:
In th= is phase all the pages that are collected in the 1st phase will be taken = as a document and will be parsed to get rid of the unnecessary tags etc. and to = get the tokens of each document. The information of which the document has which tokens will be stored in the document vector.
Altho= ugh a document vector is enough to make the simple queries [phase 3.1], a differe= nt file structure, namely the inverted index file structure will be used for efficiency. Since the number of keywords in a keyword based search is limit= ed, the use of an inverted index file becomes much more efficient as the system= can easily reach the owner documents of the keywords of the query.
► PHASE2.2: Information
Extraction:
1-Classifier: The first step in t= he IE phase of the CSRP is the categorization of the documents. The heterogeneous document set coming from the 1st phase of the system, will be categorized into homogenous sets like “course pages”, “personal homepages” etc. As explained in the high level design report, Rainbow library will be used as the classifier. Details of the rain= bow classifier can be found in the analysis report.
2- Information Extractor: Informa= tion extraction (IE) has received a lot of attention from computer science world recently as the need to extract useful information from web pages increased= . In general the text to extract information can be categorized in three: Free t= ext, structured text and semi-structured text. Web pages falls in the last categ= ory since HTML tags somewhat structures the text on the page. A lot of research= has been conducted in the area of IE and many algorithms have been proposed to = deal with the three kinds of text. We decided to use SRV algorithm which exploits the HTML tags and so treats web pages as semi-structured data. Choosing SRV= as our algorithm we lost the flexibility that is a given with free text proces= sors so SRV is likely to be unable to find some of the information that a free-t= ext counterpart could have detected. However SRV doesn’t need to pre-proc= ess the text and deal with semantics or grammatical rules and makes use of HTML tags. So it will be much faster and less complicated.
There are tw= o types of queries in the CSRP. The first one is the web-search engine-like keyword based querying and the second one is a more sophisticated querying composed= of SQL like query commands.
►
PHASE3.1: Keyword Based Querying:
Keyword Based
Querying [simple querying] that is much like the popular web-search-engines
like querying is provided in this phase. The CSRP provides this querying op=
tion
as a complementary feature of sophisticated querying since IE techniques ca=
n’t
extract everything out of a page and since a user may desire to search for
something that is not extracted by the IE and that is not supported in the
sophisticated querying.
►
PHASE3.2: Sophisticated Querying:
This querying option is available by the exploitation of IE techniques on the document collection coming from the 1<= sup>st phase. The users will enter some limitations on the query interface and the system will convert these queries into commands that the dbms will understand. As explained in the high level design report, the software= in this phase will be the middle tier between the user, dealing with the client interface, and the dbms that receives its comma= nds from our software. After the evaluation of each query the dbms will return the results to our software which will prepare the result page = to be shown to the client.
Document collecting stage is coded in non-object oriented programming with C language. There is crawling library that uses Berkeley DB library and Rainbow classifier library.




Figure 3.1
focusedCrawler.c:
Functions of focusedCrawler.c;
*initialize(int myargc , char *myargv[] , char *data_dir= , char *model_dir , char *mainFoc= us): initializes rainbow classifier.
crawlCore.c:
Functions of crawlCore.c;
*crawl()= b>: initializes mutexes that regulate read from que= ues and start fetch, crawl and DNS threads.
*fetch(voi= d * unused): thread function for fetch and just calls = http_fetchNext function of lib.c
*dns(void * unused): thread function for DNS resolve a= nd just calls http_resolveDNS function from lib.c
*read(void* unused): thread function for crawling and calls ht= tp_crawl function fr= om lib.c
lib.c:
Functions of lib.c;
*http_crawl(): crawls the URL that is read from read queue wi= th online and offline options.
*http_parseURL(http *h, const char *URL): parses given URL into = its host, path and filename.
*http_openConnection(http *h): opens http connection by given host add= ress.
*http_parseHeader(http *h): parses the HTTP header that is returned= as a result of GET message we sent, and extracts content length, content type and response number.
*http_dumbRead(http *h): reads whole content of HTTP message.
*http_extractURL= s: extracts all URL’s in the given page and put them into fetch queue according to their classifiers scores.
*http_fetchNext<= /span>(): fetches URL from fetch queue or TMP queue and puts the URL into DNS queue.<= /p>
*http_resolveDNS= (): reads from DNS queue and resolves DNS address of host then puts URL address into ‘read queue’.
*http_isURLSeen<= /span>(char* URL): checks if URL is seen before or not.
*http_saveFile= span>(http *h): saves the crawled file using host name and path.
*http_isBusy(http *h): checks if host is busy or not in order not to overload the server.=
*http_isSafeToRe= ad(char * url): checks if desired URL is allowed to= crawl by looking robot.txt.
priorityQueue.c:
*http_getNextURL= (char **URL, int depth): returns the first elemen= t in the priority queue.
*http_initialize= sURLPool(char* filename): initializes the priority queue.
*http_insertURL<= /span>(char* URL,double score,char* category, int depth): inserts URL into prio= rity according to scores.
*tempQueue.c<=
/span>:
*http_temp_getNe= xtURL(char* URL, int depth): returns the first element = in the temp queue
*http_temp_initi= alize(): initializes the temp queue.
*http_temp_inser= t_URL(char* URL, int depth): inserts URL into temp queu= e.
readQueue.c:&=
nbsp;
*http_read_initi= alize(): initializes the read queue.
*http_read_inser= tURL(char* URL, int depth): inserts URL into read queu= e.
► PHASE2.1: Information Retr=
ieval:
The I= R stage of the CSRP will be coded in a non-object oriented programming language, so= the software used in this stage will be explained in a module and its functions manner. Main parts of the softwares used in this stage are as follows:
1) Dvector Creation Software:
* void process_relation (char* rel_name): opens a file to be parsed and parses i= t, discarding the undesired tags and the information in these tags. First elim= ination of undesired tags and information in these tags is done in this function
* void process_tuple (char *line, int tuple_no): takes = a line of html document with tags, eliminates the tags and makes the pure informat= ion processed for the document vector
* void read_next_value (char *into): = b>reads a token from the untagged line one by one, by getting rid of the separators=
* int separator(char ch): returns whether a character is a separator that is to be discarded or n= ot
* initialize_doc_vector: initialize= s the doc vector for the current document
The Trie structure is used= in this phase by the ‘Dvector Creation Software’ to detect whether a word [token] has been encountered befor= e or not. It functions are as follows:
* int insertTrie= span>(struct trieNode *root, ch= ar *word, int key): Inserts a key for a word i= n a trie. If the word given as input is EPSILON, or the <= span class=3DSpellE>trie already contains that key, it returns 1. Otherwi= se, it returns 0.
* int searchTrie=
span>(struct trieNode *root, ch=
ar
*word, int *result): S=
earchs
for a word in a trie. Returns the key found in =
the
variable result. If the key is not found, function return 0. In that case,
content of result is not important.
* int trieFilled= span>(struct trieNode *root): <= /b>Checks whether a level of a trie is filled
* void allocateTrie(struct trieNode **root): Allocates space for the r= oot of a trie.
* void deallocateTrie(struct
trieNode *root): Deall=
ocates
a trie with a given root.
* void printTrieToScreen(struct
trieNode *root, char *word): Prints the
2) Inverted Index Creation Software:=
* void process_dvec(char * rel_name): reads each two line of doc vec, for each term e= nters the data into inverted entry
* int add_to_index(int doc_id, int word_no, int rank_in_doc, int cluster_no): adds the word to the inverted index = file
* double dvec_leng=
th(int size) : computes the vector length
* void dvec_norm=
alize
(int size): normalizes the dvector
* void initializ= e_doc_vec(int d_size): initiali= zes the doc vector for the current doc
► PHASE2.2: Information
Extraction:
1-Classifier: Rainbow Library wil= l be used. Details of the rainbow library can be reached at http://www.cs.cmu.edu= /~mccallum/bow/rainbow/
2- Information Extractor:
Naming
Conventions
In Java language:
· Variable names are started with lower case a= nd following words started with upper case.
· Function names are started with lower case a= nd following words started with upper case.
· Class names are started with upper case and = following words started with upper case.

Class
Diagram 1

Class
Diagram 2
Variable: SRV uses predicates like= some(?A, [ ], word, “pm”) , binding tokens to variables, in this case A. This class creates instances of these variables
VariableBoundTable= i>: Holds a vector of Variables, map= ping document tokens to variables.
Predicate: Parent class for real predicates like “some”, “length”, “every̶= 1;. Additional predicates may be implemented in the future.
SomePredicate: Inherits from Predicate class. Implements “some” predicate.
LengthPredicate<= /span>: Inherits from Predicate class. Implements “length” predicate.
EveryPredicate= span>: Inherits from Predicate class. Implements “every” predicate.
SimpleFeatures= span>: This class has static functions = that evaluate simple features and returns true or false. Simple features include "isCapitalized", "word", &q= uot;isNumeric" etc.
Rule: A rule might consist many predicates. Instances of this class hold these predicates and can evaluate whether a given fragment of t= ext is covered by this rule.
HTMLParser: This is a class from an outside package. Constructs an html parser so that string tokens can be used.
RuleExtractor: Once the threshold values have b= een found for different fields, rules are created for each of these fields.
Trainer: This is the main class th= at initiates other objects and controls training process.
Extractor: When the training phase= is over, Extractor uses actual data and extracts information to put in DB.

This cl= ass is the super-class for real predicates, does not have any methods by itself. I= ts purpose is to create a generalization of the other predicates so each can be treated as an instance of Predicate= and can be cast down to their actual class when necessary.

length = -> length of the fragment
comparisonOp -> can be operators >, < and = =3D
boolean isEqualTo -> t= akes a predicate as an argument and returns true if given predicate is equal to th= is
Boolean= evaluatePredicate -> takes a fragment as an argume= nt and compares it to this.length based on comparisonOp

simpleFeatureName -> this can be any of the simple features such as "isPunctuation", &qu= ot;all_upper_case", "all_= lower_case".
value -= > value to check against the result of simple feature evaluation
is_inTagName -> if this feature is between tags, t= his gives the tag name
boolean isEqualTo -> t= akes a predicate as an argument and returns true if given predicate is equal to th= is
Boolean= evaluatePredicate -> Evalaute= s the given fragment such that whether it is consistent with the predicate.= p>

String = [] path -> Contains strings of relational features like {"prev_token", "prev_token"}
simpleFeatureName -> this can be any of the simple features such as "isPunctuation", &qu= ot;all_upper_case", "all_= lower_case".
value -= > value to check against the result of simple feature evaluation
String = comparedToken-> If simple feature is word, then th= is is the compared token.
boolean isEqualTo -> t= akes a predicate as an argument and returns true if given predicate is equal to th= is
Boolean= evaluatePredicate -> Evalaute= s the given fragment such that whether it contains at least one desired token= .

char variableChar -> this can be any char. We prefer ca=
pital
letters.
int referringTokenIndex -= > The index of the referring token in tokens Vector. -1 for nil.

Vector = boundTable -> holds bounded variables
int:getVariableValue -> Given start index and end = index values of a fragment, tries to find the value of the variable if any. If th= ere is no such variable is bounded to any token in this fragment before then returns -1
isTokenBoundedByAnotherVariable -> whether the tok= en that is at given index is bounded by a variable
void boundVar -> binds a variable to a token.

final s= tatic String [] simpleFeatureList -> String array holding simple future names.
final s= tatic String [] punctuationList -> string array ho= lding punctuation marks.
static = boolean: (All the following methods are static, acces= sible from anywhere)
evaluateSimpleFeature -> This function checks the = name of the future and calls the appropriate function from below.
isCapitalized -> whether given token is capitalize= d or not?
word -> = return the lower case version of the input token&n= bsp;
isNumeric -> whether the token is numeric or not?<= /p>
isPunctuation -> whether the given token is a punctuation mark or not.
all_upper_case -> Returns whether the token is all uppercase alphabetic caharacters.
all_lower_case -> Returns whether the token is all lowercase alphabetic caharacters.
isSingleton -> whether token contains only one cha= r or not?
isDoubleton -> whether token contains only two cha= r or not?
isTripleton -> whether token contains only three c= har or not?
isQuadrupleton -> whether token contains only four= char or not?
isLong -> whether token contains more than FOUR ch= ars or not
isAlphanumeric -> whether token is alphanumeric su= ch that it contains alphabetic chars and numeric chars.

Vector = rules -> holds the rule predicates.
void:= p>
addLengthPredicate -> adds a length predicate
addSomePredicate -> adds a some predicate
addEveryPredicate -> adds a every predicate &= nbsp;
boolean:
evaluateRule -> evaluates the given fragment to se= e if evaluates to true
containsLengthPredicate -> if contains a length predicate
containsEveryPredicate -> if contains a every pred= icate
containsSomePredicate -> if contains a some predic= ate

Vector = minValues -> min values for fields
Vector = maxValues -> max values for fields
Vector = fieldNames -> names of fields
String = ThresholdDir -> directory to read values
String = RulesDir -> directory to put rules
Void extractRules -> creates the rules for each field=
p>

This is the= main Trainer class which initializes other components and controls the training process. The main method has calls to other objects methods.

A main = frame containing a panel for adding fields and editing them makes up the GUI for = tagging the html documents in order to prepare them for the training phase.

String = dataDir -> directory for the data to be processed =
String = rulesDir -> rules directory
HTMLParser htmlParser -&g= t; html parser
Once the training phase is complete and rules are extracted, its is a trivial task to use these rules and extract the desired information from a data collection.=
►
PHASE3.1: Keyword Based Querying:
* void process_r= anked_query(char * rel_name): takes a query file etc. and ma= kes it ready for the query running
* void run_ranki= ng_query(DocVec *q_vec, int q_size, int q_no): <= /span>executes the queries
* void showQuery= Results(): prints the results to the screen to be seen by the user
►
PHASE3.2: Sophisticated Querying:
* void run_ranki= ng_query(AdvancedQuery q):&nb= sp; executes the query q that specifies the type, field restrictions= and field specifications of the query
* void showQuery= Results(): prints the results to the screen to be seen by the user
● Rul= e Based Focused Crawling: This a type of web crawler that is an extension of focused crawling approach with tunneling. Instead of deciding whether to fo= llow or neglect the links on a page at the first level, with the use of rules th= is crawler check two level relevance values. The multiplication of the relevan= ce values for these two levels is the main concern in rule based focused crawl= ing that is why it can support tunneling. This type of crawling is used in our system.
● Doc= ument Vector: A document vector is the file containing the information= of which terms occur how many times in which documents. The document vector is converted into inverted file index to be used efficiently in keyword based querying. Since the number of terms in a keyword based query is limited, us= ing inverted file index is much more efficient than using the document vector during the querying phase.
● Inv= erted Indexing: An inverted index is an index structure storing a mapp= ing from words to their locations in a document or a set of documents, giving f= ull text search. An inverted index is the most important data structure used in search engines.
There are two main types of inverted indexes: An inv= erted file index contains for each word a list of references to all the documents= in which it occurs. A full inverted index additionally contains information ab= out where in the documents the words appear. The one used in out project is the inverted file index, containing the information of which terms occur in whi= ch documents.
● SRV Algorithm: A kind of relational learner proposed by Dayne Freitag in 1998.