Digital Library

Title:      AUTOMATIC DETERMINATION OF SIMILARITY THRESHOLD FOR FOCUSED CRAWLING PROCESSES ON WEB PAGES
Author(s):      Gustavo Oliveira de Siqueira, Guilherme Tavares de Assis, Anderson Almeida Ferreira, Amanda Sávio Nascimento e Silva, Vítor Mangaravite, Flávio Luis Cardeal Pádua
ISBN:      978-989-8533-57-9
Editors:      Pedro Isaías
Year:      2016
Edition:      Single
Keywords:      Similarity threshold, web crawling, focused crawling
Type:      Full Paper
First Page:      95
Last Page:      102
Language:      English
Paper Abstract:      The great popularity and, especially, the fast growth of the Web have led to the proposal and analysis of new techniques that help users locate the information they need effectively, in a satisfactory time and without much difficulty. Traditional crawlers are not able to identify relevant sub-spaces of the Web related to a specific theme; focused crawlers, however, can solve this problem effectively and efficiently. Usually, a focused crawling process requires a specific value, called the similarity threshold, to determine whether a crawled Web page is relevant to a topic of interest; this value is distinct for each topic. In this work, we propose three strategies to determine this value automatically for focused crawlers based on a genre-aware approach. In our experimental evaluation, the best result was 100% precision and 98% F1 for a specific crawling process whose similarity threshold value was determined automatically, a great result compared with the baseline.
   
