Title:
|
AUTOMATIC DETERMINATION OF SIMILARITY THRESHOLD FOR FOCUSED CRAWLING PROCESSES ON WEB PAGES |
Author(s):
|
Gustavo Oliveira de Siqueira, Guilherme Tavares de Assis, Anderson Almeida Ferreira, Amanda Sávio Nascimento e Silva, Vítor Mangaravite, Flávio Luis Cardeal Pádua |
ISBN:
|
978-989-8533-57-9 |
Editors:
|
Pedro Isaías |
Year:
|
2016 |
Edition:
|
Single |
Keywords:
|
Similarity threshold, web crawling, focused crawling |
Type:
|
Full Paper |
First Page:
|
95 |
Last Page:
|
102 |
Language:
|
English |
Paper Abstract:
|
The great popularity and, especially, the fast growth of the Web have led to the proposal and analysis of new techniques that help users locate the information they need effectively and within a satisfactory time. Traditional crawlers are not capable of identifying relevant sub-spaces of the Web related to a specific theme; focused crawlers, however, can solve this problem effectively and efficiently. Usually, a focused crawling process requires a specific value, called the similarity threshold, to determine whether a crawled Web page is relevant to a topic of interest; this value is distinct for each topic. In this work, we propose three strategies for automatically determining this value for focused crawlers that follow a genre-aware approach. In our experimental evaluation, the best result was 100% precision and 98% F1 for a specific crawling process whose similarity threshold was determined automatically, a strong result compared with the baseline. |