Title:
|
SEMI-AUTOMATIC GENERATION OF SEED PAGES IN GENRE-AWARE FOCUSED CRAWLING |
Author(s):
|
Vítor Mangaravite, Guilherme Tavares de Assis, Anderson Almeida Ferreira, Flávio Luis Cardeal Pádua |
ISBN:
|
978-989-8533-24-1 |
Editors:
|
Pedro Isaías and Bebo White |
Year:
|
2014 |
Edition:
|
Single |
Keywords:
|
Seed pages, meta-crawling, focused crawling. |
Type:
|
Full Paper |
First Page:
|
51 |
Last Page:
|
58 |
Language:
|
English |
Paper Abstract:
|
Focused crawlers attempt to crawl web pages that are relevant to a specific topic or user interest. Although these crawlers have been shown to be effective, their efficiency still needs improvement. A focused crawler usually maintains a priority queue, called the Frontier, which is initialized with the URLs of seed pages specified manually by users and then drives the visiting of web pages and the gathering of relevant ones. If the seed pages are not well specified, the efficiency of the crawling process may be unsatisfactory. Thus, in this work, we propose and evaluate a strategy for the semi-automatic generation of seed pages to improve the efficiency of a genre-aware focused crawler. Our experimental evaluation shows, in some situations, an improvement of around 360% in the efficiency of crawling processes. |
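To make the role of the Frontier concrete, the following is a minimal sketch of a priority-queue crawl loop seeded with a list of URLs. It is not the crawler evaluated in the paper: the function name `focused_crawl`, the in-memory link graph `TOY_LINKS`, and the relevance scores in `TOY_SCORES` are all hypothetical stand-ins chosen for illustration, and no network access is performed.

```python
import heapq
from typing import Callable, Dict, List


def focused_crawl(
    seeds: List[str],
    links: Dict[str, List[str]],
    score: Callable[[str], float],
    max_pages: int,
) -> List[str]:
    """Visit up to max_pages URLs, always taking the highest-scored one next."""
    unique_seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score(url), url) for url in unique_seeds]
    heapq.heapify(frontier)
    seen = set(unique_seeds)
    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        # Enqueue outgoing links that have not been queued before.
        for out in links.get(url, []):
            if out not in seen:
                seen.add(out)
                heapq.heappush(frontier, (-score(out), out))
    return visited


# Hypothetical toy web: page -> outgoing links (assumption, not real data).
TOY_LINKS = {"seed": ["a", "b"], "a": ["c"], "b": ["d"]}
# Hypothetical relevance estimates for each page (assumption, not real data).
TOY_SCORES = {"seed": 1.0, "a": 0.2, "b": 0.9, "c": 0.1, "d": 0.8}

order = focused_crawl(["seed"], TOY_LINKS, lambda u: TOY_SCORES.get(u, 0.0), 4)
print(order)
```

In this sketch the seed list is the only entry point into the link graph, so a poorly chosen seed would leave the higher-scored pages unreachable or reachable only after many low-value visits, which is the efficiency concern the abstract raises.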