Title:
|
DEVELOPMENT OF A FOCUSED WEB PAGE CRAWLER
BASED ON GENRE AND CONTENT |
Author(s):
|
Marcelo Trajano Alves Júnior, Marcos Felipe Pontes Rezende and Guilherme Tavares de Assis |
ISBN:
|
978-989-8704-34-4 |
Editors:
|
Pedro Isaías and Hans Weghorn |
Year:
|
2021 |
Edition:
|
Single |
Type:
|
Full |
First Page:
|
77 |
Last Page:
|
84 |
Language:
|
English |
Paper Abstract:
|
Focused crawlers collect pages that satisfy a particular property and are relevant to a specific
topic of interest, and they are important for a wide variety of applications. For particular situations, a focused crawling
approach was proposed and developed where the topic of interest can be expressed by terms that describe the genre and
content of the desired web pages. To improve the efficiency and effectiveness of this original genre-aware
approach to focused crawling, the following improvements have been proposed, developed and validated: a relevant page
location policy based on Link Context, semi-automatic seed page determination, automatic similarity threshold definition
and automatic refinement of the genre and content term sets. In this context, this work develops a complete and
functional version of a crawler, called Yucca, that follows the original genre-aware approach to focused crawling and incorporates the
improvements already developed and validated, so that it can be used by different users in a simple and robust way. To
validate Yucca, experiments were performed involving the crawling of web pages referring to three distinct and current
topics of interest. In general, Yucca proved to be an effective focused crawler: the crawling processes carried out achieved
satisfactory levels of precision, reaching more than 80% on average when considering 10
pages returned as relevant by the crawler. |