Digital Library

Title:      DEVELOPMENT OF A FOCUSED WEB PAGE CRAWLER BASED ON GENRE AND CONTENT
Author(s):      Marcelo Trajano Alves Júnior, Marcos Felipe Pontes Rezende and Guilherme Tavares de Assis
ISBN:      978-989-8704-34-4
Editors:      Pedro Isaías and Hans Weghorn
Year:      2021
Edition:      Single
Type:      Full
First Page:      77
Last Page:      84
Language:      English
Paper Abstract:      Focused crawlers collect web pages that satisfy a particular property and are relevant to a specific topic of interest, and they are important for a wide variety of applications. For situations where the topic of interest can be expressed by terms that describe the genre and content of the desired web pages, a genre-aware focused crawling approach was previously proposed and developed. To improve the efficiency and effectiveness of that original approach, the following improvements have since been proposed, developed and validated: a relevant-page location policy based on Link Context, semi-automatic seed page determination, automatic similarity threshold definition, and automatic refinement of the genre and content term sets. This work develops a complete and functional crawler, called Yucca, that follows the original genre-aware approach and incorporates these validated improvements, so that different users can apply it in a simple and robust way. To validate Yucca, experiments were performed involving the crawling of web pages on three distinct, current topics of interest. Overall, Yucca proved to be an effective focused crawler: the crawling processes achieved satisfactory precision, exceeding 80% on average when considering 10 pages returned as relevant by the crawler.
   
