Title:
|
WPPS: A NOVEL AND COMPREHENSIVE FRAMEWORK FOR WEB PAGE UNDERSTANDING AND INFORMATION EXTRACTION |
Author(s):
|
Ruslan R. Fayzrakhmanov |
ISBN:
|
978-989-8533-09-8 |
Editors:
|
Bebo White and Pedro Isaías |
Year:
|
2012 |
Edition:
|
Single |
Keywords:
|
Web information extraction, web page understanding, ontological models, object oriented paradigm, declarative approach, bridged adapter |
Type:
|
Full Paper |
First Page:
|
19 |
Last Page:
|
26 |
Language:
|
English |
Paper Abstract:
|
In this paper, we present WPPS, a new, highly configurable Java-based framework for developing efficient and robust methods for web page understanding and information extraction. We also introduce a representation of a web page as a unified ontological model (UOM) that describes its different aspects, such as layout, visual features, interface, DOM tree, and logical structure, together with their features and relations. The API provided for developing new methods makes it possible to combine a declarative approach, expressed as a set of inference rules and SPARQL queries, with an object-oriented approach. The latter is realised by providing the level of abstraction needed to work with ontological concepts as Java classes. This abstraction is achieved via the bridged adapter software design pattern, which is introduced in this paper. We illustrate the framework with an example scenario involving a web page navigation menu. The framework and the UOM have demonstrated their efficiency in the ABBA and TAMCROW projects. |