Title:
|
SUPERVISED DATA EXTRACTION |
Author(s):
|
N. Georgiev, J.M. Labat, J.L. Minel, L. Nicolas |
ISBN:
|
972-8924-02-X |
Editors:
|
Pedro Isaías and Miguel Baptista Nunes |
Year:
|
2005 |
Edition:
|
1 |
Keywords:
|
Wrappers, wrapper generation, data extraction, HTML parsing, information extraction, XPath. |
Type:
|
Full Paper |
First Page:
|
467 |
Last Page:
|
474 |
Language:
|
English |
Paper Abstract:
|
Data extraction from Internet sources has attracted the interest of the scientific community in recent years. However, there are still no well-established standards, because the information available on the Internet is heterogeneous in nature. Nevertheless, the data has one thing in common: for compatibility reasons, all of it is available in HTML format. This article presents our methodology and the prototype system we have created to extract data from HTML pages. We use XPath as the data extraction language and have developed a methodology for visual wrapper generation. Our approach takes advantage of the implicit correlation between the data and the surrounding structure. |
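As a rough illustration of the idea in the abstract (locating data through the structure that surrounds it, using XPath), the following is a minimal sketch and not the authors' prototype. The page fragment, the `products` table id, and the `extract_rows` helper are all hypothetical. It relies on the limited XPath subset in Python's standard `xml.etree.ElementTree`, and it assumes well-formed markup; real-world HTML would normally need a tolerant parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (an assumption for illustration only).
PAGE = """
<html><body>
  <table id="products">
    <tr><th>Name</th><th>Price</th></tr>
    <tr><td>Lamp</td><td>19.90</td></tr>
    <tr><td>Chair</td><td>45.00</td></tr>
  </table>
</body></html>
"""

def extract_rows(page):
    """Return one record per data row of the hypothetical 'products' table."""
    root = ET.fromstring(page)
    # The XPath expression acts as the wrapper: it selects rows by the
    # structure around the data (a table with a known id), not by the values.
    rows = root.findall(".//table[@id='products']/tr")
    records = []
    for row in rows:
        cells = row.findall("td")
        # The header row contains only <th> cells, so it is skipped here.
        if len(cells) == 2:
            records.append({"name": cells[0].text, "price": float(cells[1].text)})
    return records

records = extract_rows(PAGE)
print(records)
```

In this sketch the XPath expression plays the role of a hand-written wrapper; a visual wrapper generator, as described in the abstract, would presumably derive such expressions from examples a user selects on the page.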