Title:
|
DATA EXTRACTION FROM WEB DATABASE QUERY RESULT PAGES VIA TAGSETS AND INTEGER SEQUENCES |
Author(s):
|
Jerome Robinson |
ISBN:
|
972-98947-1-X |
Editors:
|
Pedro Isaías and Nitya Karmakar |
Year:
|
2003 |
Edition:
|
2 |
Keywords:
|
Web database, data extraction, wrapper. |
Type:
|
Full Paper |
First Page:
|
145 |
Last Page:
|
152 |
Language:
|
English |
Paper Abstract:
|
The World Wide Web is a collection of databases as well as web sites. Databases associated with web sites provide public access via query forms on web pages. They constitute an enormous repository of searchable data on an extremely diverse collection of subjects, ranging from multimedia collections through archives of subject-specific data to current information such as currency conversion rates, interest rates, news, and weather reports. Many interesting and valuable database applications could be developed if these databases were easily and reliably accessible to programs. The difficulty in extracting data lies in the number of different web page formats and their tendency to change suddenly. A rapid page analysis and wrapper creation system is needed to generate and maintain a data extraction facility for any required web site. This important goal has been the subject of substantial recent research, which models the web results page in various ways.
The purpose of the current paper is to introduce a new method for rapid page analysis using the recurrence patterns of tagSet occurrence in the page. |