Title:
|
EMPIRICAL EVALUATION OF CRF-BASED BIBLIOGRAPHY EXTRACTION FROM RESEARCH PAPERS |
Author(s):
|
Manabu Ohta, Ryohei Inoue, Atsuhiro Takasu |
ISBN:
|
978-972-8939-68-7 |
Editors:
|
Miguel Baptista Nunes, Pedro Isaías and Philip Powell |
Year:
|
2012 |
Edition:
|
Single |
Keywords:
|
Bibliography Extraction, Conditional Random Fields (CRF), Error Detection, OCR, Digital Library |
Type:
|
Full Paper |
First Page:
|
18 |
Last Page:
|
26 |
Language:
|
English |
Paper Abstract:
|
We proposed an automatic bibliography extraction method for research papers scanned with OCR markup. The method uses conditional random fields (CRF) to label serially OCRed text lines in the article title page as appropriate bibliographic element names. Although we achieved good extraction accuracies for some Japanese academic journals, extraction errors are inevitable. Therefore, this paper proposes three confidence measures for bibliography labeling to detect such extraction errors. This paper also reports an empirical evaluation of CRF-based page analysis for research papers on the basis not only of labeling accuracy but also of labeling error detection. We applied the three confidence measures to labeling three academic journals published in Japan. The experiments showed that the proposed confidence measures reasonably indicated the labeling accuracies and could be used for error detection. |