Resources


PDFdigest/PDFdigestOCR evaluation datasets based on 27 bilingual (English/Spanish) papers from the SEPLN journal.

The evaluation datasets contain three sets of files:

  1. SEPLN papers for text-based PDFdigest eval: This dataset contains the original 27 SEPLN journal papers used to evaluate the PDFdigest text-based approach. Download here.

  2. SEPLN papers for image-based PDFdigestOCR eval: This dataset contains the image-based PDF versions of the original SEPLN papers used to evaluate the PDFdigestOCR image-based approach. The papers were converted to image-based PDFs using the following Linux tools, executed sequentially: 1) pdftoppm with a resolution of 300 DPI and 2) convert with the A4 layout format. Download here.

  3. SEPLN papers pdf2htmlEX annotated: This dataset contains the manually annotated SEPLN journal papers in GATE XML format. The annotations include: title(s), abstract(s), list of keywords, section headers (up to a depth of three levels), paragraphs, table captions, figure captions and bibliographic entries. The annotation was done over the HTML output of the pdf2htmlEX tool. Download here.
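
The two-step conversion described for the image-based dataset (item 2) could be reproduced roughly as in the Python sketch below. The exact command-line options used for the dataset are not stated above, so the flags here are assumptions: `-r 300 -png` for pdftoppm and `-page A4` for ImageMagick's convert. The function names and the `page` file prefix are hypothetical and chosen only for illustration.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf_path, image_prefix, dpi=300):
    """Build the pdftoppm call that renders each page as a PNG at the given DPI."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(image_prefix)]


def assemble_command(page_images, output_pdf):
    """Build the ImageMagick convert call that packs page images into one PDF."""
    # Assumption: "-page A4" is how the A4 layout is requested.
    return ["convert", "-page", "A4", *map(str, page_images), str(output_pdf)]


def to_image_pdf(pdf_path, work_dir, output_pdf):
    """Run both steps in sequence; needs pdftoppm and convert on the PATH."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    prefix = work_dir / "page"
    # Step 1: one PNG per page, written as page-<n>.png inside work_dir.
    subprocess.run(rasterize_command(pdf_path, prefix), check=True)
    # Step 2: collect the page images in order and write the image-only PDF.
    pages = sorted(work_dir.glob("page-*.png"))
    subprocess.run(assemble_command(pages, output_pdf), check=True)
```

Calling `to_image_pdf("paper.pdf", "tmp_pages", "paper_image.pdf")` would run both tools on a single paper; the file names in that call are placeholders. Keeping the command construction in separate functions makes it easy to inspect or adjust the flags before anything is executed.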