Products.PDFtoOCR (1.1)
- Products.PDFtoOCR download link: http://plone.org/products/pdftoocr/releases
- Homepage of Products.PDFtoOCR: http://plone.org/products/pdftoocr
- Products.PDFtoOCR repository: https://svn.plone.org/svn/collective/Products.PDFtoOCR/trunk/
- Description source: https://svn.plone.org/svn/collective/Products.PDFtoOCR/trunk/README.txt
Configuration
On the operating system
PDF to OCR uses three tools that are available under Linux. Integration with these tools has only been tested on Debian, but it will probably work in other *nix environments.
Install the requirements. PDF to OCR uses the following programs:
- pdftotext, checks if OCR processing is necessary
- ghostscript, converts the pdf documents to tiff images
- tesseract, does the OCR processing (make sure you've got all language packs!)
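The three tools listed above can be checked and chained roughly as in the following sketch. It is not taken from the product's code: the executable name `gs` for Ghostscript, the `tiffg4` device, the 300 dpi resolution, the file names, and the helper names `missing_tools` and `build_commands` are all assumptions chosen for illustration. The flags themselves are commonly used invocations of each tool.

```python
import shutil

# Executable names are assumptions; Ghostscript is usually installed as "gs".
REQUIRED_TOOLS = ("pdftotext", "gs", "tesseract")


def missing_tools(which=shutil.which):
    """Return the required executables that cannot be found on PATH."""
    return [name for name in REQUIRED_TOOLS if which(name) is None]


def build_commands(pdf_path, tiff_pattern, tiff_page, out_base, lang="eng"):
    """Build one argument list per tool (hypothetical helper, illustrative flags)."""
    return {
        # "-" as the output file makes pdftotext write the text to stdout.
        "pdftotext": ["pdftotext", pdf_path, "-"],
        # Render every page of the PDF to a TIFF file (one file per page).
        "gs": [
            "gs", "-q", "-dNOPAUSE", "-dBATCH",
            "-sDEVICE=tiffg4", "-r300",
            "-sOutputFile=" + tiff_pattern, pdf_path,
        ],
        # tesseract writes its result to <out_base>.txt.
        "tesseract": ["tesseract", tiff_page, out_base, "-l", lang],
    }


if __name__ == "__main__":
    print("Missing tools:", missing_tools())
    for name, argv in build_commands(
        "doc.pdf", "page-%03d.tif", "page-001.tif", "page-001"
    ).items():
        print(name, "->", " ".join(argv))
```

Each argument list can be passed to `subprocess.run` once the tools are installed. The `-l` value must match an installed tesseract language pack.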
On the Plone site
Add a content rule
- Event trigger: Object modified
- Condition: Content type is file
- Actions: Store OCR output from a PDF in searchable text
Assign content rule to a Plone site or a folder
Install cron4plone and add the following cronjob: portal/@@do_pdf_ocr_index
PDF Processing
Each time a file is added or modified, the unique id (uid) of the file is added to a queue. This queue is persistent and serves two functions: indexing and reindexing. The indexing function processes the documents currently in the queue. When reindexing is used, all files in the queue history are processed.
If the text of a PDF document can be extracted using pdftotext, no OCR is done. Otherwise, OCR extracts the text and stores it in the File content type. ATFile is patched with an extra field to hold the extracted text and the language of the PDF.
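The queue behaviour described above could be modelled as in this minimal sketch. It is an in-memory stand-in, not the product's persistent implementation: the class name `OcrQueue`, its method names, and the choice to skip duplicate uids are assumptions made for illustration.

```python
class OcrQueue:
    """In-memory stand-in for the persistent queue (hypothetical names)."""

    def __init__(self):
        self.pending = []  # uids waiting to be indexed
        self.history = []  # uids that have been processed at least once

    def add(self, uid):
        """Queue a uid when its file is added or modified."""
        # Assumption: a uid that is already waiting is not queued twice.
        if uid not in self.pending:
            self.pending.append(uid)

    def index(self, process):
        """Process every pending uid once and record it in the history."""
        done = []
        while self.pending:
            uid = self.pending.pop(0)
            process(uid)
            if uid in self.history:
                self.history.remove(uid)
            self.history.append(uid)
            done.append(uid)
        return done

    def reindex(self, process):
        """Process every uid recorded in the history again."""
        uids = list(self.history)
        for uid in uids:
            process(uid)
        return uids


if __name__ == "__main__":
    queue = OcrQueue()
    queue.add("uid-1")
    queue.add("uid-2")
    queue.index(lambda uid: print("indexing", uid))
    queue.reindex(lambda uid: print("reindexing", uid))
```

Passing the processing step in as a callable keeps the queue logic separate from the actual text extraction.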
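The decision between plain text extraction and OCR can be sketched as follows. The function name `extract_text` and the two injected callables are hypothetical, and treating whitespace-only output as "no text" is an assumption about how the check might work, not something the description confirms.

```python
def extract_text(pdf_path, run_pdftotext, run_ocr):
    """Return (text, used_ocr) for a PDF.

    run_pdftotext and run_ocr are callables taking the PDF path and
    returning a string; both are placeholders for the real tool wrappers.
    """
    text = run_pdftotext(pdf_path)
    # Assumption: any non-whitespace output means the PDF already has text.
    if text.strip():
        return text, False
    # No usable text layer, so fall back to the OCR path.
    return run_ocr(pdf_path), True


if __name__ == "__main__":
    # A scanned PDF yields no text from pdftotext, so OCR is used.
    text, used_ocr = extract_text(
        "scan.pdf",
        run_pdftotext=lambda path: "",
        run_ocr=lambda path: "text recognised by OCR",
    )
    print(used_ocr, text)
```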
Page views:
- @@do_pdf_ocr_index - indexes documents in the queue
- @@do_pdf_ocr_reindex - reindexes all PDF documents in the Plone site
- @@pdf_ocr_status - shows the queue and a history of 10 documents
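The views listed above are addressed by appending the view name to the URL of the Plone site. The sketch below shows one way to build and request such a URL from a script. The site URL `http://localhost:8080/portal` and the helper names are assumptions, and the sketch does not handle authentication, which the views may require.

```python
from urllib.request import urlopen

# View names as listed above.
VIEWS = ("@@do_pdf_ocr_index", "@@do_pdf_ocr_reindex", "@@pdf_ocr_status")


def view_url(site_url, view):
    """Return the URL of one of the PDF to OCR views on a site."""
    if view not in VIEWS:
        raise ValueError("unknown view: %r" % (view,))
    return site_url.rstrip("/") + "/" + view


def call_view(site_url, view, timeout=60):
    """Request a view and return the HTTP status code (no authentication)."""
    with urlopen(view_url(site_url, view), timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical local site; adjust host, port, and site id as needed.
    print(view_url("http://localhost:8080/portal", "@@pdf_ocr_status"))
```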