09.15.24

Ripping an image-based PDF to text (and old TurboBASIC commands)

Recently, I wanted to pull text from a PDF that was scanned from an old manual, namely the Borland Turbo Basic manual. The text embedded in the document was garbage and there was nothing to be done with it, so I decided to write something that would allow me to:

1. Load a PDF
2. Rip the images from the PDF
3. OCR the images to generate the text

So, here it is.

import fitz  # PyMuPDF
from PIL import Image
import pytesseract
import io

# Set the path to the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

def pdf_to_text(pdf_path: str):
    # Open the PDF file
    pdf_document = fitz.open(pdf_path)
    text_output = []

    # Iterate through each page in the PDF
    for page_num in range(pdf_document.page_count):
        print(f"Processing page {page_num + 1} of {pdf_document.page_count}...")
        
        # Get the page object
        page = pdf_document.load_page(page_num)
        
        # Convert the page to a pixmap (image)
        pix = page.get_pixmap()
        
        # Convert the pixmap into a PIL image
        img = Image.open(io.BytesIO(pix.tobytes("png")))
        
        # Run OCR on the image using pytesseract
        page_text = pytesseract.image_to_string(img)

        # Append the extracted text to the list
        text_output.append(f"Page {page_num + 1}:\n{page_text}\n")

    # Close the document after processing
    pdf_document.close()

    # Join all pages' text into a single string
    return "\n".join(text_output)

if __name__ == "__main__":
    pdf_path = "BorTurboBasic.pdf"  # Replace with the path to your PDF file
    extracted_text = pdf_to_text(pdf_path)

    # Save the text to a file
    with open("output_text.txt", "w", encoding="utf-8") as f:
        f.write(extracted_text)

    print("OCR complete. Text extracted and saved to 'output_text.txt'.")
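
One tweak worth trying if the OCR comes out rough: the listing above calls get_pixmap() with no arguments, which renders at PyMuPDF's default of 72 DPI, and OCR engines generally do better with more pixels per character. Here is a minimal sketch of rendering at a higher resolution by passing a scaling matrix. The helper names (zoom_for_dpi, render_page) and the 300 DPI default are my own assumptions for illustration, not something tuned against this particular manual.

```python
import io


def zoom_for_dpi(dpi: int, base_dpi: int = 72) -> float:
    """Scale factor that takes PyMuPDF's 72 DPI default up to the target DPI."""
    return dpi / base_dpi


def render_page(page, dpi: int = 300):
    """Render a PyMuPDF page to a PIL image at roughly the requested DPI.

    Hypothetical helper; the 300 DPI default is an assumption, so experiment.
    """
    # Imported here so zoom_for_dpi() can be used without these packages.
    import fitz  # PyMuPDF
    from PIL import Image

    zoom = zoom_for_dpi(dpi)
    # The matrix scales the page in both directions before rasterizing.
    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
    return Image.open(io.BytesIO(pix.tobytes("png")))
```

To try it, the two lines in pdf_to_text that build pix and img could be swapped for img = render_page(page). Expect each page to take longer and use more memory, since the image has roughly 17 times as many pixels at 300 DPI.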

You need to install Tesseract for the OCR step and put the location of its executable into the pytesseract.pytesseract.tesseract_cmd variable. The script also needs the PyMuPDF, Pillow and pytesseract Python packages. It’s not perfect and it’s not great, but it’s better than nothing. If the text comes out super garbled, it works to simply dump it into something like Google Gemini and have it pull out the info you need.
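
If you don't want to hard-code the Tesseract location, something like the following might work. It is only a sketch: it checks the system PATH first and then falls back to a few guessed install locations. The function names and the candidate list are assumptions of mine, so adjust them for your own machine.

```python
import os
import shutil
from typing import Optional, Sequence

# Hypothetical fallback locations; these are guesses, not an exhaustive list.
DEFAULT_CANDIDATES = (
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "/usr/bin/tesseract",
    "/usr/local/bin/tesseract",
    "/opt/homebrew/bin/tesseract",
)


def find_tesseract(candidates: Sequence[str] = DEFAULT_CANDIDATES) -> Optional[str]:
    """Return a path to the Tesseract executable, or None if none is found."""
    # Prefer whatever is already on the PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise take the first candidate that exists as a file.
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


def configure_pytesseract() -> str:
    """Point pytesseract at the located executable, or raise if there is none."""
    import pytesseract  # imported here so find_tesseract() works without it

    path = find_tesseract()
    if path is None:
        raise FileNotFoundError("Tesseract not found; install it or extend DEFAULT_CANDIDATES.")
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Calling configure_pytesseract() once at the top of the script would then replace the hard-coded tesseract_cmd assignment.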

Speaking of which, here is a cheat-sheet formatted and parsed from an old Turbo Basic manual.

Turbo Basic Commands (smackaay.com)

Enjoy!
