Ripping an image-based PDF to text (and old TurboBASIC commands)
Recently, I wanted to pull text from a PDF that was scanned from an old manual, namely the Borland Turbo Basic manual. The text already in the document was garbage and could not be salvaged, so I decided to write something that would allow me to:
1. Load a PDF
2. Rip the images from the PDF
3. OCR the images to generate the text
So, here it is.
import fitz  # PyMuPDF
from PIL import Image
import pytesseract
import io

# Set the path to the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def pdf_to_text(pdf_path: str):
    # Open the PDF file
    pdf_document = fitz.open(pdf_path)
    text_output = []

    # Iterate through each page in the PDF
    for page_num in range(pdf_document.page_count):
        print(f"Processing page {page_num + 1} of {pdf_document.page_count}...")

        # Get the page object
        page = pdf_document.load_page(page_num)

        # Convert the page to a pixmap (image)
        pix = page.get_pixmap()

        # Convert the pixmap into a PIL image
        img = Image.open(io.BytesIO(pix.tobytes("png")))

        # Run OCR on the image using pytesseract
        page_text = pytesseract.image_to_string(img)

        # Append the extracted text to the list
        text_output.append(f"Page {page_num + 1}:\n{page_text}\n")

    # Close the document after processing
    pdf_document.close()

    # Join all pages' text into a single string
    return "\n".join(text_output)


if __name__ == "__main__":
    pdf_path = "BorTurboBasic.pdf"  # Replace with the path to your PDF file
    extracted_text = pdf_to_text(pdf_path)

    # Save the text to a file
    with open("output_text.txt", "w", encoding="utf-8") as f:
        f.write(extracted_text)

    print("OCR complete. Text extracted and saved to 'output_text.txt'.")
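The script hardcodes a Windows path for tesseract_cmd, which will not match every machine. Here is a rough sketch of one way to look the executable up instead, using only the standard library. The function name find_tesseract and the fallback paths in DEFAULT_CANDIDATES are my own assumptions for illustration, not anything pytesseract provides, so adjust them to wherever Tesseract actually lives on your system.

```python
import os
import shutil

# Hypothetical fallback locations (assumptions); edit these for your own machine.
DEFAULT_CANDIDATES = (
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "/usr/bin/tesseract",
    "/usr/local/bin/tesseract",
    "/opt/homebrew/bin/tesseract",
)


def find_tesseract(candidates=DEFAULT_CANDIDATES, which=shutil.which):
    """Return a path to the Tesseract executable, or None if none is found."""
    # Prefer whatever is already on PATH.
    found = which("tesseract")
    if found:
        return found
    # Otherwise try the fallback locations in order.
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None
```

If it returns a path, assign that to pytesseract.pytesseract.tesseract_cmd before calling pdf_to_text; if it returns None, Tesseract still needs to be installed or pointed to by hand.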
You need to install Tesseract for the OCR step to work, and set the pytesseract.pytesseract.tesseract_cmd variable to the location of its executable. The result is not perfect, but it’s better than nothing. If the text comes out badly garbled, it works to simply dump it into something like Google Gemini and have it pull out the info you need.
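Before handing the text to anything else, a little mechanical cleanup can help. The following is only a sketch: clean_ocr_text is a made-up helper name, and the three rules in it are my own guesses at common OCR noise, not something Tesseract prescribes. Note that the de-hyphenation rule will also join words that were genuinely hyphenated at a line end, so check the results.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Apply a few conservative cleanups to raw OCR output."""
    # Rejoin words split by a hyphen at the end of a line: "state-\nment" -> "statement".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Strip trailing spaces and tabs at the end of each line.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

For example, running it over the string returned by pdf_to_text before writing output_text.txt would tidy up split words and excess blank lines without touching the actual wording.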
Speaking of which, here is a cheat-sheet formatted and parsed from an old Turbo Basic manual.
Turbo Basic Commands (smackaay.com)
Enjoy!