Ripping an image based PDF to text (and old TurboBASIC commands)
Recently, I wanted to pull the text out of a PDF that was scanned from an old manual, namely the Borland Turbo Basic manual. The text embedded in the document was garbage and there was nothing to be done with it, so I decided to write something that would allow me to:
1. Load a PDF
2. Rip the images from the PDF
3. OCR the images to generate the text
So, here it is.
import fitz  # PyMuPDF
from PIL import Image
import pytesseract
import io

# Set the path to the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

def pdf_to_text(pdf_path: str):
    # Open the PDF file
    pdf_document = fitz.open(pdf_path)
    text_output = []
    # Iterate through each page in the PDF
    for page_num in range(pdf_document.page_count):
        print(f"Processing page {page_num + 1} of {pdf_document.page_count}...")
        # Get the page object
        page = pdf_document.load_page(page_num)
        # Convert the page to a pixmap (image)
        pix = page.get_pixmap()
        # Convert the pixmap into a PIL image
        img = Image.open(io.BytesIO(pix.tobytes("png")))
        # Run OCR on the image using pytesseract
        page_text = pytesseract.image_to_string(img)
        # Append the extracted text to the list
        text_output.append(f"Page {page_num + 1}:\n{page_text}\n")
    # Close the document after processing
    pdf_document.close()
    # Join all pages' text into a single string
    return "\n".join(text_output)

if __name__ == "__main__":
    pdf_path = "BorTurboBasic.pdf"  # Replace with the path to your PDF file
    extracted_text = pdf_to_text(pdf_path)
    # Save the text to a file
    with open("output_text.txt", "w", encoding="utf-8") as f:
        f.write(extracted_text)
    print("OCR complete. Text extracted and saved to 'output_text.txt'.")
You need to install Tesseract for the OCR step to work, and put the path to its executable into the pytesseract.pytesseract.tesseract_cmd variable. The output is not perfect, and honestly not great, but it’s better than nothing. If the text comes out super garbled, it works to simply dump it into something like Google Gemini and have it pull out the info you need.
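If you do end up pasting the OCR output into an LLM, a light cleanup pass beforehand can cut down the noise. Here is a minimal sketch using only the standard library. The function name clean_ocr_text and the 0.5 threshold are my own assumptions for illustration, not part of the script above, and the heuristics are deliberately crude.

```python
import re

def clean_ocr_text(text: str, min_alnum_ratio: float = 0.5) -> str:
    """Tidy raw OCR output (hypothetical helper, heuristics are assumptions)."""
    # Rejoin words hyphenated across a line break: "state-\nment" -> "statement"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    kept = []
    for line in text.splitlines():
        # Collapse runs of spaces and tabs inside the line
        line = re.sub(r"[ \t]+", " ", line).strip()
        if not line:
            kept.append("")
            continue
        # Drop lines that are mostly symbols, which is usually scanner noise
        readable = sum(ch.isalnum() or ch.isspace() for ch in line)
        if readable / len(line) >= min_alnum_ratio:
            kept.append(line)
    # Collapse three or more newlines into a single blank line
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()

if __name__ == "__main__":
    print(clean_ocr_text("PRINT  state-\nment\n~~^^||\n\n\n\nEND"))
```

Lower min_alnum_ratio if it starts eating real lines; BASIC listings carry a fair amount of punctuation.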
Speaking of which, here is a cheat-sheet formatted and parsed from an old Turbo Basic manual.
Turbo Basic Commands (smackaay.com)
Enjoy!
08.25.24 Demographic charts
So, I’m pretty interested in the state of the world and how populations in wealthier countries are starting to fall. I was looking at charts, population pyramid charts to be exact, and I found that population pyramids, while interesting, don’t show the decline in as interesting a way as a line graph. So I took it upon myself to grab the info and turn it into line graphs that can be viewed more easily. Below are a bunch for your perusal.
Working age population – World
Working age population – Canada
Working age population – Japan
Working age population – Least Developed Countries
Working age population – More Developed Countries
Working age population – Russian Federation
Working age population – United Kingdom
Working age population – United States
Population – World
Population – Canada
Population – Japan
Population – Least Developed Countries
Population – More Developed Countries
Population – Russian Federation
Population – United Kingdom
Population – United States
Population Percentage – World
Population Percentage – Canada
Population Percentage – Japan
Population Percentage – Least Developed Countries
Population Percentage – More Developed Countries
Population Percentage – Russian Federation
Population Percentage – United Kingdom
Population Percentage – United States
Working Population Percentage – World
Working Population Percentage – Canada
Working Population Percentage – Japan
Working Population Percentage – Least Developed Countries
Working Population Percentage – More Developed Countries
Working Population Percentage – Russian Federation
Working Population Percentage – United Kingdom
Working Population Percentage – United States
There we go, it’s pretty straightforward. You can click the labels on the side to turn certain lines on or off to highlight what you want to see. Clicking on a link will open a new window. If you change the window’s size, please refresh to make the chart fit your window. To explain, we have a number of data sets here: the population of a given region, the percentage of population of a given region, the population of people of working age of a given region and finally the percentage of the working population of a given region. I chose regions/countries that made sense to me. I could’ve picked China but I don’t trust their numbers on anything, even bad numbers. Enjoy!
I think one of the neat things to look at as a Canadian is the percentage-based charts, particularly those of working age. I separated the charts out on a percentage basis into the groups 0-14, 15-64 and 65+. You look at these things and you can see why we’re having some issues. For example, we here in Canada have 45% of the children we had in 1960 on a per capita basis. We also crossed the threshold of having more 65+ people than working age people back in 2010; the line was crossed for women back in 1988.
Another thing I see is that, for Canada, the percentage of adults goes up in bumps over the years but not so much for children. I suspect that is because, despite there being so much immigration in certain spots, the newcomers are not having that many children either. That’s interesting to me as well.
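For anyone wanting to reproduce the percentage charts, the arithmetic is just each age group's share of the total, and the "per capita" comparison is the ratio of a group's share between two years. A minimal sketch follows. The numbers in it are made up purely to show the calculation and are not real census data, and the function names are hypothetical.

```python
def age_shares(counts):
    """Return each age group's percentage of the total population."""
    total = sum(counts.values())
    return {group: 100 * n / total for group, n in counts.items()}

def share_ratio(then, now, group):
    """How large a group's share is now relative to an earlier year."""
    return age_shares(now)[group] / age_shares(then)[group]

if __name__ == "__main__":
    # Hypothetical counts (arbitrary units), NOT real data
    then = {"0-14": 30, "15-64": 60, "65+": 10}
    now = {"0-14": 15, "15-64": 65, "65+": 20}
    print(age_shares(now))
    print(share_ratio(then, now, "0-14"))
```

With these invented figures the share of children halves, so the ratio comes out at 0.5. A statement like "45% of the children on a per capita basis" corresponds to a ratio of 0.45 computed the same way.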
Anyways, hope you enjoy. These charts are an interesting way to look at demographics in a different way from pyramid charts.
| Posted in Miscellaneous stuff | Comments Off on Demographic charts
New prompt permutation script
A while back I made a prompt permutation script for generating large numbers of image prompts for use in automatic1111. I’ve updated it with a new operator, the incremental operator ‘&’, which cycles through the list items in order instead of choosing random ones the way ‘%’ does. Here is a sample prompt and output. It’s basically a fancy search and replace, but I use it quite often.
photo, a %SIZE brutalist &BUILDING on a sunny day (this is the base prompt)
photo, painting
brutalist, post-modern, deconstructivism
sunny day, night time
%SIZE, small, medium, large, huge
&BUILDING, house, tower, factory, school
Output:
photo, a small brutalist house on a sunny day
photo, a huge brutalist tower on a night time
photo, a medium post-modern factory on a sunny day
photo, a medium post-modern school on a night time
photo, a medium deconstructivism house on a sunny day
photo, a large deconstructivism tower on a night time
painting, a medium brutalist factory on a sunny day
painting, a small brutalist school on a night time
painting, a huge post-modern house on a sunny day
painting, a small post-modern tower on a night time
painting, a large deconstructivism factory on a sunny day
painting, a medium deconstructivism school on a night time
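To see the difference between the two operators in isolation, here is a stripped-down sketch that only fills in placeholders and skips the permutation step. The function name fill_placeholders and the rng parameter are assumptions made for this illustration; the counter-modulo-length idea for ‘&’ is the same one the full script uses.

```python
import random

def fill_placeholders(prompts, random_mods, increment_mods, rng=random):
    """Replace %-placeholders randomly and &-placeholders in rotation."""
    counters = {key: 0 for key in increment_mods}
    filled = []
    for prompt in prompts:
        # '%' placeholders: pick any value at random
        for key, values in random_mods.items():
            if key in prompt:
                prompt = prompt.replace(key, rng.choice(values), 1)
        # '&' placeholders: take the next value, wrapping around at the end
        for key, values in increment_mods.items():
            if key in prompt:
                prompt = prompt.replace(key, values[counters[key] % len(values)], 1)
                counters[key] += 1
        filled.append(prompt)
    return filled

if __name__ == "__main__":
    base = ["a %SIZE &BUILDING"] * 5
    sizes = {"%SIZE": ["small", "medium", "large", "huge"]}
    buildings = {"&BUILDING": ["house", "tower", "factory", "school"]}
    for line in fill_placeholders(base, sizes, buildings):
        print(line)
```

The sizes will differ from run to run, while the buildings always come out as house, tower, factory, school and then house again.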
Anyways, here is the Python script, along with an HTML version so you can use it with an interface of sorts.
http://smackaay.com/files/ppermute/ppermute.html The little webpage for it.
Here is the Python script.
import itertools
import random

# File path assignments
INPUT_FILE_PATH = 'img5.txt'  # Change this to the path of your input file
OUTPUT_FILE_PATH = 'output.txt'  # Change this to the desired path for the output file

def load_file(file_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()
    return [line.strip() for line in lines]

def generate_permutations(prompt, modifiers, random_modifiers, increment_modifiers):
    all_combinations = list(itertools.product(*modifiers))
    permutations = []
    increment_counters = {key: 0 for key in increment_modifiers.keys()}
    for combination in all_combinations:
        new_prompt = prompt
        for original, replacement in zip(modifiers, combination):
            new_prompt = replace_first(new_prompt, original[0], replacement)
        # Handle random modifiers
        for placeholder, values in random_modifiers.items():
            if placeholder in new_prompt:
                replacement = random.choice(values)
                new_prompt = new_prompt.replace(placeholder, replacement, 1)
        # Handle increment modifiers
        for placeholder, values in increment_modifiers.items():
            if placeholder in new_prompt:
                replacement = values[increment_counters[placeholder] % len(values)]
                new_prompt = new_prompt.replace(placeholder, replacement, 1)
                increment_counters[placeholder] += 1
        # Remove placeholders from the final prompt
        new_prompt = remove_placeholders(new_prompt, random_modifiers.keys() | increment_modifiers.keys())
        permutations.append(new_prompt)
    return permutations

def replace_first(text, search, replacement):
    if search not in text:
        raise ValueError(f"Term '{search}' not found in the prompt.")
    return text.replace(search, replacement, 1)

def remove_placeholders(text, placeholders):
    for placeholder in placeholders:
        text = text.replace(placeholder, "")
    return text

def save_to_file(output_path, permutations):
    with open(output_path, 'w') as file:
        for permutation in permutations:
            file.write(permutation + '\n')

def main():
    lines = load_file(INPUT_FILE_PATH)
    if not lines:
        print("The input file is empty.")
        return
    prompt = lines[0]
    modifiers = [line.split(', ') for line in lines[1:] if not line.startswith('%') and not line.startswith('&')]
    random_modifiers = {}
    increment_modifiers = {}
    for line in lines[1:]:
        if line.startswith('%'):
            parts = line.split(', ')
            key = parts[0]
            values = parts[1:]
            random_modifiers[key] = values
        elif line.startswith('&'):
            parts = line.split(', ')
            key = parts[0]
            values = parts[1:]
            increment_modifiers[key] = values
    try:
        all_permutations = generate_permutations(prompt, modifiers, random_modifiers, increment_modifiers)
        save_to_file(OUTPUT_FILE_PATH, all_permutations)
        print(f"Generated prompts have been saved to {OUTPUT_FILE_PATH}")
        print(f"Total number of permutations: {len(all_permutations)}")
    except ValueError as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    main()
| Posted in Personal stuff, Programming | Comments Off on New prompt permutation script
Gage Block Buildup Calculator
So, I have that Python source code on the side of my site there for calculating gage block buildups. I figured it was time to turn it into a JS program so that people can just access it from the web. Not super complicated but useful nonetheless.
http://smackaay.com/files/gbcalc/gbcalc.html
Features as follows:
- Imperial 81, 28, 34, 36 and 92 pc sets
- Metric 88, 47 and 112 pc sets
- Multiple (as many as you want) results
- The ability to remove blocks from the list if they are either missing or used in a previous buildup. This is handy.
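For the curious, here is a rough sketch of how a buildup search can work. It is not the code behind the calculator, just one simple approach: try 1 block, then 2, then 3 and so on until some combination adds up to the target. I'm assuming the common 81 pc imperial set layout (0.1001 to 0.1009, 0.101 to 0.149, 0.050 to 0.950 and 1 to 4 inches), and the function names and the max_blocks limit are made up for the example. Sizes are kept as whole ten-thousandths of an inch to avoid floating point rounding.

```python
def imperial_81_set():
    """Block sizes in units of 0.0001 inch (assumed common 81 pc layout)."""
    blocks = list(range(1001, 1010))           # 0.1001 .. 0.1009
    blocks += list(range(1010, 1500, 10))      # 0.101 .. 0.149
    blocks += list(range(500, 10000, 500))     # 0.050 .. 0.950
    blocks += list(range(10000, 50000, 10000)) # 1.000 .. 4.000
    return blocks

def find_buildup(target, blocks, missing=(), max_blocks=4):
    """Return the first buildup with the fewest blocks, or None."""
    pool = sorted(set(blocks) - set(missing), reverse=True)
    available = set(pool)

    def search(remaining, limit, count):
        # Find exactly `count` distinct blocks, each smaller than `limit`
        if count == 1:
            ok = remaining in available and remaining < limit
            return [remaining] if ok else None
        for block in pool:
            if block >= limit or block >= remaining:
                continue
            rest = search(remaining - block, block, count - 1)
            if rest is not None:
                return [block] + rest
        return None

    for count in range(1, max_blocks + 1):
        result = search(target, float("inf"), count)
        if result is not None:
            return result
    return None

if __name__ == "__main__":
    stack = find_buildup(12345, imperial_81_set())  # 1.2345 inch
    print([f"{b / 10000:.4f}" for b in stack])
    # Same target with the 0.134 block unavailable
    stack = find_buildup(12345, imperial_81_set(), missing={1340})
    print([f"{b / 10000:.4f}" for b in stack])
```

Passing block sizes in missing is the same idea as removing blocks from the list: the search simply never sees them and has to find another stack.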
Anyways, hope somebody out there enjoys this!
| Posted in Machining, Programming | Comments Off on Gage Block Buildup Calculator
The YouTube Recycle Bin
I was watching a video from the YouTuber KVN AUST (https://youtu.be/8uHFm6LK6PE?si=SLIaCEzNBx_iL97V). It featured a map for looking at and searching for odd videos across YouTube. It’s pretty fun just to see little slices of life or the weird things people would bother uploading, so I made a little JS proggy to generate the most common search terms.
Select the prefix and the type of random term you want to find, click on Generate Search Term and then click Search on YouTube. You can select No Spaces or With Quotes if certain things don’t work. The random date is anything in the last 20 years. Enjoy!
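The idea behind the generator is simple enough to sketch in a few lines of Python. This is not the JS from the page, only an illustration of the same steps. The prefix list (default camera file names), the date format and the function names are all assumptions on my part; the only thing taken from YouTube itself is the standard results URL with a search_query parameter.

```python
import random
from datetime import date, timedelta
from urllib.parse import quote_plus

# Assumed examples of default camera/phone file name prefixes
PREFIXES = ["IMG", "DSC", "MVI", "MOV"]

def random_term(prefix, kind, rng, no_spaces=False, with_quotes=False, today=None):
    """Build a search term such as 'IMG 0042' or 'MVI20110315'."""
    if kind == "number":
        body = f"{rng.randint(0, 9999):04d}"
    elif kind == "date":
        # Any day in roughly the last 20 years
        today = today or date.today()
        day = today - timedelta(days=rng.randint(0, 20 * 365))
        body = day.strftime("%Y%m%d")
    else:
        raise ValueError(f"Unknown term type: {kind}")
    term = f"{prefix}{body}" if no_spaces else f"{prefix} {body}"
    return f'"{term}"' if with_quotes else term

def search_url(term):
    """Turn a term into a YouTube search link."""
    return "https://www.youtube.com/results?search_query=" + quote_plus(term)

if __name__ == "__main__":
    rng = random.Random()
    term = random_term(rng.choice(PREFIXES), "number", rng, with_quotes=True)
    print(term)
    print(search_url(term))
```

The quotes option matters because an unquoted term like IMG 0042 tends to match on either word, while the quoted form asks for the exact title.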
YouTube Recycle Bin Search Generator
| Posted in Personal stuff, Programming | Comments Off on The YouTube Recycle Bin
A visit from an old friend, the boreGauge
A few years back we made a gauge for measuring large bores in hydraulic cylinders. It seems the company that bought it from us needed the software for it again, so I had to dig through my old source code to see if I had a recent version. Turns out I did. On this project I did the electronics and the software, and commissioned it.
To use the device, you place it in the bore, set your zeros and then measure the bore all the way down. This way you can see if there are any high spots, low spots or waviness. The software keeps track of the position as well, provides a CSV file of the data and plots it on the screen.
This was a pretty fun project, I might redesign it and make a more substantial attempt at monetizing it later.
StableDiffusion Permutation Script Update
So, like a week ago I wrote a script to make permutations for SD prompts. I’ve updated the script to allow for random terms as well. This lets you add variance to the prompt without adding to the number of permutations. Everything is explained in the comment block at the top of the code. Just change the filenames near the end of the script and run.
"""
Script: prompt_permutator.py

Description:
    This script generates permutations of a given prompt with various modifiers.
    The script reads an input file that contains a base prompt and lists of modifiers.
    It creates all possible combinations of the modifiers and generates new prompts
    based on these combinations. Additionally, it handles placeholders that are randomly
    replaced with specified values and ensures these placeholders are not included in the final output.

Input File Format:
    - The first line contains the base prompt.
    - Subsequent lines contain comma-separated lists of modifiers.
    - Lines starting with a placeholder (e.g., %1) are treated as random modifiers and are replaced
      with random values from the list provided.

Example Input File (test2.txt):
    A %1 flower on a hill, photorealistic
    on a hill, in a vase, on a bed
    photorealistic, manga
    %1, Red, Green, Blue

In this example:
    - The base prompt is: "A %1 flower on a hill, photorealistic"
    - The modifiers are: ["on a hill", "in a vase", "on a bed"] and ["photorealistic", "manga"]
    - The placeholder %1 will be replaced with a random choice from ["Red", "Green", "Blue"]

Output:
    The script generates all permutations of the prompt with the modifiers and replaces
    the placeholder with a random value. The results are saved to an output file.

Usage:
    1. Prepare an input file (e.g., 'test2.txt') following the described format.
    2. Set the input and output file paths near the end of the script.
    3. Run the script to generate the permutations and save them to the output file.

Example Execution:
    $ python prompt_permutator.py

Dependencies:
    - itertools
    - random

Author:
    Steven M

Date:
    May 28, 2024
"""
import itertools
import random

def load_file(file_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()
    return [line.strip() for line in lines]

def generate_permutations(prompt, modifiers, random_modifiers):
    # Create all combinations of modifiers
    all_combinations = list(itertools.product(*modifiers))
    permutations = []
    for combination in all_combinations:
        new_prompt = prompt
        for original, replacement in zip(modifiers, combination):
            new_prompt = replace_first(new_prompt, original[0], replacement)
        # Handle random modifiers
        for placeholder, values in random_modifiers.items():
            if placeholder in new_prompt:
                replacement = random.choice(values)
                new_prompt = new_prompt.replace(placeholder, replacement, 1)
        # Remove placeholders from the final prompt
        new_prompt = remove_placeholders(new_prompt, random_modifiers.keys())
        permutations.append(new_prompt)
    return permutations

def replace_first(text, search, replacement):
    # Helper function to replace only the first occurrence of a term
    if search not in text:
        raise ValueError(f"Term '{search}' not found in the prompt.")
    return text.replace(search, replacement, 1)

def remove_placeholders(text, placeholders):
    for placeholder in placeholders:
        text = text.replace(placeholder, "")
    return text

def save_to_file(output_path, permutations):
    with open(output_path, 'w') as file:
        for permutation in permutations:
            file.write(permutation + '\n')

def main(input_file_path, output_file_path):
    lines = load_file(input_file_path)
    if not lines:
        print("The input file is empty.")
        return
    prompt = lines[0]
    modifiers = [line.split(', ') for line in lines[1:] if not line.startswith('%')]
    random_modifiers = {}
    for line in lines[1:]:
        if line.startswith('%'):
            parts = line.split(', ')
            key = parts[0]
            values = parts[1:]
            random_modifiers[key] = values
    try:
        all_permutations = generate_permutations(prompt, modifiers, random_modifiers)
        save_to_file(output_file_path, all_permutations)
        print(f"Generated prompts have been saved to {output_file_path}")
        print(f"Total number of permutations: {len(all_permutations)}")
    except ValueError as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    input_file_path = 'test2.txt'  # Change this to the path of your input file
    output_file_path = 'output.txt'  # Change this to the desired path for the output file
    main(input_file_path, output_file_path)
So, in essence, anything with a % at the beginning of the line will be processed differently and the term will be matched up. As always, I guarantee nothing.
| Posted in Programming | Comments Off on StableDiffusion Permutation Script Update
Pong-2024
I was bored and made a quick Pong game. It’s not great, not terribly well finished but I wanted to see how good the tools are these days. It’s been a while since I wrote a game. It was fun to make. Give it a shot.
https://smackaay.com/webgames/pong2024/index.html
It’s output in HTML5 so no installation is required.
05.25.24 Calgary Zoo and Torrington Gopher Museum
Last weekend we decided to go with my parents for a quick trip to our neighbors to the south and visit the Calgary Zoo. It was a big place. Lots of cool animals, nice facilities. Everything was pretty good.
Here are a few images from the zoo as well. It was overcast for the most part, so it wasn’t great for photography, but it was ok.
So, we went to CrossIron Mills and New Horizons Mall. I quite liked New Horizons. If I want to go to one of the 3 quintillion trash stores, I can find them at CrossIron Mills. If I want to see some smaller businesses where they sell things the owners know and give a crap about, I’ll go to New Horizons. Unfortunately it doesn’t seem to be terribly successful, even a few years after opening. Oddly enough, I think the location is wrong for that kind of business style. But whatever.
We took a detour to Torrington, where they have the WORLD FAMOUS GOPHER HOLE MUSEUM!!! It’s a charming little museum with dioramas of taxidermy gophers in various settings. Cute in its own way. It was surprisingly busy, and they seem to really care about their museum and the town proper. Very cool.
Anyways, http://worldfamousgopherholemuseum.ca/ is a fun little place. Go check it out!
| Posted in Personal stuff | Comments Off on Calgary Zoo and Torrington Gopher Museum
Resolutions for SD image generation
When making images with StableDiffusion, it’s best to keep the aspect ratio in mind and make the image fit into the total number of pixels that the model was trained on. This results in the best images for that given model. So, for SDXL it’s 1024×1024; for others it may be 768×768 or even 512×512. Here is a list of X and Y values that give roughly the same total pixel count at the most common aspect ratios, for each training size. Obviously you would swap the values if you want portrait (y/x) instead.
1024x1024
Aspect Ratio 4:3 - Resolution: 1182x886
Aspect Ratio 16:9 - Resolution: 1365x768
Aspect Ratio 21:9 - Resolution: 1564x670
Aspect Ratio 1:1 - Resolution: 1024x1024
Aspect Ratio 3:2 - Resolution: 1254x836
Aspect Ratio 5:4 - Resolution: 1144x915
Aspect Ratio 16:10 - Resolution: 1295x809
Aspect Ratio 2:1 - Resolution: 1448x724
Aspect Ratio 18:9 - Resolution: 1448x724
Aspect Ratio 32:9 - Resolution: 1930x543
Aspect Ratio 3:1 - Resolution: 1773x591
Aspect Ratio 4:1 - Resolution: 2048x512
Aspect Ratio 5:3 - Resolution: 1321x793
768x768
Aspect Ratio 4:3 - Resolution: 886x665
Aspect Ratio 16:9 - Resolution: 1024x576
Aspect Ratio 21:9 - Resolution: 1173x502
Aspect Ratio 1:1 - Resolution: 768x768
Aspect Ratio 3:2 - Resolution: 940x627
Aspect Ratio 5:4 - Resolution: 858x686
Aspect Ratio 16:10 - Resolution: 971x607
Aspect Ratio 2:1 - Resolution: 1086x543
Aspect Ratio 18:9 - Resolution: 1086x543
Aspect Ratio 32:9 - Resolution: 1448x407
Aspect Ratio 3:1 - Resolution: 1330x443
Aspect Ratio 4:1 - Resolution: 1536x384
Aspect Ratio 5:3 - Resolution: 991x594
512x512
Aspect Ratio 4:3 - Resolution: 591x443
Aspect Ratio 16:9 - Resolution: 682x384
Aspect Ratio 21:9 - Resolution: 782x335
Aspect Ratio 1:1 - Resolution: 512x512
Aspect Ratio 3:2 - Resolution: 627x418
Aspect Ratio 5:4 - Resolution: 572x457
Aspect Ratio 16:10 - Resolution: 647x404
Aspect Ratio 2:1 - Resolution: 724x362
Aspect Ratio 18:9 - Resolution: 724x362
Aspect Ratio 32:9 - Resolution: 965x271
Aspect Ratio 3:1 - Resolution: 886x295
Aspect Ratio 4:1 - Resolution: 1024x256
Aspect Ratio 5:3 - Resolution: 660x396
So, if for some reason you need to calculate this on your own for some future or past resolution, here is the Python.
from sympy import symbols, Eq, solve

# Define symbols
x, y = symbols('x y')

# Total pixel count the model was trained on (e.g. 768*768 or 1024*1024)
total_pixels = 512*512

# List of common aspect ratios as tuples (width, height)
aspect_ratios = [
    (4, 3), (16, 9), (21, 9), (1, 1), (3, 2),
    (5, 4), (16, 10), (2, 1), (18, 9), (32, 9),
    (3, 1), (4, 1), (5, 3)
]

# Iterate over the aspect ratios and solve the equations
resolutions = []
for width_ratio, height_ratio in aspect_ratios:
    # Equation 1: total pixel count remains constant
    eq1 = Eq(x * y, total_pixels)
    # Equation 2: aspect ratio
    eq2 = Eq(x / y, width_ratio / height_ratio)
    # Solve the equations
    solution = solve((eq1, eq2), (x, y))
    # Extract the resolution and convert to positive integers
    resolution = (abs(int(solution[0][0])), abs(int(solution[0][1])))
    resolutions.append((width_ratio, height_ratio, resolution))

# Print the results
for width_ratio, height_ratio, resolution in resolutions:
    print(f"Aspect Ratio {width_ratio}:{height_ratio} - Resolution: {resolution[0]}x{resolution[1]}")
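If you don't have sympy installed, the two equations also solve by hand: with x*y = N and x/y = a/b you get x = sqrt(N*a/b) and y = sqrt(N*b/a). Here is a minimal sketch of that closed form using only the math module. The function name fit_resolution and the multiple parameter are my own additions for illustration. The rounding option is there because, as far as I know, many SD front ends want dimensions divisible by 8; treat that as an assumption and check your own tool.

```python
import math

def fit_resolution(total_pixels, width_ratio, height_ratio, multiple=1):
    """Width and height for an aspect ratio at roughly total_pixels."""
    # From x*y = N and x/y = a/b: x = sqrt(N*a/b), y = sqrt(N*b/a)
    width = math.sqrt(total_pixels * width_ratio / height_ratio)
    height = math.sqrt(total_pixels * height_ratio / width_ratio)
    # Truncate to whole pixels, then round down to the requested multiple
    width = int(width) // multiple * multiple
    height = int(height) // multiple * multiple
    return width, height

if __name__ == "__main__":
    for ratio in [(4, 3), (16, 9), (1, 1)]:
        w, h = fit_resolution(1024 * 1024, *ratio)
        print(f"Aspect Ratio {ratio[0]}:{ratio[1]} - Resolution: {w}x{h}")
    # Same 16:9 case rounded down to multiples of 8
    print(fit_resolution(1024 * 1024, 16, 9, multiple=8))
```

With multiple left at 1 this truncates the same way the table above does, so 16:9 at 1024×1024 comes out as 1365x768. With multiple=8 the width drops to 1360.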
As always, I guarantee nothing. Enjoy.
| Posted in Miscellaneous stuff, Programming | Comments Off on Resolutions for SD image generation
| Posted in Programming | Comments Off on Ripping an image based PDF to text (and old TurboBASIC commands)