deskew
deskew copied to clipboard
fails on empty pages - detected angle is too large
deskew 1.30 fails to deskew empty scanned pages with a grey page border
actual: the detected angle is too large, for example 9.535 instead of 0.5 degrees
expected: deskew should use the page border as horizontal and vertical lines or it should do nothing at all (noop) because there are no text lines
example image: 002.tiff.jpg
002.tiff.jpg
$ deskew -o 002.tiff.webp.deskew.jpg 002.tiff.jpg
Deskew 1.30 (2019-06-07) x64 by Marek Mauder
http://galfar.vevb.net/deskew/
Preparing input image (002.tiff.jpg [1855x2596/Gray8]) ...
Calculating skew angle...
Skew angle found [deg]: 9.535
Rotating image...
Saving output (002.tiff.webp.deskew.jpg [2260x2868/Gray8]) ...
Done!
002.tiff.jpg.deskew.jpg
note:
there seems to be a correlation between the maximum angle (the a parameter)
and the actual angle:
maximum_angle - actual_angle = expected_angle
so this may be easy to fix
3 - 2.535 = 0.465
4 - 3.535 = 0.465
5 - 4.535 = 0.465
6 - 5.535 = 0.465
10 - 9.535 = 0.465
$ for a in {1..10}; do echo "a: $a ->" $(deskew -a $a -o 002.tiff.jpg.deskew-angle-$a.jpg 002.tiff.jpg | grep '^Skew angle found'); done
a: 1 -> Skew angle found [deg]: 0.535
a: 2 -> Skew angle found [deg]: 1.505
a: 3 -> Skew angle found [deg]: 2.535
a: 4 -> Skew angle found [deg]: 3.535
a: 5 -> Skew angle found [deg]: 4.535
a: 6 -> Skew angle found [deg]: 5.535
a: 7 -> Skew angle found [deg]: 6.505
a: 8 -> Skew angle found [deg]: 7.535
a: 9 -> Skew angle found [deg]: 8.535
a: 10 -> Skew angle found [deg]: 9.535
empty scanned pages with a grey page border
it also fails on completely white images without any border
example image: 002.tiff.cropped.jpg
$ deskew 002.tiff.cropped.jpg -o 002.tiff.cropped.jpg.deskew.jpg | grep "^Skew angle found"
Skew angle found [deg]: -10.000
workaround: skip deskew on empty pages
065-remove-page-borders.py
#!/usr/bin/env python3
INPUT_DIR = "060-rotate-crop-level"
OUTPUT_DIR = "065-remove-page-borders"
# === Tuning parameters ===
BORDER_SIZE = 10 # pixels
"""
AI prompt:
create a python script to remove grey (or black) page borders from scanned images.
the pages are white with black text.
the pages are no perfect rectangles, rather crooked trapezes with crooked lines...
so the algorithm should "overcut" the pages:
it should cut at the inner-most page edge,
so where the page edge is further outside some white area from the page is removed.
the script should process an input directory with *.tiff images
and write output images to an output directory (same image format).
the input and output paths should be hard-coded in the script,
so the script takes no command-line arguments.
the script should be based on the PIL (pillow) image library
(and on the opencv and numpy libraries when necessary)
...
"""
import os
from PIL import Image
import numpy as np
import cv2
def order_points(pts):
rect = np.zeros((4, 2), dtype="float32")
s = pts.sum(axis=1)
diff = np.diff(pts, axis=1)
rect[0] = pts[np.argmin(s)] # top-left
rect[2] = pts[np.argmax(s)] # bottom-right
rect[1] = pts[np.argmin(diff)] # top-right
rect[3] = pts[np.argmax(diff)] # bottom-left
return rect
def process_image(in_path, out_path):
img = cv2.imread(in_path)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Threshold to isolate white page
_, mask = cv2.threshold(gray, 230, 255, cv2.THRESH_BINARY)
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.ones((5,5), np.uint8))
# Find contours
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
page_contour = max(contours, key=cv2.contourArea)
# Approximate contour to a quadrilateral
epsilon = 0.02 * cv2.arcLength(page_contour, True)
approx = cv2.approxPolyDP(page_contour, epsilon, True)
if len(approx) != 4:
print("Warning: contour approximation did not yield 4 points. Using convex hull instead.")
approx = cv2.convexHull(page_contour)
# Optionally select 4 corners from convex hull manually
# Extract the 4 corner points
pts = approx.reshape(4, 2)
# Order the points consistently
rect = order_points(pts)
# Compute perspective transform
# Compute width and height of new rectangle
widthA = np.linalg.norm(rect[2] - rect[3])
widthB = np.linalg.norm(rect[1] - rect[0])
maxWidth = max(int(widthA), int(widthB))
heightA = np.linalg.norm(rect[1] - rect[2])
heightB = np.linalg.norm(rect[0] - rect[3])
maxHeight = max(int(heightA), int(heightB))
# Destination points for the "straight" rectangle
dst = np.array([
[0, 0],
[maxWidth - 1, 0],
[maxWidth - 1, maxHeight - 1],
[0, maxHeight - 1]
], dtype="float32")
# Perspective transform
M = cv2.getPerspectiveTransform(rect, dst)
warped = cv2.warpPerspective(img, M, (maxWidth, maxHeight))
# Add internal white border
# to remove grey artifacts from cropping crooked page edges
# Create a white canvas of the same size
h, w = warped.shape[:2]
canvas = np.ones_like(warped) * 255 # white
# Copy the warped content inside the canvas, leaving a white border
b = BORDER_SIZE
canvas[b:h-b, b:w-b] = warped[b:h-b, b:w-b]
out_image = canvas
# Save the result
print(f"writing {out_path}")
cv2.imwrite(out_path, out_image)
def main():
os.makedirs(OUTPUT_DIR, exist_ok=True)
files = [f for f in sorted(os.listdir(INPUT_DIR)) if f.lower().endswith((".tif", ".tiff"))]
if not files:
print("No TIFF files found in", INPUT_DIR)
return
for f in files:
in_path = os.path.join(INPUT_DIR, f)
out_path = os.path.join(OUTPUT_DIR, f)
if os.path.exists(out_path): continue
try:
process_image(in_path, out_path)
except Exception as e:
print(f"Error processing {f}: {e}")
if __name__ == "__main__":
main()
067-find-empty-pages.sh
#!/usr/bin/env bash
src=065-remove-page-borders
dst=$(basename "$0" .sh)
mkdir -p $dst
t1=$(date --utc +%s)
num_pages=0
for i in $src/*; do
# FIXME use $num_pages and $scan_format
page_number=${i%.tiff}
page_number=${page_number##*/}
page_number=${page_number#0}
page_number=${page_number#0}
page_number=${page_number#0}
page_number=${page_number#0}
if ! lightness=$(magick "$i" -colorspace gray -format "%[fx:mean*100]" info:); then
# echo "error: failed to get lightness of image $i" >&2
lightness="-1"
fi
# echo "$(LC_ALL=C printf '%08.4f' $lightness) ${i##*/}"
echo "$(LC_ALL=C printf '%08.4f' $lightness) $page_number"
num_pages=$((num_pages + 1))
# [ "$page_number" = 10 ] && break # debug
done |
tee -a /dev/stderr |
sort -r -g \
>$dst.txt
t2=$(date --utc +%s)
echo "done $num_pages pages in $((t2 - t1)) seconds"
070-deskew.sh
#!/usr/bin/env bash
cd "$(dirname "$0")"
src=065-remove-page-borders
dst=$(basename "$0" .sh)
# empty pages:
# 100.0000
# 099.9999
# 099.9997
# ...
src_empty_pages_txt=067-find-empty-pages.txt
src_empty_pages_pattern='^(099\.999|100\.0000)'
mkdir -p $dst
t1=$(date --utc +%s)
num_pages=0
# array
empty_pages=(
$(grep -E "$src_empty_pages_pattern" "$src_empty_pages_txt" | cut -c10- | sort -n)
)
if [ ${#empty_pages[@]} != 0 ]; then
echo skipping deskew on empty pages: ${empty_pages[@]}
fi
# dict
declare -A is_empty_page
for page_number in ${empty_pages[@]}; do is_empty_page[$page_number]=1; done
for i in $src/*; do
# FIXME use $num_pages and $scan_format
page_number=${i%.tiff}
page_number=${page_number##*/}
page_number=${page_number#0}
page_number=${page_number#0}
page_number=${page_number#0}
page_number=${page_number#0}
o=$dst/${i##*/}
[ -e "$o" ] && continue
if [ "${is_empty_page[$page_number]}" = 1 ]; then
echo skipping deskew on empty page $page_number
cp "$i" "$o"
continue
fi
deskew_args=(deskew -o "$o")
# add white background
deskew_args+=(-b FFFFFF)
# -a angle: Maximal expected skew angle (both directions) in degrees (default: 10)
# expected angle is -0.5 or +0.5
# deskew_args+=(-a 1)
deskew_args+=("$i")
echo + "${deskew_args[@]}"
"${deskew_args[@]}"
num_pages=$((num_pages + 1))
# [ "$page_number" = 10 ] && break # debug
done
t2=$(date --utc +%s)
echo "done $num_pages pages in $((t2 - t1)) seconds"
see also my hocr-files-template-repo