How can I improve the quality of pixelated text in scanned PDF images and convert it into non-pixelated, high-quality digital text?

I have a scanned PDF document containing images with pixelated text. The OCR process has extracted the text, but it appears low quality and pixelated. I want to convert this pixelated text into a high-quality digital font or vector format, so it retains its clarity and smoothness.

I have already attempted optical character recognition (OCR) and can copy the text, but it lacks the desired quality. The text in the scanned images looks jagged and blurry, making it challenging to read. I want to improve the text quality and convert it into a digital font or vector format that is crisp, clear, and non-pixelated.

What steps and tools can I use to enhance the pixelated text in the scanned PDF images? Is there any specific software or technique that can help me achieve this? Additionally, what are the best practices for converting this improved text into a high-quality digital font or vector format? Any guidance or recommendations on image editing software, font digitization tools, or suitable workflows would be greatly appreciated. Thank you!

[Images: a page of the scanned PDF file; digital PDF]

asked Jul 3, 2023 at 13:49

Ah, I think you'll find that (as the original poster noted) OCR is an abbreviation for Optical Character Recognition, not One Char Replacement. I doubt the OCR process has made the text pixellated and blurry; I imagine the original scanned image was like that, though without a file to look at, it's impossible to say for sure. Essentially, though, I agree with KJ that once the damage is done (poor quality scan) you can't do much to improve it. You could upscale with some kind of anti-aliasing to get a smoother result, but it's never going to be great.

Commented Jul 3, 2023 at 15:37
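To make the "upscale with anti-aliasing" idea concrete, here is a minimal sketch using bilinear interpolation on a greyscale image held as plain Python lists. The function name `upscale_bilinear` and the list-of-rows representation are assumptions for illustration only; an image editor or imaging library would normally do this step. Note that it only smooths the jagged edges, it cannot recover detail the scan never captured.

```python
def upscale_bilinear(gray, factor):
    """Upscale a greyscale image (rows of 0-255 ints) by an integer factor.

    Each output pixel is a weighted blend of its four nearest source pixels,
    which is what produces the softer, anti-aliased edges.
    """
    src_h, src_w = len(gray), len(gray[0])
    out = []
    for y in range(src_h * factor):
        # Map the centre of the output pixel back into source coordinates.
        sy = min(max((y + 0.5) / factor - 0.5, 0.0), src_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, src_h - 1)
        fy = sy - y0
        row = []
        for x in range(src_w * factor):
            sx = min(max((x + 0.5) / factor - 0.5, 0.0), src_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, src_w - 1)
            fx = sx - x0
            top = gray[y0][x0] * (1 - fx) + gray[y0][x1] * fx
            bottom = gray[y1][x0] * (1 - fx) + gray[y1][x1] * fx
            row.append(round(top * (1 - fy) + bottom * fy))
        out.append(row)
    return out


# A hard black-to-white edge becomes a gradual ramp after upscaling.
edge = [[0, 255]]
smooth = upscale_bilinear(edge, 4)
```

The ramp of intermediate greys is the "smoother result" mentioned above: easier on the eye, but no sharper in terms of real information.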

From the posted image, it looks like the scan has used JPEG compression (also the back page seems to shine through), and that has the usual lossy compression artefacts. Your best bet is to rescan the content and use a lossless compression. Alternatively, converting to black and white and adjusting the histogram so that anything lighter than a quite dark value is mapped to white and the remainder is mapped to black, then applying something like a 'sharp' filter, might improve matters. Of course, if there are any colour illustrations, those will be totally broken by the process.

Commented Jul 3, 2023 at 16:10
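The black-and-white mapping described in the comment above amounts to a simple threshold. A minimal sketch, assuming the page is already greyscale and stored as rows of 0-255 values (0 = black); the name `binarize` and the threshold of 96 are illustrative choices, not values taken from the thread, and the follow-up sharpen filter is not shown:

```python
def binarize(gray, threshold=96):
    """Map pixels darker than `threshold` to black (0) and the rest to white (255).

    A low ("quite dark") threshold wipes out faint show-through from the back
    of the page and light JPEG noise, keeping only the darkest ink.
    """
    return [[0 if value < threshold else 255 for value in row] for row in gray]


# 40 = printed ink, 120 = show-through from the back page, 200 = paper.
scan_row = [[40, 120, 200]]
cleaned = binarize(scan_row)
```

In practice the threshold has to be tuned per document: too low and thin strokes disappear, too high and the show-through comes back as speckles.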

2 Answers

You have two issues…

1. Your source image is far too low-quality to successfully OCR. Even cleaned up in Photoshop & switched to black & white, a human can read this, but a machine can't.
[More advanced AI may be able to. This is 'regular' OCR - ReadIris, a few years old now, was free with an HP Printer.]

You need to significantly increase the resolution of your scans.

2. You're saving your PDF the 'wrong way up'. Most OCR software has options for PDF, determining how the PDF should be presented.

I'm guessing you have 'Image over Text' which will present the file looking just like the original scan, but with hidden 'real' selectable text underneath. In a PDF reader it will look like this, with some text selected. The actual selection is not of the image, but of the hidden text underneath.

If you flip the presentation order to 'Text over Image' then you would instead see this…

Still terrible, because your scan is not properly readable [from issue 1.]

If you save as Text only, you would then see this…

I've enlarged this one so you can see that - though it's total garbage - it's at least sharp garbage. This is now entirely vector, no raster image at all, so it will always be sharp.

So, fixing issue 1 will then allow you to change issue 2 in order to preserve [legible] vector-based PDFs.

If you need to also preserve images, then you need to choose whether Image over Text or Text over Image looks best. Test a few pages of each type.

answered Jul 4, 2023 at 11:56

One of the best ways to improve a scanned source is to go back to the original and scan it again. Here is that area as seen by a 200 DPI TIFF fax machine, where we are at the limit for recognising words.

enter image description here

However, there should be no fixation on resolution. Here is the original screen at a lower 96 DPI density. It looks better for being pure colour tones, without any JPEG artefacts or bleed-through to confuse an OCR device.

enter image description here

The problem is that, when captured, that 96 DPI looks like this in a computer program:

enter image description here

However, since it is clean, it works well in an online OCR processor (pixels to words as sharp vector characters), though the result will be better at a higher density such as 192 DPI.

enter image description here

So you may complain, "Unfair, you used a clean source scan to illustrate your point", and that is the whole point: a lousy JPEG scan is nowhere near as likely to produce a meaningful result as a good fresh scan, even a lower-density one in a PNG style.

Going back to resolution, there is a problem area: here, at 192 DPI, the text is not clearly readable as single characters. (OCR will attempt to replace characters one by one, then detect a word from those.)

enter image description here

But if scanned at 600 DPI, the text is clearly single characters.

enter image description here

The OCR will still make mistakes, but fewer of them; for example, "i m" may still be seen as a single "W".
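Why touching characters cause trouble can be shown with a deliberately simplified sketch. It assumes a naive segmenter that splits a line of text wherever it finds a completely blank pixel column; this is not how any particular OCR engine works, and the name `split_columns` and the tiny bitmaps are made up for illustration.

```python
def split_columns(bitmap):
    """Return (start, end) column spans of ink separated by blank columns.

    `bitmap` is a list of rows of 0/1 values, where 1 means ink.
    Each span is one candidate "character" for the recogniser.
    """
    width = len(bitmap[0])
    has_ink = [any(row[x] for row in bitmap) for x in range(width)]
    spans = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                    # a new glyph begins
        elif not ink and start is not None:
            spans.append((start, x))     # blank column closes the glyph
            start = None
    if start is not None:
        spans.append((start, width))     # glyph runs to the right edge
    return spans


# High resolution: a blank column separates the two glyphs.
sharp = [[1, 0, 1, 1, 1],
         [1, 0, 1, 0, 1]]
# Low resolution or blur: the gap has filled in, so the glyphs touch.
merged = [[1, 1, 1, 1, 1],
          [1, 0, 1, 0, 1]]

sharp_glyphs = split_columns(sharp)
merged_glyphs = split_columns(merged)
```

With the gap intact the segmenter finds two candidates; once the gap is filled it sees a single wide shape, which is the kind of situation where two narrow letters get read as one wide one.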

So now, if we use your source, we can see that even cleaned up it will be prone to fail:

enter image description here

Single characters will be either ignored or misread, so it is essential to run an editor's spell checker on the results.
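As a rough illustration of that spell-checking pass, the sketch below flags OCR words that are not in a word list and proposes the closest known word using Python's standard `difflib`. The function name `suggest_corrections`, the tiny vocabulary, and the 0.7 similarity cutoff are all assumptions; a real editor's spell checker uses a full dictionary and still needs a human to confirm each change.

```python
import difflib
import re


def suggest_corrections(text, vocabulary, cutoff=0.7):
    """Map each unknown word in `text` to its closest vocabulary word (or None)."""
    vocab = sorted(word.lower() for word in vocabulary)
    known = set(vocab)
    suggestions = {}
    for word in re.findall(r"[A-Za-z]+", text):
        lower = word.lower()
        if lower in known:
            continue  # spelled as expected, nothing to flag
        matches = difflib.get_close_matches(lower, vocab, n=1, cutoff=cutoff)
        suggestions[word] = matches[0] if matches else None
    return suggestions


# "rn" misread in place of "m" is a classic OCR confusion.
words = {"a", "modern", "model", "scanner"}
flagged = suggest_corrections("a rnodern scanner", words)
```

Words that come back as `None` have no close match in the list and are the ones most worth checking against the original page.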

enter image description here

Finally, as to the quality of displaying letters as vectors: this depends on the OCR application. This one has tidied up the words for accessibility readers (still with a few problems, as described above) and generated the characters in a font suited to display as vectors (much like the Word conversion). The errors will be just as noticeable, though, because the source image is not overlaid here.