I have a scanned PDF document containing images with pixelated text. OCR has extracted the text, so I can copy it, but it looks jagged and blurry in the scanned images and is hard to read. I want to improve the text quality and convert it into a crisp, clear, non-pixelated digital font or vector format so it retains its clarity and smoothness. What steps and tools can I use to enhance the pixelated text in the scanned PDF images? Is there any specific software or technique that can help me achieve this? And what are the best practices for converting the improved text into a high-quality digital font or vector format? Any guidance or recommendations on image editing software, font digitization tools, or suitable workflows would be greatly appreciated. Thank you!
asked Jul 3, 2023 at 13:49

Comment (Jul 3, 2023 at 15:37): Ah, I think you'll find that (as the original poster noted) OCR is an abbreviation for Optical Character Recognition, not One Char Replacement. I doubt the OCR process has made the text pixelated and blurry; I imagine the original scanned image was like that, though without a file to look at it's impossible to say for sure. Essentially, though, I agree with KJ that once the damage is done (a poor-quality scan) you can't do much to improve it. You could upscale with some kind of anti-aliasing to get a smoother result, but it's never going to be great.
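If you want to try that anti-aliased upscaling, here is a minimal sketch using Pillow; scan.png is only a placeholder name for a page exported from the PDF, and the 3x factor is an arbitrary choice:

```python
# Hedged sketch of "upscale with some kind of anti-aliasing" using Pillow.
# scan.png is a placeholder name for a page exported from the scanned PDF.
from PIL import Image

img = Image.open("scan.png")

# Lanczos resampling smooths the enlarged pixels rather than duplicating them,
# but it cannot recover detail that the original scan never captured.
upscaled = img.resize((img.width * 3, img.height * 3), Image.LANCZOS)
upscaled.save("scan_upscaled.png")
```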
Comment (Jul 3, 2023 at 16:10): From the posted image, it looks like the scan has used JPEG compression (the back page also seems to shine through), and that has the usual lossy compression artefacts. Your best bet is to rescan the content and use lossless compression. Alternatively, convert to black and white and adjust the histogram so that anything lighter than a quite dark threshold is mapped to white and the remainder is mapped to black, then apply something like a 'sharpen' filter; that might improve matters. Of course, if there are any colour illustrations, those will be totally broken by the process.
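If you want to experiment with that black-and-white conversion before rescanning, a rough Pillow sketch of the threshold-then-sharpen idea might look like this; scan.jpg is a placeholder name and the threshold of 120 is just a starting value to tune:

```python
# Rough sketch of the binarise-and-sharpen suggestion above (Pillow assumed).
# scan.jpg and the threshold value are placeholders to adjust for your scan.
from PIL import Image, ImageFilter

img = Image.open("scan.jpg").convert("L")   # convert to greyscale first

# Map everything lighter than the threshold to white and the rest to black.
THRESHOLD = 120
bw = img.point(lambda p: 255 if p > THRESHOLD else 0)

# A mild sharpening pass, as suggested in the comment.
bw = bw.filter(ImageFilter.SHARPEN)
bw.save("scan_bw.png")
```

As the comment warns, any colour illustrations on the page will be destroyed by this, so apply it only to text-only pages or regions.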
You have two issues:
1. You need to significantly increase the resolution of your scans.
2. I'm guessing you have 'Image over Text', which presents the file looking just like the original scan, but with hidden 'real' selectable text underneath. In a PDF reader it will look like this, with some text selected; the actual selection is not of the image, but of the hidden text underneath.
If you flip the presentation order to 'Text over Image' then you would instead see this…
Still terrible, because your scan is not properly readable [see issue 1].
If you save as Text only, you would then see this…
I've enlarged this one so you can see that, though it's total garbage, it's at least sharp garbage. This is now entirely vector, with no raster image at all, so it will always be sharp.
So fixing issue 1 will then allow you to change the setting in issue 2 so as to produce [legible] vector-based PDFs.
If you need to also preserve images, then you need to choose whether Image over Text or Text over Image looks best. Test a few pages of each type.
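If you end up redoing the OCR yourself rather than inside your scanning software, here is a minimal sketch of both output styles using the Tesseract engine via pytesseract (both assumed to be installed); page.png is a placeholder for one exported page:

```python
# Hedged sketch: produce either an "image with hidden text" PDF or plain text
# from one scanned page, using Tesseract via the pytesseract wrapper.
import pytesseract
from PIL import Image

page = Image.open("page.png")   # placeholder name for an exported page

# 'Image over Text' style: the scanned image on top, with a hidden,
# selectable text layer underneath.
pdf_bytes = pytesseract.image_to_pdf_or_hocr(page, extension="pdf")
with open("page_searchable.pdf", "wb") as f:
    f.write(pdf_bytes)

# Text-only output: pure text that any editor or PDF writer will render
# with real (vector) fonts, so it stays sharp even if recognition is imperfect.
print(pytesseract.image_to_string(page))
```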
answered Jul 4, 2023 at 11:56

One of the best ways to improve a scanned source is to use the original again, so here is that same area as seen by a 200 DPI TIFF fax machine, where we are at the limits of recognising words.
However, there is no need to fixate on resolution alone. Here is the original screen capture at a lower 96 DPI density; it looks better because it is pure colour tones, without any JPEG artefacts or bleed-through to confuse an OCR engine.

The problem is that, once captured, that 96 DPI looks like this in a computer program.

However, since it is clean, it works well in an online OCR (pixels-to-words, sharp vector characters) processor, and it would do better still at a higher density such as 192 DPI.
You may complain, "Unfair, you used a clean source scan to illustrate your point", and that is the whole point: a bad, lossy JPEG scan is nowhere near as likely to produce a meaningful result as a good fresh scan, even a lower-density, PNG-style one.
Going back to resolution, there is a problem area: here at 192 DPI the text is not clearly readable as single characters (OCR attempts to recognise characters one by one, then detects a word from those).
But if scanned at 600 DPI the text is clearly made of single characters. The OCR will still make mistakes, but fewer of them; for example, 'i m' may still be seen as a single 'W'.
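On a related note, OCR engines usually need to know (or correctly guess) the scan's real resolution for their character segmentation to work; with Tesseract you can pass it explicitly. A minimal, hedged sketch via pytesseract, with page600.png as a placeholder for a 600 DPI scan:

```python
# Sketch only: tell Tesseract the true scan resolution so its character
# segmentation is not thrown off. page600.png is a placeholder file name.
import pytesseract
from PIL import Image

scan = Image.open("page600.png")
text = pytesseract.image_to_string(scan, config="--dpi 600")
print(text)
```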
So now, if we use your source, we can see that even cleaned up it will be prone to failure.

Single characters will either be ignored or misread, so it is essential to run a spell checker over the results in an editor.
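That spell-check pass can also be scripted; a rough sketch using the pyspellchecker package (an assumption, not something the answer above relies on), with ocr_output.txt as a placeholder for the saved OCR text:

```python
# Rough sketch: flag likely OCR misreads with the pyspellchecker package.
# ocr_output.txt is a placeholder name for the saved OCR result.
from spellchecker import SpellChecker

spell = SpellChecker()

with open("ocr_output.txt", encoding="utf-8") as f:
    words = f.read().split()

# unknown() returns the words not found in the dictionary; correction()
# suggests the most likely intended word for each of them.
for word in spell.unknown(words):
    print(f"{word!r} -> {spell.correction(word)!r}")
```

This only flags candidates; the garbled single characters described above still need a human eye.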
As to the quality of displaying letters as vectors, this depends on the OCR application. This one has tidied up the words for accessibility readers (still a few problems, as described above) and generated the characters in a font suited to display as vectors (much like the Word conversion), but the errors will be just as noticeable, because the source image is not overlaid here.
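If you want to check what actually ended up in that vector text layer, you can extract it back out of the converted PDF; a minimal sketch assuming the pypdf package, with result.pdf as a placeholder output name:

```python
# Sketch: pull the recognised text back out of the converted PDF so the
# OCR errors described above can be reviewed. result.pdf is a placeholder.
from pypdf import PdfReader

reader = PdfReader("result.pdf")
for number, page in enumerate(reader.pages, start=1):
    print(f"--- page {number} ---")
    print(page.extract_text())
```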