-
@bitsgalore May I suggest adding OCRmyPDF (https://ocrmypdf.readthedocs.io)? It is a wrapper around other Python libraries, and besides text extraction (via Tesseract) it is extremely good at PDF optimization (https://ocrmypdf.readthedocs.io/en/latest/optimizer.html) and conversion to PDF/A. Its JBIG2 encoding (https://ocrmypdf.readthedocs.io/en/latest/jbig2.html) also works very well for images of scanned text.
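For illustration, here is a minimal sketch of how such a run could be scripted from Python by calling the `ocrmypdf` command-line tool. The helper names `build_ocrmypdf_command` and `run_ocrmypdf` and the file names are hypothetical, made up for this example. The flags `--language`, `--optimize` and `--output-type` are the ones I know from the OCRmyPDF CLI. As far as I know, JBIG2 compression is only applied when a JBIG2 encoder is installed on the system, so treat that part as an assumption to check against the docs linked above.

```python
import shutil
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, optimize=1, pdfa=True, language="eng"):
    """Return the argument list for an ocrmypdf run (nothing is executed here).

    Hypothetical helper for illustration; it is not part of OCRmyPDF itself.
    """
    if optimize not in (0, 1, 2, 3):
        raise ValueError("optimize must be 0, 1, 2 or 3")
    # "pdfa" requests PDF/A output, "pdf" a regular PDF.
    output_type = "pdfa" if pdfa else "pdf"
    return [
        "ocrmypdf",
        "--language", language,       # Tesseract language code
        "--optimize", str(optimize),  # 0 disables optimization, 3 is the most aggressive
        "--output-type", output_type,
        input_pdf,
        output_pdf,
    ]


def run_ocrmypdf(input_pdf, output_pdf, **options):
    """Run ocrmypdf if it is available on PATH, raising if it is missing or fails."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf, **options), check=True)
```

Something like `run_ocrmypdf("scan.pdf", "scan_ocr.pdf", optimize=2)` would then OCR the scan, optimize the images and write a PDF/A file. Building the argument list separately from running it makes it easy to inspect or log the exact command before anything touches the files.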