Python + PDFs = Data Delight: Streamline Your Workflow!

Handling PDF files.

Nowadays the Portable Document Format (PDF) is one of the most commonly used data formats. The variaty of available solutions for Python-related PDF tools, modules, and libraries is really wide.

PDF files processing with Python.

Coding: The Silent Symphony. - Updated: 2024-05-18 by Andrey BRATUS, Senior Data Analyst.

PyMuPDF (aka fitz) is a lightweight PDF and XPS viewer which is known for its top performance and high rendering quality, is used in examples below.

With PyMuPDF you can access files with extensions like ".pdf", ".xps", ".oxps", ".cbz", ".fb2" or ".epub". In addition, about 10 popular image formats can also be handled like documents: ".png", ".jpg", ".bmp", ".tiff", etc...
We will also use FPDF (“Free”-PDF) library for PDF document generation under Python and tabula-py, which can read tables in a PDF.

Scraping a text from PDF file:

import fitz

with"jokesonbratusnet.pdf") as pdf:
  text = ''
  for page in pdf:
    text = text + page.get_text()

OUT: Welcome to the jokes directoty and humor catalogue of the constantly updated internet fresh jokes database full of funny stuff !!!

Scraping a table from PDF file and writing to CSV file:

Important: first you need to install tabula by - pip install tabula-py !!!
Tabula will create pandas dataframe which then will be written to csv.

import tabula

table = tabula.read_pdf('weather.pdf', pages=1)

# print(type(table[0]))

table[0].to_csv('output.csv', index=None)

print('csv was created !!!')

OUT: csv was created !!!

Creating simple PDF document:

This simple script creates pdf file using your input like logo image, title and text.
Important: first you need to install fpdf by - pip install fpdf !!!

from fpdf import FPDF 

pdf = FPDF(orientation='P', unit='pt', format='A4')

pdf.image('logo.jpeg', w=60, h=35)

pdf.set_font(family='Times', style='B', size=24)
pdf.cell(w=0, h=50, txt="Hot Jokes catalogue", align='C', ln=1)

pdf.set_font(family='Times', style='B', size=14)
pdf.cell(w=0, h=15, txt='About this page:', ln=1)

pdf.set_font(family='Times', size=12)
txt1 = """Our goal is to create biggest and most exclusive jokes and puns collection database on the internet to share it with everybody for free.

Browse and search our latest and most popular jokes and puns organized by topics. But please be warned...some of our jokes are so dark I'm surprised that they haven't been shot by the police."""
pdf.multi_cell(w=0, h=15, txt=txt1)

pdf.set_font(family='Times', style='B', size=14)
pdf.cell(w=100, h=25, txt='Author:')

pdf.set_font(family='Times', size=14)
pdf.cell(w=100, h=25, txt='Bratus Andrey', ln=1)

pdf.set_font(family='Times', style='B', size=14)
pdf.cell(w=100, h=25, txt='Source:')

pdf.set_font(family='Times', size=14)
pdf.cell(w=100, h=25, txt='', ln=1)


print('Your PDF is created.')

OUT: Your PDF is created.

See also related topics: