Handling PDF files.
The Portable Document Format (PDF) is one of the most commonly used document formats, and Python offers a wide variety of tools, modules, and libraries for working with it.
Updated: 2024-09-12 by Andrey BRATUS, Senior Data Analyst.
Scraping text from a PDF file:
Scraping a table from a PDF file and writing it to a CSV file:
Creating a simple PDF document:
PyMuPDF (imported as fitz) is a Python binding for MuPDF, a lightweight PDF and XPS viewer known for its fast performance and high rendering quality. It is used in the text-extraction example below.
With PyMuPDF you can open files with extensions such as ".pdf", ".xps", ".oxps", ".cbz", ".fb2", or ".epub". In addition, about 10 popular image formats, including ".png", ".jpg", ".bmp", and ".tiff", can be handled like documents.
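As a quick illustration of the extension list above, a script could check a file's suffix before handing the file to PyMuPDF. This is only a sketch: the helper name is_listed_format and the set LISTED_EXTENSIONS are made up for this example, and the set holds only the extensions named in this article, not PyMuPDF's full list of supported formats.

```python
from pathlib import Path

# Extensions named in this article only (an assumption for illustration);
# PyMuPDF itself accepts more image formats than the four listed here.
LISTED_EXTENSIONS = {
    ".pdf", ".xps", ".oxps", ".cbz", ".fb2", ".epub",
    ".png", ".jpg", ".bmp", ".tiff",
}


def is_listed_format(path):
    """Return True if the file suffix is one of the extensions listed above."""
    # lower() makes the check case-insensitive, so "SCAN.PNG" also matches
    return Path(path).suffix.lower() in LISTED_EXTENSIONS
```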
We will also use FPDF (“Free”-PDF), a library for generating PDF documents in Python, and tabula-py, which reads tables from a PDF.
Scraping text from a PDF file:
import fitz
# open the document and collect the text of every page
with fitz.open("jokesonbratusnet.pdf") as pdf:
    text = ''
    for page in pdf:
        text = text + page.get_text()
print(text)
OUT: Welcome to the jokes directoty and humor catalogue of the constantly updated internet fresh jokes database full of funny stuff !!!
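The loop above builds the result by repeated string concatenation. A slightly more reusable variant, sketched below, collects the text of each page and joins it once. The function names join_page_text and extract_pdf_text are hypothetical; the only PyMuPDF calls used are fitz.open and page.get_text, the same ones as in the example above.

```python
def join_page_text(pages, separator="\n"):
    """Join the text of every page-like object that offers get_text()."""
    return separator.join(page.get_text() for page in pages)


def extract_pdf_text(path):
    """Open a document with PyMuPDF and return the text of all pages."""
    import fitz  # imported here so join_page_text works without PyMuPDF

    with fitz.open(path) as pdf:
        # iterating over the document yields its pages in order
        return join_page_text(pdf)
```

Calling extract_pdf_text("jokesonbratusnet.pdf") should give the same text as the loop above, apart from the separator inserted between pages.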
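Scraping a table from a PDF file and writing it to a CSV file:
Important: install tabula-py first with pip install tabula-py. It wraps the Java-based Tabula tool, so a Java runtime must also be available.
tabula.read_pdf returns a list of pandas DataFrames, one per detected table; the first one is then written to CSV.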
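import tabula
# read_pdf returns a list of DataFrames, one per table found on page 1
tables = tabula.read_pdf('weather.pdf', pages=1)
# write the first table to CSV without the index column
tables[0].to_csv('output.csv', index=False)
print('csv was created !!!')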
OUT: csv was created !!!
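The example above saves only the first table. When a PDF contains several tables, read_pdf yields several DataFrames, and each could be written to its own CSV file, as sketched below. The helper names save_tables and pdf_tables_to_csv and the file-name pattern are assumptions made for this example; pages="all" is the tabula-py option for scanning every page.

```python
import os


def save_tables(tables, prefix="table", directory="."):
    """Write each DataFrame-like table to <prefix>_<n>.csv and return the paths."""
    paths = []
    for number, table in enumerate(tables, start=1):
        path = os.path.join(directory, f"{prefix}_{number}.csv")
        table.to_csv(path, index=False)  # index=False drops the row index column
        paths.append(path)
    return paths


def pdf_tables_to_csv(pdf_path, prefix="table", directory="."):
    """Read every table tabula-py finds in the PDF and save each one as CSV."""
    import tabula  # needs tabula-py and a Java runtime

    tables = tabula.read_pdf(pdf_path, pages="all")
    return save_tables(tables, prefix, directory)
```

Calling pdf_tables_to_csv("weather.pdf") would then produce table_1.csv, table_2.csv, and so on, one file per table found.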
Creating a simple PDF document:
This simple script creates a PDF file from your inputs: a logo image, a title, and text.
Important: install fpdf first with pip install fpdf.
from fpdf import FPDF
# portrait A4 page; all sizes below are in points
pdf = FPDF(orientation='P', unit='pt', format='A4')
pdf.add_page()
# logo image, 60 x 35 pt
pdf.image('logo.jpeg', w=60, h=35)
# centered title
pdf.set_font(family='Times', style='B', size=24)
pdf.cell(w=0, h=50, txt="Hot Jokes catalogue", align='C', ln=1)
# section heading, then the font for the body text
pdf.set_font(family='Times', style='B', size=14)
pdf.cell(w=0, h=15, txt='About this page:', ln=1)
pdf.set_font(family='Times', size=12)
txt1 = """Our goal is to create biggest and most exclusive jokes and puns collection database on the internet to share it with everybody for free.
Browse and search our latest and most popular jokes and puns organized by topics. But please be warned...some of our jokes are so dark I'm surprised that they haven't been shot by the police."""
# multi_cell wraps the long text across several lines
pdf.multi_cell(w=0, h=15, txt=txt1)
# label/value rows: bold label, then regular-weight value
pdf.set_font(family='Times', style='B', size=14)
pdf.cell(w=100, h=25, txt='Author:')
pdf.set_font(family='Times', size=14)
pdf.cell(w=100, h=25, txt='Bratus Andrey', ln=1)
pdf.set_font(family='Times', style='B', size=14)
pdf.cell(w=100, h=25, txt='Source:')
pdf.set_font(family='Times', size=14)
pdf.cell(w=100, h=25, txt='https://python-code.pro/', ln=1)
# write the document to disk
pdf.output('outputfile.pdf')
print('Your PDF is created.')
OUT: Your PDF is created.
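The script repeats the same set_font and cell calls for each label and value pair (Author, Source). If more rows are needed, that pattern could be wrapped in a small helper, sketched here. The function add_labeled_row is hypothetical and not part of FPDF; it only makes the set_font and cell calls with the same arguments as the script above.

```python
def add_labeled_row(pdf, label, value, width=100, height=25, size=14):
    """Write a bold label followed by a regular-weight value on one line."""
    pdf.set_font(family='Times', style='B', size=size)
    pdf.cell(w=width, h=height, txt=label)
    pdf.set_font(family='Times', size=size)
    # ln=1 moves the cursor to the start of the next line after the value
    pdf.cell(w=width, h=height, txt=value, ln=1)
```

With it, the two rows in the script reduce to add_labeled_row(pdf, 'Author:', 'Bratus Andrey') and add_labeled_row(pdf, 'Source:', 'https://python-code.pro/').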