This script is designed to break a PDF document into individual sections and save each section as a separate text file. Follow the steps below to use the script:
Ensure you have pdfplumber installed. You can install it using pip:
pip install pdfplumber
Make sure your Python environment is set up and running. Save the script provided into a .py file, for example, pdf_parser.py.
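To confirm that the package is visible to the interpreter that will run the script, it can be looked up without importing it. This is a minimal sketch using only the standard library; the helper name is_installed is illustrative and not part of the script.

```python
import importlib.util


def is_installed(package_name):
    """Return True if this interpreter can find the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_installed("pdfplumber"):
        print("pdfplumber is available.")
    else:
        print("pdfplumber was not found; try: pip install pdfplumber")
```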
Place the PDF document you want to parse in the same directory as the script or provide the full path to the document in the script.
Open the script file pdf_parser.py in a text editor. Modify the following variables to match your PDF document:
pdf_document = "your_pdf_document.pdf" # Replace with your PDF filename
title_of_pdf = "Your PDF Title" # Replace with your PDF title
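start_to_stop_pages_of_sections = [
    # Each entry is [first page, last page, section name].
    # Pages are counted from 1 and the last page is included.
    # Update the page ranges and section names as needed
    [3, 10, "Table of Contents"],
    [11, 20, "Introduction"],
    [21, 30, "Chapter 1"],
    # Add more sections as required
]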
In your terminal or command prompt, navigate to the directory containing the script and run:
python pdf_parser.py
The script extracts the text from the specified sections of the PDF and saves each section as a text file in the directory you run it from (the script's directory if you follow the step above). Each file is named using the PDF title and section name, in lowercase with underscores instead of spaces (e.g., your_pdf_title_table_of_contents.txt).
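The naming rule can be sketched as a small function. It mirrors the lower-case and space-to-underscore steps used by the script; the function name section_filename is illustrative and does not appear in the script itself.

```python
def section_filename(title, section_name):
    """Build the output file name from the PDF title and a section name."""
    base = title.lower().replace(" ", "_")
    section = section_name.lower().replace(" ", "_")
    return f"{base}_{section}.txt"


print(section_filename("Your PDF Title", "Table of Contents"))
# your_pdf_title_table_of_contents.txt
```

Only spaces are replaced. Characters such as "/" or ":" in a title or section name are passed through unchanged and may not be valid in file names on every system.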
The script will print a confirmation message for each section extracted and a final message once all extractions are complete:
Text extraction for Table of Contents complete. Output saved to your_pdf_title_table_of_contents.txt.
Text extraction for Introduction complete. Output saved to your_pdf_title_introduction.txt.
...
All text extractions complete.
If your PDF document is named avionics.pdf and your title is "Avionics Display Manual for IFD Series", the script setup might look like this:
pdf_document = "avionics.pdf"
title_of_pdf = "Avionics Display Manual for IFD Series"
start_to_stop_pages_of_sections = [
    [3, 10, "Table of Contents"],
    [9, 66, "1 System Overview"],
    [67, 90, "2 SVS Subsystem"],
    [91, 156, "3 FMS Subsystem"],
    [157, 236, "4 Map Subsystem"],
    [237, 336, "5 Aux Subsystem"],
    [337, 378, "6 Navigation"],
    [379, 466, "7 General"],
    [467, 472, "Index"],
    [473, 474, "Support and Contact Information"],
]
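Before running the script, a section list like the one above can be sanity-checked. The sketch below is not part of the script: check_sections and its optional page_count parameter are hypothetical names, and it assumes pages are 1-based and inclusive, as the script treats them. Run on the first entries of the example, it reports that pages 9 and 10 fall in both of the first two sections, which may or may not be intended.

```python
def check_sections(sections, page_count=None):
    """Return a list of human-readable problems found in the section list.

    Each entry is [first_page, last_page, name] with 1-based, inclusive pages.
    page_count is optional; pass the PDF's page count to check upper bounds.
    """
    problems = []
    for start, end, name in sections:
        if start < 1 or end < start:
            problems.append(f"{name}: invalid range {start}-{end}")
        if page_count is not None and end > page_count:
            problems.append(f"{name}: ends at {end}, past page {page_count}")
    # Compare neighbouring sections (in page order) to find shared pages
    ordered = sorted(sections, key=lambda s: s[0])
    for (s1, e1, n1), (s2, e2, n2) in zip(ordered, ordered[1:]):
        if s2 <= e1:
            problems.append(f"{n1} and {n2} overlap on pages {s2}-{min(e1, e2)}")
    return problems


sections = [
    [3, 10, "Table of Contents"],
    [9, 66, "1 System Overview"],
    [67, 90, "2 SVS Subsystem"],
]
for problem in check_sections(sections):
    print(problem)
# Table of Contents and 1 System Overview overlap on pages 9-10
```

Only neighbouring sections are compared, so a section nested entirely inside a much longer one could go unreported; that is a simplification of this sketch.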
After running the script, you will find text files named avionics_display_manual_for_ifd_series_table_of_contents.txt, avionics_display_manual_for_ifd_series_1_system_overview.txt, and so on.
Ensure that the page numbers in the start_to_stop_pages_of_sections list are correct and correspond to the sections you want to extract. The numbers refer to page positions in the PDF file, counted from 1, which may differ from the page numbers printed in the document. The script uses UTF-8 encoding to save the text files. This should handle most characters correctly, but if your PDF contains special characters or non-standard fonts, additional handling may be required.
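One possible form of additional handling is Unicode normalization of the extracted text, for example to expand typographic ligatures that some fonts produce. The sketch below uses the standard library's unicodedata module; the function name normalize_text is illustrative, and whether this step is appropriate depends on the document.

```python
import unicodedata


def normalize_text(text):
    """Apply NFKC normalization, which expands ligatures such as U+FB01 to 'fi'."""
    return unicodedata.normalize("NFKC", text)


print(normalize_text("con\ufb01guration"))  # configuration
```

NFKC also folds other compatibility characters, such as superscript digits, into their plain forms, so it may not suit text where those distinctions matter.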
### AiWerkz.com
##
## PDF Parser
##
## July 30, 2028
## Robert L. Vaughn
import pdfplumber
import os
# User inputs
pdf_document = "avionics.pdf"
title_of_pdf = "Avionics Display Manual for IFD Series"
start_to_stop_pages_of_sections = [
    [3, 10, "Table of Contents"],
    [9, 66, "1 System Overview"],
    [67, 90, "2 SVS Subsystem"],
    [91, 156, "3 FMS Subsystem"],
    [157, 236, "4 Map Subsystem"],
    [237, 336, "5 Aux Subsystem"],
    [337, 378, "6 Navigation"],
    [379, 466, "7 General"],
    [467, 472, "Index"],
    [473, 474, "Support and Contact Information"],
]
# Convert title to lower case with underscores
base_filename = title_of_pdf.lower().replace(" ", "_")
# Function to extract text and save to file
def extract_text_and_save(pdf_path, base_filename, sections):
    with pdfplumber.open(pdf_path) as pdf:
        for start, end, section_name in sections:
            text = ""
            for i in range(start - 1, end):  # Adjust for zero-based indexing
                # extract_text() may return None for a page without text
                # in some pdfplumber versions, so fall back to ""
                text += pdf.pages[i].extract_text() or ""
            # Format the filename
            section_filename = f"{base_filename}_{section_name.lower().replace(' ', '_')}.txt"
            # Save the extracted text to a file using UTF-8 encoding
            with open(section_filename, "w", encoding="utf-8") as text_file:
                text_file.write(text)
            print(f"Text extraction for {section_name} complete. Output saved to {section_filename}.")
# Run the extraction
extract_text_and_save(pdf_document, base_filename, start_to_stop_pages_of_sections)
print("All text extractions complete.")
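The extraction loop above indexes pdf.pages directly, so a page range that runs past the last page raises an IndexError. A more tolerant variant is sketched below. It is an assumption-laden illustration rather than part of the script: collect_text is a hypothetical helper, FakePage exists only so the example runs without a PDF file, and the helper joins pages with a newline, whereas the script concatenates them directly.

```python
def collect_text(pages, start, end):
    """Join the text of pages start..end (1-based, inclusive).

    `pages` is any sequence whose items have an extract_text() method,
    such as pdf.pages from pdfplumber. Pages past the end are skipped.
    """
    last = min(end, len(pages))
    parts = []
    for page in pages[start - 1:last]:
        # Treat a page with no extractable text as an empty string
        parts.append(page.extract_text() or "")
    return "\n".join(parts)


class FakePage:
    """Stand-in for a PDF page so the sketch runs without a PDF file."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


pages = [FakePage("one"), FakePage(None), FakePage("three")]
print(repr(collect_text(pages, 1, 5)))  # 'one\n\nthree'
```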