Metadata-Version: 2.4
Name: pdfminer.six
Version: 20251230
Summary: PDF parser and analyzer
Author: Yusuke Shinyama, Pieter Marsman
Author-email: Philippe Guglielmetti <pdfminer@goulu.net>
License-Expression: MIT
Project-URL: Homepage, https://github.com/pdfminer/pdfminer.six
Keywords: layout analysis,pdf converter,pdf parser,text mining
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Text Processing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: charset-normalizer>=2.0.0
Requires-Dist: cryptography>=36.0.0
Provides-Extra: image
Requires-Dist: Pillow; extra == "image"
Dynamic: license-file

pdfminer.six
============

[Continuous integration](https://github.com/pdfminer/pdfminer.six/actions/workflows/actions.yml)
[PyPI](https://pypi.python.org/pypi/pdfminer.six/)
[Gitter](https://gitter.im/pdfminer-six/Lobby?utm_source=badge&utm_medium)

*We fathom PDF*

Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF
documents, with a focus on getting and analyzing text data. Pdfminer.six extracts the text of a page directly from the
source code of the PDF. It can also report the exact location, font, or color of the text.

It is built in a modular way, so each component of pdfminer.six can be replaced easily. You can implement your own
interpreter or rendering device that uses the power of pdfminer.six for purposes other than text analysis.

Check out the full documentation on
[Read the Docs](https://pdfminersix.readthedocs.io).

Features
--------

* Written entirely in Python.
* Parse, analyze, and convert PDF documents.
* Extract content as text, images, HTML or [hOCR](https://en.wikipedia.org/wiki/HOCR).
* Support for the PDF-1.7 specification (well, almost).
* Support for CJK languages and vertical writing.
* Support for various font types (Type1, TrueType, Type3, and CID).
* Support for extracting embedded images (JPG, PNG, TIFF, JBIG2, bitmaps).
* Support for decoding various compressions (ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode,
  CCITTFaxDecode).
* Support for RC4 and AES encryption.
* Support for AcroForm interactive form extraction.
* Table of contents extraction.
* Tagged contents extraction.
* Automatic layout analysis.
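
The layout analysis listed above can also report where each character sits and which font it uses. The following is a minimal sketch rather than an official recipe: it assumes `extract_pages`, `LTTextContainer`, `LTTextLine` and `LTChar` behave as described in the pdfminer.six documentation, and the helper names `iter_chars` and `font_histogram` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

# (character, font name, font size, bounding box as (x0, y0, x1, y1))
CharInfo = Tuple[str, str, float, tuple]


def iter_chars(pdf_path: str) -> Iterator[CharInfo]:
    """Yield one CharInfo tuple per character found by the layout analysis."""
    # Imported lazily so font_histogram() below works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            # Skip figures, lines, rectangles and other non-text objects.
            if not isinstance(element, LTTextContainer):
                continue
            for text_line in element:
                if not isinstance(text_line, LTTextLine):
                    continue
                for obj in text_line:
                    # Lines also hold LTAnno objects (inserted spaces and
                    # newlines), which carry no font or position.
                    if isinstance(obj, LTChar):
                        yield obj.get_text(), obj.fontname, obj.size, obj.bbox


def font_histogram(chars: Iterable[CharInfo]) -> Counter:
    """Count characters per (font name, size rounded to one decimal)."""
    counts: Counter = Counter()
    for _text, fontname, size, _bbox in chars:
        counts[(fontname, round(size, 1))] += 1
    return counts
```

Calling `font_histogram(iter_chars("example.pdf")).most_common(3)` would then list the three most used font and size combinations. Bounding boxes are expressed in the PDF's own coordinate space, so expect the origin at the bottom-left of the page.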

How to use
----------

* Install Python 3.10 or newer.
* Install pdfminer.six.

```bash
pip install pdfminer.six
```

* (Optionally) install extra dependencies for extracting images.

```bash
pip install 'pdfminer.six[image]'
```

* Use the command-line interface to extract text from a PDF.

```bash
pdf2txt.py example.pdf
```

* Or use it with Python.

```python
from pdfminer.high_level import extract_text

text = extract_text("example.pdf")
print(text)
```
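
For multi-page documents it can help to split the returned string per page. The sketch below assumes that the text converter behind `extract_text` ends each page with a form feed character (`\f`); check this against your installed version before relying on it. `split_pages` and `extract_text_by_page` are hypothetical helper names, not part of pdfminer.six.

```python
def split_pages(text: str) -> list[str]:
    """Split extract_text() output into per-page strings on form feeds."""
    pages = text.split("\f")
    # A form feed after the last page leaves an empty final chunk; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def extract_text_by_page(pdf_path: str) -> list[str]:
    """Return the text of each page of the PDF as a separate string."""
    # Imported lazily so split_pages() can be used without pdfminer.six.
    from pdfminer.high_level import extract_text

    return split_pages(extract_text(pdf_path))
```

With this in place, `len(extract_text_by_page("example.pdf"))` should match the page count as long as the form feed assumption holds.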

Contributing
------------

We welcome contributions! Whether you want to fix a bug, add a feature, or improve documentation, your help is appreciated.

Please note that as a community-maintained project with limited maintainer availability, the best way to get an issue resolved is to submit a pull request yourself.

To get started:
1. Read [CONTRIBUTING.md](CONTRIBUTING.md) for setup instructions and coding standards.
2. Check out the [open issues](https://github.com/pdfminer/pdfminer.six/issues) to find something to work on.
3. Join the discussion on [Gitter](https://gitter.im/pdfminer-six/Lobby) if you have questions.

Acknowledgement
---------------

This repository includes code from `pyHanko`; the original license has been included [here](/docs/licenses/LICENSE.pyHanko).