PLAYA-PDF Command Line Interface

PLAYA's CLI, which can get stuff out of PDFs for you.

This used to extract arbitrary properties of arbitrary graphical objects as a CSV, but for that you want PAVÉS now.

By default this will just print some hopefully useful metadata about all the pages and indirect objects in the PDF, as a JSON dictionary, not because we love JSON, but because it's built-in and easy to parse and we hate XML a lot more. This dictionary will always contain the following keys (but will probably contain more in the future):

pdf_version: self-explanatory
is_printable: whether you should be allowed to print this PDF
is_modifiable: whether you should be allowed to modify this PDF
is_extractable: whether you should be allowed to extract text from this PDF (LOL)
pages: list of descriptions of pages, containing:
- objid: the indirect object ID of the page descriptor
- label: a (possibly made up) page label
- mediabox: the boundaries of the page in default user space
- cropbox: the cropping box in default user space
- rotate: the rotation of the page in degrees (no radians for you)
objects: list of all indirect objects (including those in object streams, as well as the object streams themselves), containing:
- objid: the object number
- genno: the generation number
- type: the type of object this is
- obj: a best-effort JSON serialization of the object's metadata. In the case of simple objects like strings, dictionaries, or lists, this is the object itself. Object references are converted to a string representation of the form "", while content streams are represented by their properties dictionary.
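
Since the default output is just a JSON dictionary, poking at it from Python takes a few lines. The following is a minimal sketch and not part of PLAYA: the SAMPLE document is made up, the "type" values in it are placeholders, and it assumes that mediabox is serialized as a list of four numbers (left, bottom, right, top). Only the key names come from the list above.

```python
import json
from collections import Counter

# Hypothetical sample shaped like the documented keys. All values are
# invented, including the "type" strings and the list form of the boxes.
SAMPLE = """
{
  "pdf_version": "1.7",
  "is_printable": true,
  "is_modifiable": false,
  "is_extractable": true,
  "pages": [
    {"objid": 3, "label": "i", "mediabox": [0, 0, 612, 792],
     "cropbox": [0, 0, 612, 792], "rotate": 0},
    {"objid": 7, "label": "1", "mediabox": [0, 0, 612, 792],
     "cropbox": [0, 0, 612, 792], "rotate": 90}
  ],
  "objects": [
    {"objid": 1, "genno": 0, "type": "dict", "obj": {"Type": "Catalog"}},
    {"objid": 3, "genno": 0, "type": "dict", "obj": {"Type": "Page"}},
    {"objid": 4, "genno": 0, "type": "stream", "obj": {"Length": 42}}
  ]
}
"""


def summarize(metadata):
    """Return (page summaries, object counts by type) for a metadata dict.

    Each page summary is (label, width, height, rotate), with width and
    height computed from the mediabox (assumed to be four numbers).
    """
    pages = []
    for page in metadata["pages"]:
        x0, y0, x1, y1 = page["mediabox"]
        pages.append((page["label"], x1 - x0, y1 - y0, page["rotate"]))
    counts = Counter(obj["type"] for obj in metadata["objects"])
    return pages, dict(counts)


pages, counts = summarize(json.loads(SAMPLE))
for label, width, height, rotate in pages:
    print(f"page {label}: {width} x {height}, rotated {rotate}")
print(counts)
```

With PLAYA installed, the same function could presumably be fed the parsed output of `playa foo.pdf` instead of SAMPLE, bearing in mind that the real dictionary may contain more keys than the ones listed here.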

Bucking the trend of the last 20 years towards horribly slow Click-addled CLIs with deeply nested subcommands, anything else is just a command-line option away. You may for instance want to decode a particular (object, content, whatever) stream:

playa --stream 123 foo.pdf

Or recursively expand the document catalog into a horrible mess of JSON:

playa --catalog foo.pdf

You can look at the content streams for one or more or all pages:

playa --content-streams foo.pdf
playa --pages 1 --content-streams foo.pdf
playa --pages 3,4,9 --content-streams foo.pdf

And you can get the logical structure tree, including the text of content items (for properly tagged PDFs this is more useful than just getting the raw text):

playa --structure foo.pdf

You can even... sort of... use this to extract text (don't @ me). On the one hand you can get a torrent of JSON for one or more or all pages, with each fragment of text and all of its properties (position, font, color, etc.):

playa --text-objects foo.pdf
playa --pages 4-6 --text-objects foo.pdf
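
The --pages examples so far use three shapes of page specification: a single page (1), a comma-separated list (3,4,9), and a range (4-6). As a rough illustration of what such a spec means, here is a minimal sketch that expands one into a list of page numbers. This is not PLAYA's own parser: the function name is hypothetical, and it assumes that ranges include both endpoints.

```python
def expand_page_spec(spec):
    """Expand a spec such as "1", "3,4,9" or "4-6" into a list of page numbers.

    Assumes ranges are inclusive on both ends (an assumption, not
    something PLAYA documents here).
    """
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-", 1)
            pages.extend(range(int(first), int(last) + 1))
        else:
            pages.append(int(part))
    return pages


for spec in ("1", "3,4,9", "4-6"):
    print(spec, "->", expand_page_spec(spec))
```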

But also, if you have a Tagged PDF, then in theory it has a defined reading order, and so we can actually really extract the text from it (this also works with untagged PDFs, but your mileage may vary):

playa --text tagged-foo.pdf

And finally yes you can also extract images (not necessarily useful since they are frequently tiled and/or composited):

playa --images outdir foo.pdf