Working in the PDF mine

pdfminer.six is widely used for text extraction and layout analysis due to its liberal licensing terms. Unfortunately it is quite slow and contains many bugs. Now you can use PLAYA instead:

from playa.miner import extract, LAParams

laparams = LAParams()
for page in extract(path, laparams):
    ...  # do something with each page

This is generally faster than pdfminer.six. You can often make it even faster on large documents by running in parallel with the max_workers argument, which is the same as the one you will find in concurrent.futures.ProcessPoolExecutor. If you pass None it will use all your CPUs, but due to some unavoidable overhead, it usually doesn't help to use more than 2-4:

for page in extract(path, laparams, max_workers=2):
    ...  # do something with each page

There are a few differences with pdfminer.six (some might call them bug fixes):

By default, if you do not pass the laparams argument to extract, no layout analysis at all is done. This is different from extract_pages in pdfminer.six which will set some default parameters for you. If you don't see any LTTextBox items in your LTPage then this is why!
Rectangles are recognized correctly in some cases where pdfminer.six thought they were "curves".
Colours and colour spaces are the PLAYA versions, which do not correspond to what pdfminer.six gives you, because what pdfminer.six gives you is not useful and often wrong.
You have access to the list of enclosing marked content sections in every LTComponent, as the mcstack attribute.
The bounding box of a rotated glyph is its actual bounding box.

Probably more... but you didn't use any of that stuff anyway, you just wanted to get LTTextBoxes to feed to your hallucination factories.
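If text boxes really are all you need, the loop can be wrapped in a small helper. What follows is a minimal sketch and not part of PLAYA: `iter_textbox_text` is a hypothetical name, and it assumes that each page is iterable over its layout items and that text boxes have a `get_text()` method, as they do in pdfminer.six. It matches classes by name only so that the sketch stays self-contained.

```python
from typing import Any, Iterable, Iterator


def iter_textbox_text(pages: Iterable[Any]) -> Iterator[str]:
    """Yield the text of every top-level LTTextBox* item on each page.

    Assumptions: pages are iterable over their layout items, and text
    boxes expose get_text() as in pdfminer.six.
    """
    for page in pages:
        for item in page:
            # Matches LTTextBoxHorizontal and LTTextBoxVertical by class
            # name, so nothing has to be imported for this sketch.
            if type(item).__name__.startswith("LTTextBox"):
                yield item.get_text()


# Hypothetical usage with PLAYA installed and a real file path:
#
#     from playa.miner import extract, LAParams
#     for text in iter_textbox_text(extract(path, LAParams())):
#         print(text)
```

If this yields nothing, check that you passed `laparams` to `extract`, since no text boxes are created without it.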

Reference

`playa.miner`

Reimplementation of pdfminer.six layout analysis on top of PLAYA.

`GraphicState` `dataclass`

PDF graphics state (PDF 1.7 section 8.4) including text state (PDF 1.7 section 9.3.1), but excluding coordinate transformations.

Contrary to the pretensions of pdfminer.six, the text state is for the most part not at all separate from the graphics state, and can be updated outside the confines of BT and ET operators, thus there is no advantage and only confusion that comes from treating it separately.

The only state that does not persist outside BT / ET pairs is the text coordinate space (line matrix and text rendering matrix), and it is also the only part that is updated during iteration over a TextObject.

For historical reasons the main coordinate transformation matrix, though it is also part of the graphics state, is also stored separately.

Attributes:

Name	Type	Description
`clipping_path`	`None`	The current clipping path (sec. 8.5.4)
`linewidth`	`float`	Line width in user space units (sec. 8.4.3.2)
`linecap`	`int`	Line cap style (sec. 8.4.3.3)
`linejoin`	`int`	Line join style (sec. 8.4.3.4)
`miterlimit`	`float`	Maximum length of mitered line joins (sec. 8.4.3.5)
`dash`	`DashPattern`	Dash pattern for stroking (sec. 8.4.3.6)
`intent`	`PSLiteral`	Rendering intent (sec. 8.6.5.8)
`stroke_adjustment`	`bool`	A flag specifying whether to compensate for possible rasterization effects when stroking a path with a line width that is small relative to the pixel resolution of the output device (sec. 10.7.5)
`blend_mode`	`Union[PSLiteral, List[PSLiteral]]`	The current blend mode that shall be used in the transparent imaging model (sec. 11.3.5)
`smask`	`Union[None, Dict[str, PDFObject]]`	A soft-mask dictionary (sec. 11.6.5.1) or None
`salpha`	`float`	The constant shape or constant opacity value used for stroking operations (sec. 11.3.7.2 & 11.6.4.4)
`nalpha`	`float`	The constant shape or constant opacity value used for non-stroking operations
`alpha_source`	`bool`	A flag specifying whether the current soft mask and alpha constant parameters shall be interpreted as shape values (true) or opacity values (false). This flag also governs the interpretation of the SMask entry, if any, in an image dictionary
`black_pt_comp`	`PSLiteral`	The black point compensation algorithm that shall be used when converting CIE-based colours (sec. 8.6.5.9)
`flatness`	`float`	The precision with which curves shall be rendered on the output device (sec. 10.6.2)
`scolor`	`Color`	Colour used for stroking operations
`scs`	`ColorSpace`	Colour space used for stroking operations
`ncolor`	`Color`	Colour used for non-stroking operations
`ncs`	`ColorSpace`	Colour space used for non-stroking operations
`font`	`Union[Font, None]`	The current font.
`fontsize`	`float`	The "font size" parameter, which is not the font size in points as you might understand it, but rather a scaling factor applied to text space (so, it affects not only text size but position as well). Since most reasonable people find that behaviour rather confusing, this is often just 1.0, and PDFs rely on the text matrix to set the size of text.
`charspace`	`float`	Extra spacing to add after each glyph, expressed in unscaled text space units, meaning it is not affected by `fontsize`. BUT it will be modified by `scaling` for horizontal writing mode (so, most of the time).
`wordspace`	`float`	Extra spacing to add after a space glyph, defined very specifically as the glyph encoded by the single-byte character code 32 (SPOILER: it is probably a space). Also expressed in unscaled text space units, but modified by `scaling`.
`scaling`	`float`	The horizontal scaling as a percentage, as given to the `Tz` operator in the PDF standard (the default is 100). Divide it by 100 to get the actual scale factor.
`leading`	`float`	The leading as defined by the PDF standard, in unscaled text space units.
`render_mode`	`int`	The PDF rendering mode. The really important one here is 3, which means "don't render the text". You might want to use this to detect invisible text.
`rise`	`float`	The text rise (superscript or subscript position), in unscaled text space units.
`knockout`	`bool`	The text knockout flag, which determines the behaviour of overlapping glyphs within a text object in the transparent imaging model (sec. 9.3.8)
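The interaction between `fontsize`, `charspace`, `wordspace` and `scaling` is easier to see in code. The sketch below follows the horizontal displacement formula from the PDF standard (PDF 1.7 section 9.4.4), leaving out the per-glyph adjustments of the `TJ` operator. `horizontal_advance` is a hypothetical helper for illustration, not a PLAYA function, and `glyph_width` is assumed to be already expressed in text space units (typically the glyph-space width divided by 1000).

```python
def horizontal_advance(
    glyph_width: float,
    fontsize: float,
    charspace: float = 0.0,
    wordspace: float = 0.0,
    scaling: float = 100.0,
    is_space: bool = False,
) -> float:
    """Horizontal displacement after drawing one glyph, in text space units.

    scaling is a percentage, as stored in the graphics state.
    """
    th = scaling / 100.0  # actual horizontal scale factor
    # Word spacing only applies to the single-byte character code 32
    tw = wordspace if is_space else 0.0
    # Only the glyph width is multiplied by fontsize; the spacing terms
    # are not, but the whole sum is multiplied by the horizontal scaling.
    return (glyph_width * fontsize + charspace + tw) * th
```

With a glyph width of 0.5, a font size of 12 and 50% scaling, a character spacing of 1 adds only 0.5 to the advance, which is the "BUT" mentioned in the description of `charspace`.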

Source code in playa/content.py

@dataclass
class GraphicState:
    """PDF graphics state (PDF 1.7 section 8.4) including text state
    (PDF 1.7 section 9.3.1), but excluding coordinate transformations.

    Contrary to the pretensions of pdfminer.six, the text state is for
    the most part not at all separate from the graphics state, and can
    be updated outside the confines of `BT` and `ET` operators, thus
    there is no advantage and only confusion that comes from treating
    it separately.

    The only state that does not persist outside `BT` / `ET` pairs is
    the text coordinate space (line matrix and text rendering matrix),
    and it is also the only part that is updated during iteration over
    a `TextObject`.

    For historical reasons the main coordinate transformation matrix,
    though it is also part of the graphics state, is also stored
    separately.

    Attributes:
      clipping_path: The current clipping path (sec. 8.5.4)
      linewidth: Line width in user space units (sec. 8.4.3.2)
      linecap: Line cap style (sec. 8.4.3.3)
      linejoin: Line join style (sec. 8.4.3.4)
      miterlimit: Maximum length of mitered line joins (sec. 8.4.3.5)
      dash: Dash pattern for stroking (sec 8.4.3.6)
      intent: Rendering intent (sec. 8.6.5.8)
      stroke_adjustment: A flag specifying whether to compensate for
        possible rasterization effects when stroking a path with a line
        width that is small relative to the pixel resolution of the output
        device (sec. 10.7.5)
      blend_mode: The current blend mode that shall be used in the
        transparent imaging model (sec. 11.3.5)
      smask: A soft-mask dictionary (sec. 11.6.5.1) or None
      salpha: The constant shape or constant opacity value used for
        stroking operations (sec. 11.3.7.2 & 11.6.4.4)
      nalpha: The constant shape or constant opacity value used for
        non-stroking operations
      alpha_source: A flag specifying whether the current soft mask and
        alpha constant parameters shall be interpreted as shape values
        (true) or opacity values (false). This flag also governs the
        interpretation of the SMask entry, if any, in an image dictionary
      black_pt_comp: The black point compensation algorithm that shall be
        used when converting CIE-based colours (sec. 8.6.5.9)
      flatness: The precision with which curves shall be rendered on
        the output device (sec. 10.6.2)
      scolor: Colour used for stroking operations
      scs: Colour space used for stroking operations
      ncolor: Colour used for non-stroking operations
      ncs: Colour space used for non-stroking operations
      font: The current font.
      fontsize: The "font size" parameter, which is **not** the font
        size in points as you might understand it, but rather a
        scaling factor applied to text space (so, it affects not only
        text size but position as well).  Since most reasonable people
        find that behaviour rather confusing, this is often just 1.0,
        and PDFs rely on the text matrix to set the size of text.
      charspace: Extra spacing to add after each glyph, expressed in
        unscaled text space units, meaning it is not affected by
        `fontsize`.  **BUT** it will be modified by `scaling` for
        horizontal writing mode (so, most of the time).
      wordspace: Extra spacing to add after a space glyph, defined
        very specifically as the glyph encoded by the single-byte
        character code 32 (SPOILER: it is probably a space).  Also
        expressed in unscaled text space units, but modified by
        `scaling`.
      scaling: The horizontal scaling factor as defined by the PDF
        standard (that is, divided by 100).
      leading: The leading as defined by the PDF standard, in unscaled
        text space units.
      render_mode: The PDF rendering mode.  The really important one
        here is 3, which means "don't render the text".  You might
        want to use this to detect invisible text.
      rise: The text rise (superscript or subscript position), in
        unscaled text space units.
      knockout: The text knockout flag, shall determine the behaviour of
        overlapping glyphs within a text object in the transparent imaging
        model (sec. 9.3.8)

    """

    clipping_path: None = None  # TODO
    linewidth: float = 1
    linecap: int = 0
    linejoin: int = 0
    miterlimit: float = 10
    dash: DashPattern = SOLID_LINE
    intent: PSLiteral = LITERAL_RELATIVE_COLORIMETRIC
    stroke_adjustment: bool = False
    blend_mode: Union[PSLiteral, List[PSLiteral]] = LITERAL_NORMAL
    smask: Union[None, Dict[str, PDFObject]] = None
    salpha: float = 1
    nalpha: float = 1
    alpha_source: bool = False
    black_pt_comp: PSLiteral = LITERAL_DEFAULT
    flatness: float = 1
    scolor: Color = BASIC_BLACK
    scs: ColorSpace = PREDEFINED_COLORSPACE["DeviceGray"]
    ncolor: Color = BASIC_BLACK
    ncs: ColorSpace = PREDEFINED_COLORSPACE["DeviceGray"]
    font: Union[Font, None] = None
    fontsize: float = 0
    charspace: float = 0
    wordspace: float = 0
    scaling: float = 100
    leading: float = 0
    render_mode: int = 0
    rise: float = 0
    knockout: bool = True

`LAParams`

Parameters for layout analysis

Parameters:

Name	Type	Description	Default
`line_overlap`	`float`	If two characters have more overlap than this they are considered to be on the same line. The overlap is specified relative to the minimum height of both characters.	`0.5`
`char_margin`	`float`	If two characters are closer together than this margin they are considered part of the same line. The margin is specified relative to the width of the character.	`2.0`
`word_margin`	`float`	If two characters on the same line are further apart than this margin then they are considered to be two separate words, and an intermediate space will be added for readability. The margin is specified relative to the width of the character.	`0.1`
`line_margin`	`float`	If two lines are closer together than this margin they are considered to be part of the same paragraph. The margin is specified relative to the height of a line.	`0.5`
`boxes_flow`	`Union[float, None]`	Specifies how much the horizontal and vertical positions of a piece of text matter when determining the order of text boxes. The value should be within the range of -1.0 (only horizontal position matters) to +1.0 (only vertical position matters). You can also pass `None` to disable advanced layout analysis, and instead return text based on the position of the bottom left corner of the text box.	`0.5`
`detect_vertical`	`bool`	If vertical text should be considered during layout analysis	`False`
`all_texts`	`bool`	If layout analysis should be performed on text in figures.	`False`
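To make the interaction of `line_overlap` and `char_margin` concrete, here is a simplified sketch of the horizontal alignment test that `group_objects` applies to two neighbouring characters. `Box` and `on_same_line` are hypothetical stand-ins for illustration, not part of the PLAYA API. The overlap and distance helpers mirror the corresponding `LTComponent` methods.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for a character's bounding box."""

    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def voverlap(a: Box, b: Box) -> float:
    """Mirrors LTComponent.voverlap: 0 if the boxes do not overlap vertically."""
    if b.y0 <= a.y1 and a.y0 <= b.y1:
        return min(abs(a.y0 - b.y1), abs(a.y1 - b.y0))
    return 0.0


def hdistance(a: Box, b: Box) -> float:
    """Mirrors LTComponent.hdistance: 0 if the boxes overlap horizontally."""
    if b.x0 <= a.x1 and a.x0 <= b.x1:
        return 0.0
    return min(abs(a.x0 - b.x1), abs(a.x1 - b.x0))


def on_same_line(
    a: Box, b: Box, line_overlap: float = 0.5, char_margin: float = 2.0
) -> bool:
    # Overlap threshold is relative to the smaller of the two heights
    enough_overlap = min(a.height, b.height) * line_overlap < voverlap(a, b)
    # Gap threshold is relative to the larger of the two widths
    close_enough = hdistance(a, b) < max(a.width, b.width) * char_margin
    return enough_overlap and close_enough
```

Raising `char_margin` tolerates wider gaps between characters on a line, while raising `line_overlap` demands that they sit closer to the same baseline.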

Source code in playa/miner.py

class LAParams:
    """Parameters for layout analysis

    Args:
      line_overlap: If two characters have more overlap than this they
        are considered to be on the same line. The overlap is specified
        relative to the minimum height of both characters.
      char_margin: If two characters are closer together than this
        margin they are considered part of the same line. The margin is
        specified relative to the width of the character.
      word_margin: If two characters on the same line are further apart
        than this margin then they are considered to be two separate words, and
        an intermediate space will be added for readability. The margin is
        specified relative to the width of the character.
      line_margin: If two lines are closer together than this margin they are
        considered to be part of the same paragraph. The margin is
        specified relative to the height of a line.
      boxes_flow: Specifies how much a horizontal and vertical position
        of a text matters when determining the order of text boxes. The value
        should be within the range of -1.0 (only horizontal position
        matters) to +1.0 (only vertical position matters). You can also pass
        `None` to disable advanced layout analysis, and instead return text
        based on the position of the bottom left corner of the text box.
      detect_vertical: If vertical text should be considered during
        layout analysis
      all_texts: If layout analysis should be performed on text in
        figures.
    """

    def __init__(
        self,
        line_overlap: float = 0.5,
        char_margin: float = 2.0,
        line_margin: float = 0.5,
        word_margin: float = 0.1,
        boxes_flow: Union[float, None] = 0.5,
        detect_vertical: bool = False,
        all_texts: bool = False,
    ) -> None:
        self.line_overlap = line_overlap
        self.char_margin = char_margin
        self.line_margin = line_margin
        self.word_margin = word_margin
        self.boxes_flow = boxes_flow
        self.detect_vertical = detect_vertical
        self.all_texts = all_texts

        self._validate()

    def _validate(self) -> None:
        if self.boxes_flow is not None:
            boxes_flow_err_msg = (
                "LAParam boxes_flow should be None, or a number between -1 and +1"
            )
            if not isinstance(self.boxes_flow, (int, float)):
                raise PDFTypeError(boxes_flow_err_msg)
            if not -1 <= self.boxes_flow <= 1:
                raise PDFValueError(boxes_flow_err_msg)

    def __repr__(self) -> str:
        return (
            "<LAParams: char_margin=%.1f, line_margin=%.1f, "
            "word_margin=%.1f all_texts=%r>"
            % (self.char_margin, self.line_margin, self.word_margin, self.all_texts)
        )

`LTAnno`

Bases: LTItem, LTText

Actual letter in the text as a Unicode string.

Note that, while an LTChar object has actual boundaries, LTAnno objects do not, as these are "virtual" characters, inserted by a layout analyzer according to the relationship between two characters (e.g. a space).

Source code in playa/miner.py

class LTAnno(LTItem, LTText):
    """Actual letter in the text as a Unicode string.

    Note that, while an LTChar object has actual boundaries, LTAnno objects do
    not, as these are "virtual" characters, inserted by a layout analyzer
    according to the relationship between two characters (e.g. a space).
    """

    def __init__(self, text: Union[str, None] = None) -> None:
        if text is None:
            # No initialization, for pickling purposes
            return
        self._text = text

    def get_text(self) -> str:
        return self._text
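Because virtual characters have no boundaries, any code that computes geometry over the contents of a text line has to skip them. The following is a minimal sketch under that assumption: `union_bbox` is a hypothetical helper, not part of PLAYA, and it relies only on the `bbox` attribute that `LTComponent.set_bbox` assigns.

```python
from typing import Any, Iterable, Optional, Tuple

Rect = Tuple[float, float, float, float]


def union_bbox(items: Iterable[Any]) -> Optional[Rect]:
    """Smallest rectangle enclosing every item that has a bounding box.

    Items without a bbox attribute (such as LTAnno) are ignored.
    Returns None if no item has a bounding box.
    """
    boxes = [item.bbox for item in items if hasattr(item, "bbox")]
    if not boxes:
        return None
    x0s, y0s, x1s, y1s = zip(*boxes)
    return (min(x0s), min(y0s), max(x1s), max(y1s))
```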

`LTChar`

Bases: LTComponent, LTText

Actual letter in the text as a Unicode string.

Source code in playa/miner.py

class LTChar(LTComponent, LTText):
    """Actual letter in the text as a Unicode string."""

    def __init__(
        self,
        glyph: Union[GlyphObject, None] = None,
    ) -> None:
        super().__init__()
        if glyph is None:
            # No initialization, for pickling purposes
            return
        gstate = glyph.gstate
        matrix = glyph.matrix
        font = glyph.font
        if glyph.text is None:
            logger.debug("undefined: %r, %r", font, glyph.cid)
            # Horrible awful pdfminer.six behaviour
            self._text = "(cid:%d)" % glyph.cid
        else:
            self._text = glyph.text
        self.mcstack = glyph.mcstack
        self.fontname = font.fontname
        self.graphicstate = gstate
        self.render_mode = gstate.render_mode
        self.stroking_color = gstate.scolor
        self.non_stroking_color = gstate.ncolor
        self.scs = gstate.scs
        self.ncs = gstate.ncs
        scaling = gstate.scaling * 0.01
        fontsize = gstate.fontsize
        (a, b, c, d, e, f) = matrix
        # FIXME: Still really not sure what this means
        self.upright = a * d * scaling > 0 and b * c <= 0
        # Unscale the matrix to match pdfminer.six
        xscale = 1 / (fontsize * scaling)
        yscale = 1 / fontsize
        self.matrix = (a * xscale, b * yscale, c * xscale, d * yscale, e, f)
        # Recreate pdfminer.six's bogus bboxes
        if font.vertical:
            vdisp = font.vdisp(glyph.cid)
            self.adv = vdisp * fontsize
            vx, vy = font.position(glyph.cid)
            textbox = (-vx, vy + vdisp, -vx + 1, vy)
        else:
            textwidth = font.hdisp(glyph.cid)
            self.adv = textwidth * fontsize * scaling
            descent = font.descent * font.matrix[3]
            textbox = (0, descent, textwidth, descent + 1)
        miner_box = transform_bbox(matrix, textbox)
        super().__init__(miner_box, glyph.mcstack)
        # FIXME: This is quite wrong for rotated glyphs, but so is pdfminer.six
        if font.vertical:
            self.size = self.width
        else:
            self.size = self.height

    def __repr__(self) -> str:
        return (
            f"<{self.__class__.__name__} {bbox2str(self.bbox)} "
            f"matrix={matrix2str(self.matrix)} font={self.fontname!r} "
            f"adv={self.adv} text={self.get_text()!r}>"
        )

    def get_text(self) -> str:
        return self._text
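Since `LTChar` copies `render_mode` from the graphics state, invisible text (rendering mode 3) can be filtered out after extraction. This is a rough sketch rather than a PLAYA feature: `visible_text` is a hypothetical helper, and it treats items without a `render_mode` attribute, such as `LTAnno`, as visible.

```python
from typing import Any, Iterable

INVISIBLE_RENDER_MODE = 3  # the "don't render the text" mode


def visible_text(items: Iterable[Any]) -> str:
    """Join get_text() for every item that is not drawn invisibly.

    Items lacking a render_mode attribute are assumed to be visible.
    """
    return "".join(
        item.get_text()
        for item in items
        if getattr(item, "render_mode", 0) != INVISIBLE_RENDER_MODE
    )
```

Hidden OCR layers in scanned documents are commonly drawn this way, so whether to keep or drop such text depends on what you are extracting.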

`LTComponent`

Bases: LTItem

Object with a bounding box

Source code in playa/miner.py

class LTComponent(LTItem):
    """Object with a bounding box"""

    def __init__(
        self, bbox: Union[Rect, None] = None, mcstack: Tuple[MarkedContent, ...] = ()
    ) -> None:
        if bbox is None:
            # No initialization, for pickling purposes (see
            # https://mypyc.readthedocs.io/en/latest/differences_from_python.html#pickling-and-copying-objects)
            return
        self.set_bbox(bbox)
        self.mcstack = mcstack

    def __repr__(self) -> str:
        return f"<{self.__class__.__name__} {bbox2str(self.bbox)}>"

    def set_bbox(self, bbox: Rect) -> None:
        (x0, y0, x1, y1) = bbox
        self.x0 = x0
        self.y0 = y0
        self.x1 = x1
        self.y1 = y1
        self.width = x1 - x0
        self.height = y1 - y0
        self.bbox = bbox

    def is_empty(self) -> bool:
        return self.width <= 0 or self.height <= 0

    def is_hoverlap(self, obj: "LTComponent") -> bool:
        return obj.x0 <= self.x1 and self.x0 <= obj.x1

    def hdistance(self, obj: "LTComponent") -> float:
        if self.is_hoverlap(obj):
            return 0
        else:
            return min(abs(self.x0 - obj.x1), abs(self.x1 - obj.x0))

    def hoverlap(self, obj: "LTComponent") -> float:
        if self.is_hoverlap(obj):
            return min(abs(self.x0 - obj.x1), abs(self.x1 - obj.x0))
        else:
            return 0

    def is_voverlap(self, obj: "LTComponent") -> bool:
        return obj.y0 <= self.y1 and self.y0 <= obj.y1

    def vdistance(self, obj: "LTComponent") -> float:
        if self.is_voverlap(obj):
            return 0
        else:
            return min(abs(self.y0 - obj.y1), abs(self.y1 - obj.y0))

    def voverlap(self, obj: "LTComponent") -> float:
        if self.is_voverlap(obj):
            return min(abs(self.y0 - obj.y1), abs(self.y1 - obj.y0))
        else:
            return 0

`LTContainer`

Bases: LTComponent, Generic[LTItemT]

Object that can be extended and analyzed

Source code in playa/miner.py

class LTContainer(LTComponent, Generic[LTItemT]):
    """Object that can be extended and analyzed"""

    def __init__(
        self, bbox: Union[Rect, None] = None, mcstack: Tuple[MarkedContent, ...] = ()
    ) -> None:
        if bbox is None:
            # No initialization, for pickling purposes
            return
        super().__init__(bbox, mcstack)
        self._objs: List[LTItemT] = []

    def __iter__(self) -> Iterator[LTItemT]:
        return iter(self._objs)

    def __len__(self) -> int:
        return len(self._objs)

    def add(self, obj: LTItemT) -> None:
        self._objs.append(obj)

    def extend(self, objs: Iterable[LTItemT]) -> None:
        for obj in objs:
            self.add(obj)

    def analyze(self, laparams: LAParams) -> None:
        for obj in self._objs:
            obj.analyze(laparams)

`LTCurve`

Bases: LTComponent

A generic Bezier curve

The original_path attribute contains the path segments from the PDF (e.g. for reconstructing Bezier curves).

dashing_style contains the dash pattern, if any.

Source code in playa/miner.py

class LTCurve(LTComponent):
    """A generic Bezier curve

    The parameter `original_path` contains the original
    pathing information from the pdf (e.g. for reconstructing Bezier Curves).

    `dashing_style` contains the Dashing information if any.
    """

    def __init__(
        self,
        path: Union[PathObject, None] = None,
        pts: List[Point] = [],  # These are actually immutable so not a problem
        transformed_path: List[PathSegment] = [],
    ) -> None:
        if path is None:
            # No initialization, for pickling purposes
            return
        super().__init__(get_bound(pts), path.mcstack)
        self.pts = pts
        self.linewidth = path.gstate.linewidth
        self.stroke = path.stroke
        self.fill = path.fill
        self.evenodd = path.evenodd
        gstate = path.gstate
        self.graphicstate = gstate
        self.stroking_color = gstate.scolor
        self.non_stroking_color = gstate.ncolor
        self.scs = gstate.scs
        self.ncs = gstate.ncs
        self.original_path = transformed_path
        self.dashing_style = gstate.dash

    def get_pts(self) -> str:
        return ",".join("%.3f,%.3f" % p for p in self.pts)

`LTFigure`

Bases: LTLayoutContainer

Represents an area used by PDF Form objects.

PDF Forms can be used to present figures or pictures by embedding yet another PDF document within a page. Note that LTFigure objects can appear recursively.

Source code in playa/miner.py

class LTFigure(LTLayoutContainer):
    """Represents an area used by PDF Form objects.

    PDF Forms can be used to present figures or pictures by embedding yet
    another PDF document within a page. Note that LTFigure objects can appear
    recursively.
    """

    def __init__(self, obj: Union[ImageObject, XObjectObject, None] = None) -> None:
        if obj is None:
            # No initialization, for pickling purposes
            return
        if obj.xobjid is None:
            self.name = str(id(obj))
        else:
            self.name = obj.xobjid
        self.matrix = obj.ctm
        super().__init__(obj.bbox, obj.mcstack)

    def __repr__(self) -> str:
        return (
            f"<{self.__class__.__name__}({self.name}) "
            f"{bbox2str(self.bbox)} matrix={matrix2str(self.matrix)}>"
        )

    def analyze(self, laparams: LAParams) -> None:
        if not laparams.all_texts:
            return
        LTLayoutContainer.analyze(self, laparams)
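Because figures can nest, and their contents are not grouped into text boxes unless `all_texts` is set, reaching everything on a page takes a recursive walk. Below is a minimal sketch assuming only that containers are iterable, as `LTContainer` is. `walk` is a hypothetical helper, not part of PLAYA.

```python
from typing import Any, Iterator


def walk(item: Any) -> Iterator[Any]:
    """Yield item, then everything nested inside it, depth-first."""
    yield item
    try:
        children = iter(item)
    except TypeError:
        return  # a leaf such as a character or an image: nothing to descend into
    for child in children:
        yield from walk(child)
```

Filtering the result by type, for example keeping only `LTChar` items, then finds characters inside figures regardless of how deeply they are nested.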

`LTImage`

Bases: LTComponent

An image object.

Embedded images can be in JPEG, Bitmap or JBIG2.

Source code in playa/miner.py

class LTImage(LTComponent):
    """An image object.

    Embedded images can be in JPEG, Bitmap or JBIG2.
    """

    def __init__(self, obj: Union[ImageObject, None] = None) -> None:
        if obj is None:
            # No initialization, for pickling purposes
            return
        super().__init__(obj.bbox, obj.mcstack)
        # Inline images don't actually have an xobjid, so we make shit
        # up like pdfminer.six does.
        if obj.xobjid is None:
            self.name = str(id(obj))
        else:
            self.name = obj.xobjid
        self.stream = obj.stream
        self.srcsize = obj.srcsize
        self.imagemask = obj.imagemask
        self.bits = obj.bits
        self.colorspace = obj.colorspace

    def __repr__(self) -> str:
        return (
            f"<{self.__class__.__name__}({self.name})"
            f" {bbox2str(self.bbox)} {self.srcsize!r}>"
        )

`LTItem`

Interface for things that can be analyzed

Source code in playa/miner.py

@trait
class LTItem:
    """Interface for things that can be analyzed"""

    def analyze(self, laparams: LAParams) -> None:
        """Perform the layout analysis."""

`analyze(laparams)`

Perform the layout analysis.

Source code in playa/miner.py

def analyze(self, laparams: LAParams) -> None:
    """Perform the layout analysis."""

`LTLayoutContainer`

Bases: LTContainer[LTComponent]

Source code in playa/miner.py

class LTLayoutContainer(LTContainer[LTComponent]):
    def __init__(
        self, bbox: Union[Rect, None] = None, mcstack: Tuple[MarkedContent, ...] = ()
    ) -> None:
        if bbox is None:
            # No initialization, for pickling purposes
            return
        super().__init__(bbox, mcstack)
        self.groups: Union[List[LTTextGroup], None] = None

    # group_objects: group text object to textlines.
    def group_objects(
        self,
        laparams: LAParams,
        objs: Iterable[LTComponent],
    ) -> Iterator[LTTextLine]:
        obj0: Any = None
        line: Any = None
        for obj1 in objs:
            if obj0 is not None:
                # halign: obj0 and obj1 is horizontally aligned.
                #
                #   +------+ - - -
                #   | obj0 | - - +------+   -
                #   |      |     | obj1 |   | (line_overlap)
                #   +------+ - - |      |   -
                #          - - - +------+
                #
                #          |<--->|
                #        (char_margin)
                halign = (
                    obj0.is_voverlap(obj1)
                    and min(obj0.height, obj1.height) * laparams.line_overlap
                    < obj0.voverlap(obj1)
                    and obj0.hdistance(obj1)
                    < max(obj0.width, obj1.width) * laparams.char_margin
                )

                # valign: obj0 and obj1 is vertically aligned.
                #
                #   +------+
                #   | obj0 |
                #   |      |
                #   +------+ - - -
                #     |    |     | (char_margin)
                #     +------+ - -
                #     | obj1 |
                #     |      |
                #     +------+
                #
                #     |<-->|
                #   (line_overlap)
                valign = (
                    laparams.detect_vertical
                    and obj0.is_hoverlap(obj1)
                    and min(obj0.width, obj1.width) * laparams.line_overlap
                    < obj0.hoverlap(obj1)
                    and obj0.vdistance(obj1)
                    < max(obj0.height, obj1.height) * laparams.char_margin
                )

                if (halign and isinstance(line, LTTextLineHorizontal)) or (
                    valign and isinstance(line, LTTextLineVertical)
                ):
                    line.add(obj1)
                elif line is not None:
                    yield line
                    line = None
                elif valign and not halign:
                    line = LTTextLineVertical(laparams.word_margin)
                    line.add(obj0)
                    line.add(obj1)
                elif halign and not valign:
                    line = LTTextLineHorizontal(laparams.word_margin)
                    line.add(obj0)
                    line.add(obj1)
                else:
                    line = LTTextLineHorizontal(laparams.word_margin)
                    line.add(obj0)
                    yield line
                    line = None
            obj0 = obj1
        if line is None:
            line = LTTextLineHorizontal(laparams.word_margin)
            assert obj0 is not None
            line.add(obj0)
        yield line

    def group_textlines(
        self,
        laparams: LAParams,
        lines: Iterable[LTTextLine],
    ) -> Iterator[LTTextBox]:
        """Group neighboring lines to textboxes"""
        plane: Plane[LTTextLine] = Plane(self.bbox)
        plane.extend(lines)
        boxes: Dict[int, LTTextBox] = {}
        for line in lines:
            neighbors = line.find_neighbors(plane, laparams.line_margin)
            members = [line]
            for obj1 in neighbors:
                members.append(obj1)
                if id(obj1) in boxes:
                    members.extend(boxes[id(obj1)])
                    del boxes[id(obj1)]
            if isinstance(line, LTTextLineHorizontal):
                box: LTTextBox = LTTextBoxHorizontal()
            else:
                box = LTTextBoxVertical()
            for obj in uniq(members):
                box.add(obj)
                boxes[id(obj)] = box
        done: Set[int] = set()
        for line in lines:
            if id(line) not in boxes:
                continue
            box = boxes[id(line)]
            if id(box) in done:
                continue
            done.add(id(box))
            if not box.is_empty():
                yield box

    def group_textboxes(
        self,
        laparams: LAParams,
        boxes: Sequence[LTTextBox],
    ) -> List[LTTextGroup]:
        """Group textboxes hierarchically.

        Get pair-wise distances, via dist func defined below, and then merge
        from the closest textbox pair. Once obj1 and obj2 are merged /
        grouped, the resulting group is considered as a new object, and its
        distances to other objects & groups are added to the process queue.

        For performance reason, pair-wise distances and object pair info are
        maintained in a heap of (skip_isany, dist, id(obj1), id(obj2), obj1, obj2)
        tuples. It ensures quick access to the smallest element. Note that
        since comparison operators, e.g., __lt__, are disabled for
        LTComponent, id(obj) has to appear before obj in element tuples.

        Args:
          laparams: LAParams object.
          boxes: All textbox objects to be grouped.
        Returns:
          a list that has only one element, the final top level group.
        """
        ElementT = Union[LTTextBox, LTTextGroup]
        plane: Plane[ElementT] = Plane(self.bbox)

        def dist(obj1: LTComponent, obj2: LTComponent) -> float:
            """A distance function between two TextBoxes.

            Consider the bounding rectangle for obj1 and obj2.
            Return its area less the areas of obj1 and obj2,
            shown as 'www' below. This value may be negative.
                    +------+..........+ (x1, y1)
                    | obj1 |wwwwwwwwww:
                    +------+www+------+
                    :wwwwwwwwww| obj2 |
            (x0, y0) +..........+------+
            """
            x0 = min(obj1.x0, obj2.x0)
            y0 = min(obj1.y0, obj2.y0)
            x1 = max(obj1.x1, obj2.x1)
            y1 = max(obj1.y1, obj2.y1)
            return (
                (x1 - x0) * (y1 - y0)
                - obj1.width * obj1.height
                - obj2.width * obj2.height
            )

        def isany(obj1: ElementT, obj2: ElementT) -> bool:
            """Check if there's any other object between obj1 and obj2."""
            x0 = min(obj1.x0, obj2.x0)
            y0 = min(obj1.y0, obj2.y0)
            x1 = max(obj1.x1, obj2.x1)
            y1 = max(obj1.y1, obj2.y1)
            for obj in plane.find((x0, y0, x1, y1)):
                if obj not in (obj1, obj2):
                    break
            else:
                return False
            return True

        # If there's only one box, no grouping need be done, but we
        # should still always return a group!
        if len(boxes) == 1:
            return [LTTextGroup(boxes)]

        dists: List[Tuple[bool, float, int, int, ElementT, ElementT]] = []
        for i in range(len(boxes)):
            box1 = boxes[i]
            for j in range(i + 1, len(boxes)):
                box2 = boxes[j]
                dists.append((False, dist(box1, box2), id(box1), id(box2), box1, box2))
        heapq.heapify(dists)

        plane.extend(boxes)
        done: Set[int] = set()
        while len(dists) > 0:
            (skip_isany, d, id1, id2, obj1, obj2) = heapq.heappop(dists)
            # Skip objects that are already merged
            if (id1 in done) or (id2 in done):
                continue
            if not skip_isany and isany(obj1, obj2):
                heapq.heappush(dists, (True, d, id1, id2, obj1, obj2))
                continue
            if isinstance(obj1, (LTTextBoxVertical, LTTextGroupTBRL)) or isinstance(
                obj2,
                (LTTextBoxVertical, LTTextGroupTBRL),
            ):
                group: LTTextGroup = LTTextGroupTBRL([obj1, obj2])
            else:
                group = LTTextGroupLRTB([obj1, obj2])
            plane.remove(obj1)
            done.add(id1)
            plane.remove(obj2)
            done.add(id2)

            for other in plane:
                heapq.heappush(
                    dists,
                    (False, dist(group, other), id(group), id(other), group, other),
                )
            plane.add(group)
        # The plane should now only contain groups, otherwise it's a bug
        groups: List[LTTextGroup] = []
        for g in plane:
            assert isinstance(g, LTTextGroup)
            groups.append(g)
        return groups

    def analyze(self, laparams: LAParams) -> None:
        # textobjs is a list of LTChar objects, i.e.
        # it has all the individual characters in the page.
        (textobjs, otherobjs) = fsplit(lambda obj: isinstance(obj, LTChar), self)
        for obj in otherobjs:
            obj.analyze(laparams)
        if not textobjs:
            return
        textlines = list(self.group_objects(laparams, textobjs))
        (empties, textlines) = fsplit(lambda obj: obj.is_empty(), textlines)
        for obj in empties:
            obj.analyze(laparams)
        textboxes = list(self.group_textlines(laparams, textlines))
        if laparams.boxes_flow is None:
            for textbox in textboxes:
                textbox.analyze(laparams)

            def getkey(box: LTTextBox) -> Tuple[int, float, float]:
                if isinstance(box, LTTextBoxVertical):
                    return (0, -box.x1, -box.y0)
                else:
                    return (1, -box.y0, box.x0)

            textboxes.sort(key=getkey)
        else:
            self.groups = self.group_textboxes(laparams, textboxes)
            assigner = IndexAssigner()
            for group in self.groups:
                group.analyze(laparams)
                assigner.run(group)
            textboxes.sort(key=lambda box: box.index)
        self._objs = [*textboxes, *otherobjs, *empties]

`group_textboxes(laparams, boxes)`

Group textboxes hierarchically.

Get pair-wise distances, via dist func defined below, and then merge from the closest textbox pair. Once obj1 and obj2 are merged / grouped, the resulting group is considered as a new object, and its distances to other objects & groups are added to the process queue.

For performance reasons, pair-wise distances and object pair info are maintained in a heap of (skip_isany, dist, id(obj1), id(obj2), obj1, obj2) tuples. It ensures quick access to the smallest element. Note that since comparison operators, e.g., `__lt__`, are disabled for LTComponent, id(obj) has to appear before obj in element tuples.

Parameters:

Name	Type	Description	Default
`laparams`	`LAParams`	LAParams object.	required
`boxes`	`Sequence[LTTextBox]`	All textbox objects to be grouped.	required

Returns: a list that has only one element, the final top level group.
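
As a rough illustration of the distance measure and the heap described above, the following sketch works on plain `(x0, y0, x1, y1)` tuples instead of real textbox objects. The names `Rect`, `area` and `bbox_dist` are assumptions made for this example and are not part of PLAYA; the sketch also leaves out the check for intervening objects and the merging step.

```python
import heapq
from typing import List, Tuple

# Hypothetical stand-in for a textbox: just its (x0, y0, x1, y1) bounding box.
Rect = Tuple[float, float, float, float]


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def bbox_dist(a: Rect, b: Rect) -> float:
    """Area of the joint bounding rectangle not covered by a or b."""
    x0, y0 = min(a[0], b[0]), min(a[1], b[1])
    x1, y1 = max(a[2], b[2]), max(a[3], b[3])
    return (x1 - x0) * (y1 - y0) - area(a) - area(b)


boxes: List[Rect] = [
    (0.0, 0.0, 10.0, 10.0),
    (10.0, 0.0, 20.0, 10.0),  # touches the first box
    (30.0, 0.0, 40.0, 10.0),  # further away
]

# Only distances and integer indices go into the heap, so the boxes
# themselves never need to be compared with each other.
heap = [
    (bbox_dist(boxes[i], boxes[j]), i, j)
    for i in range(len(boxes))
    for j in range(i + 1, len(boxes))
]
heapq.heapify(heap)
closest = heapq.heappop(heap)  # the pair that would be merged first
```

Two overlapping boxes produce a negative value with this measure, which is why the distance is not guaranteed to be non-negative.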

Source code in playa/miner.py

def group_textboxes(
    self,
    laparams: LAParams,
    boxes: Sequence[LTTextBox],
) -> List[LTTextGroup]:
    """Group textboxes hierarchically.

    Get pair-wise distances, via dist func defined below, and then merge
    from the closest textbox pair. Once obj1 and obj2 are merged /
    grouped, the resulting group is considered as a new object, and its
    distances to other objects & groups are added to the process queue.

    For performance reasons, pair-wise distances and object pair info are
    maintained in a heap of (skip_isany, dist, id(obj1), id(obj2), obj1,
    obj2) tuples. It ensures quick access to the smallest element. Note
    that since comparison operators, e.g., __lt__, are disabled for
    LTComponent, id(obj) has to appear before obj in element tuples.

    Args:
      laparams: LAParams object.
      boxes: All textbox objects to be grouped.
    Returns:
      a list that has only one element, the final top level group.
    """
    ElementT = Union[LTTextBox, LTTextGroup]
    plane: Plane[ElementT] = Plane(self.bbox)

    def dist(obj1: LTComponent, obj2: LTComponent) -> float:
        """A distance function between two TextBoxes.

        Consider the bounding rectangle for obj1 and obj2.
        Return its area less the areas of obj1 and obj2,
        shown as 'www' below. This value may be negative.
                +------+..........+ (x1, y1)
                | obj1 |wwwwwwwwww:
                +------+www+------+
                :wwwwwwwwww| obj2 |
        (x0, y0) +..........+------+
        """
        x0 = min(obj1.x0, obj2.x0)
        y0 = min(obj1.y0, obj2.y0)
        x1 = max(obj1.x1, obj2.x1)
        y1 = max(obj1.y1, obj2.y1)
        return (
            (x1 - x0) * (y1 - y0)
            - obj1.width * obj1.height
            - obj2.width * obj2.height
        )

    def isany(obj1: ElementT, obj2: ElementT) -> bool:
        """Check if there's any other object between obj1 and obj2."""
        x0 = min(obj1.x0, obj2.x0)
        y0 = min(obj1.y0, obj2.y0)
        x1 = max(obj1.x1, obj2.x1)
        y1 = max(obj1.y1, obj2.y1)
        for obj in plane.find((x0, y0, x1, y1)):
            if obj not in (obj1, obj2):
                break
        else:
            return False
        return True

    # If there's only one box, no grouping need be done, but we
    # should still always return a group!
    if len(boxes) == 1:
        return [LTTextGroup(boxes)]

    dists: List[Tuple[bool, float, int, int, ElementT, ElementT]] = []
    for i in range(len(boxes)):
        box1 = boxes[i]
        for j in range(i + 1, len(boxes)):
            box2 = boxes[j]
            dists.append((False, dist(box1, box2), id(box1), id(box2), box1, box2))
    heapq.heapify(dists)

    plane.extend(boxes)
    done: Set[int] = set()
    while len(dists) > 0:
        (skip_isany, d, id1, id2, obj1, obj2) = heapq.heappop(dists)
        # Skip objects that are already merged
        if (id1 in done) or (id2 in done):
            continue
        if not skip_isany and isany(obj1, obj2):
            heapq.heappush(dists, (True, d, id1, id2, obj1, obj2))
            continue
        if isinstance(obj1, (LTTextBoxVertical, LTTextGroupTBRL)) or isinstance(
            obj2,
            (LTTextBoxVertical, LTTextGroupTBRL),
        ):
            group: LTTextGroup = LTTextGroupTBRL([obj1, obj2])
        else:
            group = LTTextGroupLRTB([obj1, obj2])
        plane.remove(obj1)
        done.add(id1)
        plane.remove(obj2)
        done.add(id2)

        for other in plane:
            heapq.heappush(
                dists,
                (False, dist(group, other), id(group), id(other), group, other),
            )
        plane.add(group)
    # The plane should now only contain groups, otherwise it's a bug
    groups: List[LTTextGroup] = []
    for g in plane:
        assert isinstance(g, LTTextGroup)
        groups.append(g)
    return groups

`group_textlines(laparams, lines)`

Group neighboring lines to textboxes
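
The merging idea can be sketched in one dimension, assuming each line is represented only by a unique vertical position and that two lines are neighbors when they are within `margin` of each other. The function `group_lines` is hypothetical; the real method uses a plane lookup and the alignment checks of `find_neighbors` instead of this simple distance test.

```python
from typing import Dict, List


def group_lines(ys: List[int], margin: int) -> List[List[int]]:
    """Group line positions into boxes by repeatedly merging neighbors."""
    boxes: Dict[int, List[int]] = {}
    for y in ys:
        members = [y]
        for other in ys:
            if other != y and abs(other - y) <= margin:
                members.append(other)
                # If the neighbor already belongs to a box, absorb that box.
                if other in boxes:
                    members.extend(boxes.pop(other))
        box = list(dict.fromkeys(members))  # de-duplicate, keeping order
        for member in box:
            boxes[member] = box
    # Emit each distinct box once, in the order its lines were given.
    seen = set()
    result: List[List[int]] = []
    for y in ys:
        box = boxes[y]
        if id(box) not in seen:
            seen.add(id(box))
            result.append(box)
    return result


grouped = group_lines([100, 90, 80, 50, 40], margin=10)
```

With these inputs the first three positions end up in one box and the last two in another, because the gap between 80 and 50 exceeds the margin.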

Source code in playa/miner.py

def group_textlines(
    self,
    laparams: LAParams,
    lines: Iterable[LTTextLine],
) -> Iterator[LTTextBox]:
    """Group neighboring lines to textboxes"""
    plane: Plane[LTTextLine] = Plane(self.bbox)
    plane.extend(lines)
    boxes: Dict[int, LTTextBox] = {}
    for line in lines:
        neighbors = line.find_neighbors(plane, laparams.line_margin)
        members = [line]
        for obj1 in neighbors:
            members.append(obj1)
            if id(obj1) in boxes:
                members.extend(boxes[id(obj1)])
                del boxes[id(obj1)]
        if isinstance(line, LTTextLineHorizontal):
            box: LTTextBox = LTTextBoxHorizontal()
        else:
            box = LTTextBoxVertical()
        for obj in uniq(members):
            box.add(obj)
            boxes[id(obj)] = box
    done: Set[int] = set()
    for line in lines:
        if id(line) not in boxes:
            continue
        box = boxes[id(line)]
        if id(box) in done:
            continue
        done.add(id(box))
        if not box.is_empty():
            yield box

`LTLine`

Bases: LTCurve

A single straight line.

Could be used for separating text or figures.

Source code in playa/miner.py

class LTLine(LTCurve):
    """A single straight line.

    Could be used for separating text or figures.
    """

    def __init__(
        self,
        path: Union[PathObject, None] = None,
        p0: Point = (0, 0),
        p1: Point = (0, 0),
        transformed_path: List[PathSegment] = [],
    ) -> None:
        if path is None:
            # No initialization, for pickling purposes
            return
        LTCurve.__init__(
            self,
            path,
            [p0, p1],
            transformed_path,
        )

`LTPage`

Bases: LTLayoutContainer

Represents an entire page.

Like any other LTLayoutContainer, an LTPage can be iterated to obtain child objects like LTTextBox, LTFigure, LTImage, LTRect, LTCurve and LTLine.

Source code in playa/miner.py

class LTPage(LTLayoutContainer):
    """Represents an entire page.

    Like any other LTLayoutContainer, an LTPage can be iterated to obtain child
    objects like LTTextBox, LTFigure, LTImage, LTRect, LTCurve and LTLine.
    """

    def __init__(self, pageid: int, bbox: Rect, rotate: float = 0) -> None:
        super().__init__(bbox, ())
        self.pageid = pageid
        self.rotate = rotate

    def __repr__(self) -> str:
        return (
            f"<{self.__class__.__name__}({self.pageid!r}) "
            f"{bbox2str(self.bbox)} rotate={self.rotate!r}>"
        )

`LTRect`

Bases: LTCurve

A rectangle.

Could be used for framing other pictures or figures.

Source code in playa/miner.py

class LTRect(LTCurve):
    """A rectangle.

    Could be used for framing other pictures or figures.
    """

    def __init__(
        self,
        path: Union[PathObject, None] = None,
        bbox: Rect = (0, 0, 0, 0),
        transformed_path: List[PathSegment] = [],
    ) -> None:
        if path is None:
            # No initialization, for pickling purposes
            return
        (x0, y0, x1, y1) = bbox
        LTCurve.__init__(
            self,
            path,
            [(x0, y0), (x1, y0), (x1, y1), (x0, y1)],
            transformed_path,
        )

`LTText`

Interface for things that have text

Source code in playa/miner.py

@trait
class LTText:
    """Interface for things that have text"""

    def __repr__(self) -> str:
        return f"<{self.__class__.__name__} {self.get_text()!r}>"

    def get_text(self) -> str:
        """Text contained in this object"""
        raise NotImplementedError

`get_text()`

Text contained in this object

Source code in playa/miner.py

def get_text(self) -> str:
    """Text contained in this object"""
    raise NotImplementedError

`LTTextBox`

Bases: LTTextContainer

Represents a group of text chunks in a rectangular area.

Note that this box is created by geometric analysis and does not necessarily represent a logical boundary of the text. It contains a list of LTTextLine objects.

Source code in playa/miner.py

class LTTextBox(LTTextContainer):
    """Represents a group of text chunks in a rectangular area.

    Note that this box is created by geometric analysis and does not
    necessarily represent a logical boundary of the text. It contains a list
    of LTTextLine objects.
    """

    def __init__(self) -> None:
        super().__init__()
        self.index: int = -1

    def __repr__(self) -> str:
        return (
            f"<{self.__class__.__name__}({self.index}) "
            f"{bbox2str(self.bbox)} {self.get_text()!r}>"
        )

    def get_writing_mode(self) -> str:
        raise NotImplementedError

`LTTextLine`

Bases: LTTextContainer

Contains a list of LTChar objects that represent a single text line.

The characters are aligned either horizontally or vertically, depending on the text's writing mode.

Source code in playa/miner.py

class LTTextLine(LTTextContainer):
    """Contains a list of LTChar objects that represent a single text line.

    The characters are aligned either horizontally or vertically, depending on
    the text's writing mode.
    """

    def __init__(self, word_margin: float = 0.0) -> None:
        super().__init__()
        self.word_margin = word_margin

    def __repr__(self) -> str:
        return f"<{self.__class__.__name__} {bbox2str(self.bbox)} {self.get_text()!r}>"

    def analyze(self, laparams: LAParams) -> None:
        for obj in self._objs:
            obj.analyze(laparams)
        # FIXME: Should probably inherit mcstack somehow
        LTContainer.add(self, LTAnno("\n"))

    def find_neighbors(
        self,
        plane: Plane[LTComponentT],
        ratio: float,
    ) -> List["LTTextLine"]:
        raise NotImplementedError

    def is_empty(self) -> bool:
        return super().is_empty() or self.get_text().isspace()

`LTTextLineHorizontal`

Bases: LTTextLine

Source code in playa/miner.py

class LTTextLineHorizontal(LTTextLine):
    def __init__(self, word_margin: float = 0.0) -> None:
        super().__init__(word_margin)
        self._x1 = +INF + 0.0

    # Incompatible override: we take an LTComponent (with bounding box), but
    # LTContainer only considers LTItem (no bounding box).
    def add(self, obj: LTComponent) -> None:  # type: ignore[override]
        if isinstance(obj, LTChar) and self.word_margin:
            margin = self.word_margin * max(obj.width, obj.height)
            if self._x1 < obj.x0 - margin:
                # FIXME: Should probably inherit mcstack somehow
                LTContainer.add(self, LTAnno(" "))
        self._x1 = obj.x1
        super().add(obj)

    def find_neighbors(
        self,
        plane: Plane[LTComponentT],
        ratio: float,
    ) -> List[LTTextLine]:
        """Finds neighboring LTTextLineHorizontals in the plane.

        Returns a list of other LTTextLineHorizontals in the plane which are
        close to self. "Close" can be controlled by ratio. The returned objects
        will be the same height as self, and also either left-, right-, or
        centrally-aligned.
        """
        d = ratio * self.height
        objs = plane.find((self.x0, self.y0 - d, self.x1, self.y1 + d))
        return [
            obj
            for obj in objs
            if (
                isinstance(obj, LTTextLineHorizontal)
                and self._is_same_height_as(obj, tolerance=d)
                and (
                    self._is_left_aligned_with(obj, tolerance=d)
                    or self._is_right_aligned_with(obj, tolerance=d)
                    or self._is_centrally_aligned_with(obj, tolerance=d)
                )
            )
        ]

    def _is_left_aligned_with(self, other: LTComponent, tolerance: float = 0.0) -> bool:
        """Whether the left-hand edge of `other` is within `tolerance`."""
        return abs(other.x0 - self.x0) <= tolerance

    def _is_right_aligned_with(
        self, other: LTComponent, tolerance: float = 0.0
    ) -> bool:
        """Whether the right-hand edge of `other` is within `tolerance`."""
        return abs(other.x1 - self.x1) <= tolerance

    def _is_centrally_aligned_with(
        self,
        other: LTComponent,
        tolerance: float = 0,
    ) -> bool:
        """Whether the horizontal center of `other` is within `tolerance`."""
        return abs((other.x0 + other.x1) / 2 - (self.x0 + self.x1) / 2) <= tolerance

    def _is_same_height_as(self, other: LTComponent, tolerance: float = 0) -> bool:
        return abs(other.height - self.height) <= tolerance

`find_neighbors(plane, ratio)`

Finds neighboring LTTextLineHorizontals in the plane.

Returns a list of other LTTextLineHorizontals in the plane which are close to self. "Close" can be controlled by ratio. The returned objects will be the same height as self, and also either left-, right-, or centrally-aligned.
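
The criterion can be sketched on bare bounding boxes. Here `Line` and `is_neighbor` are hypothetical names, and the overlap test is an assumed stand-in for the plane lookup; the tolerance `d = ratio * height` and the three alignment checks follow the description above.

```python
from typing import NamedTuple


class Line(NamedTuple):
    """Hypothetical stand-in for a horizontal text line's bounding box."""

    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def is_neighbor(a: Line, b: Line, ratio: float) -> bool:
    d = ratio * a.height
    # Assumed overlap test standing in for the plane lookup: b must intersect
    # a's bounding box after it is extended by d above and below.
    overlaps = b.x1 > a.x0 and b.x0 < a.x1 and b.y1 > a.y0 - d and b.y0 < a.y1 + d
    same_height = abs(b.height - a.height) <= d
    aligned = (
        abs(b.x0 - a.x0) <= d  # left edges
        or abs(b.x1 - a.x1) <= d  # right edges
        or abs((b.x0 + b.x1) / 2 - (a.x0 + a.x1) / 2) <= d  # centers
    )
    return overlaps and same_height and aligned


title = Line(0, 100, 200, 110)
below_left = Line(0, 88, 180, 98)  # close and left-aligned
below_indent = Line(50, 88, 120, 98)  # close, but not aligned
far_below = Line(0, 60, 200, 70)  # aligned, but too far away
```

With `ratio=0.5` the tolerance is 5 units, so only `below_left` counts as a neighbor of `title`.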

Source code in playa/miner.py

def find_neighbors(
    self,
    plane: Plane[LTComponentT],
    ratio: float,
) -> List[LTTextLine]:
    """Finds neighboring LTTextLineHorizontals in the plane.

    Returns a list of other LTTextLineHorizontals in the plane which are
    close to self. "Close" can be controlled by ratio. The returned objects
    will be the same height as self, and also either left-, right-, or
    centrally-aligned.
    """
    d = ratio * self.height
    objs = plane.find((self.x0, self.y0 - d, self.x1, self.y1 + d))
    return [
        obj
        for obj in objs
        if (
            isinstance(obj, LTTextLineHorizontal)
            and self._is_same_height_as(obj, tolerance=d)
            and (
                self._is_left_aligned_with(obj, tolerance=d)
                or self._is_right_aligned_with(obj, tolerance=d)
                or self._is_centrally_aligned_with(obj, tolerance=d)
            )
        )
    ]

`LTTextLineVertical`

Bases: LTTextLine

Source code in playa/miner.py

class LTTextLineVertical(LTTextLine):
    def __init__(self, word_margin: float = 0.0) -> None:
        super().__init__(word_margin)
        self._y0: float = -INF + 0.0

    # Incompatible override: we take an LTComponent (with bounding box), but
    # LTContainer only considers LTItem (no bounding box).
    def add(self, obj: LTComponent) -> None:  # type: ignore[override]
        if isinstance(obj, LTChar) and self.word_margin:
            margin = self.word_margin * max(obj.width, obj.height)
            if obj.y1 + margin < self._y0:
                # FIXME: Should probably inherit mcstack somehow
                LTContainer.add(self, LTAnno(" "))
        self._y0 = obj.y0
        super().add(obj)

    def find_neighbors(
        self,
        plane: Plane[LTComponentT],
        ratio: float,
    ) -> List[LTTextLine]:
        """Finds neighboring LTTextLineVerticals in the plane.

        Returns a list of other LTTextLineVerticals in the plane which are
        close to self. "Close" can be controlled by ratio. The returned objects
        will be the same width as self, and also either upper-, lower-, or
        centrally-aligned.
        """
        d = ratio * self.width
        objs = plane.find((self.x0 - d, self.y0, self.x1 + d, self.y1))
        return [
            obj
            for obj in objs
            if (
                isinstance(obj, LTTextLineVertical)
                and self._is_same_width_as(obj, tolerance=d)
                and (
                    self._is_lower_aligned_with(obj, tolerance=d)
                    or self._is_upper_aligned_with(obj, tolerance=d)
                    or self._is_centrally_aligned_with(obj, tolerance=d)
                )
            )
        ]

    def _is_lower_aligned_with(self, other: LTComponent, tolerance: float = 0) -> bool:
        """Whether the lower edge of `other` is within `tolerance`."""
        return abs(other.y0 - self.y0) <= tolerance

    def _is_upper_aligned_with(self, other: LTComponent, tolerance: float = 0) -> bool:
        """Whether the upper edge of `other` is within `tolerance`."""
        return abs(other.y1 - self.y1) <= tolerance

    def _is_centrally_aligned_with(
        self,
        other: LTComponent,
        tolerance: float = 0,
    ) -> bool:
        """Whether the vertical center of `other` is within `tolerance`."""
        return abs((other.y0 + other.y1) / 2 - (self.y0 + self.y1) / 2) <= tolerance

    def _is_same_width_as(self, other: LTComponent, tolerance: float) -> bool:
        return abs(other.width - self.width) <= tolerance

`find_neighbors(plane, ratio)`

Finds neighboring LTTextLineVerticals in the plane.

Returns a list of other LTTextLineVerticals in the plane which are close to self. "Close" can be controlled by ratio. The returned objects will be the same width as self, and also either upper-, lower-, or centrally-aligned.

Source code in playa/miner.py

def find_neighbors(
    self,
    plane: Plane[LTComponentT],
    ratio: float,
) -> List[LTTextLine]:
    """Finds neighboring LTTextLineVerticals in the plane.

    Returns a list of other LTTextLineVerticals in the plane which are
    close to self. "Close" can be controlled by ratio. The returned objects
    will be the same width as self, and also either upper-, lower-, or
    centrally-aligned.
    """
    d = ratio * self.width
    objs = plane.find((self.x0 - d, self.y0, self.x1 + d, self.y1))
    return [
        obj
        for obj in objs
        if (
            isinstance(obj, LTTextLineVertical)
            and self._is_same_width_as(obj, tolerance=d)
            and (
                self._is_lower_aligned_with(obj, tolerance=d)
                or self._is_upper_aligned_with(obj, tolerance=d)
                or self._is_centrally_aligned_with(obj, tolerance=d)
            )
        )
    ]

`NameTree`

Bases: Mapping[bytes, PDFObject]

A PDF name tree.

See Section 7.9.6 of the PDF 1.7 Reference.

Raises:

Type	Description
`TypeError`	If initialized with a non-dictionary.
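
For readers unfamiliar with the structure, here is a minimal sketch of walking a name tree held in plain Python dictionaries. It assumes the layout from the PDF specification: leaf nodes hold a `Names` array of alternating keys and values, inner nodes hold `Kids`, and non-root nodes carry `Limits` with their smallest and largest key. The function `walk` is a hypothetical illustration, not PLAYA's `walk_name_tree`, and it ignores indirect object references.

```python
from typing import Any, Dict, Iterator, Optional, Tuple


def walk(
    node: Dict[str, Any], key: Optional[bytes] = None
) -> Iterator[Tuple[bytes, Any]]:
    """Yield (name, value) pairs, skipping subtrees that cannot contain key."""
    if key is not None and "Limits" in node:
        lowest, highest = node["Limits"]
        if not (lowest <= key <= highest):
            return
    names = node.get("Names", [])
    # Names is a flat array: [key1, value1, key2, value2, ...]
    for i in range(0, len(names) - 1, 2):
        yield names[i], names[i + 1]
    for kid in node.get("Kids", []):
        yield from walk(kid, key)


tree = {
    "Kids": [
        {"Limits": [b"alpha", b"beta"], "Names": [b"alpha", 1, b"beta", 2]},
        {"Limits": [b"gamma", b"omega"], "Names": [b"gamma", 3, b"omega", 4]},
    ]
}
all_names = [name for name, _ in walk(tree)]
found = [value for name, value in walk(tree, b"gamma") if name == b"gamma"]
```

Passing a key lets the walk skip the first child entirely, since `b"gamma"` falls outside its limits.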

Source code in playa/data_structures.py

class NameTree(Mapping[bytes, PDFObject]):
    """A PDF name tree.

    See Section 7.9.6 of the PDF 1.7 Reference.

    Raises:
        TypeError: If initialized with a non-dictionary.
    """

    def __init__(self, obj: PDFObject):
        self._obj = dict_value(obj)

    def __len__(self) -> int:
        return sum(1 for _ in self)

    def __iter__(self) -> Iterator[bytes]:
        for name, _ in walk_name_tree(self._obj):
            yield name

    def __getitem__(self, key: bytes) -> PDFObject:
        for name, val in walk_name_tree(self._obj, key):
            if name == key:
                return val
        raise KeyError("Name %r not in tree" % key)

    def items(self) -> NameTreeItemsView:
        return NameTreeItemsView(self)

`NumberTree`

Bases: Mapping[int, PDFObject]

A PDF number tree.

See Section 7.9.7 of the PDF 1.7 Reference.

Raises:

Type	Description
`TypeError`	If initialized with a non-dictionary.
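
Because the class derives from `Mapping`, it only has to supply `__getitem__`, `__iter__` and `__len__`; membership tests, `get()` and `items()` then work on top of those. The sketch below shows that pattern with a hypothetical `PairMapping` backed by a flat list of pairs rather than a real number tree.

```python
from collections.abc import Mapping
from typing import Any, Iterator, List, Tuple


class PairMapping(Mapping):
    """Hypothetical read-only mapping over (number, value) pairs."""

    def __init__(self, pairs: List[Tuple[int, Any]]) -> None:
        self._pairs = pairs

    def __iter__(self) -> Iterator[int]:
        for idx, _ in self._pairs:
            yield idx

    def __len__(self) -> int:
        # Counted by iteration, so no separate length needs to be stored.
        return sum(1 for _ in self)

    def __getitem__(self, num: int) -> Any:
        for idx, val in self._pairs:
            if idx == num:
                return val
        raise KeyError(f"Number {num} not in tree")


labels = PairMapping([(0, "i"), (4, "1"), (10, "A-1")])
has_four = 4 in labels  # provided by Mapping via __getitem__
missing = labels.get(7)  # provided by Mapping, returns None on KeyError
```

Lookups in this sketch are linear scans, and the length is obtained by iterating over every entry.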

Source code in playa/data_structures.py

class NumberTree(Mapping[int, PDFObject]):
    """A PDF number tree.

    See Section 7.9.7 of the PDF 1.7 Reference.

    Raises:
        TypeError: If initialized with a non-dictionary.
    """

    def __init__(self, obj: PDFObject):
        self._obj = dict_value(obj)

    def __len__(self) -> int:
        return sum(1 for _ in self)

    def __iter__(self) -> Iterator[int]:
        for idx, _ in walk_number_tree(self._obj):
            yield idx

    def __getitem__(self, num: int) -> PDFObject:
        for idx, val in walk_number_tree(self._obj, num):
            if idx == num:
                return val
        raise KeyError(f"Number {num} not in tree")

    def items(self) -> NumberTreeItemsView:
        return NumberTreeItemsView(self)

`PDFDocument`

Bases: Mapping[int, PDFObject]

Representation of a PDF document.

PDF documents, at a basic level, are collections of indirect objects with numeric IDs. Since these IDs are sparse, and do not need to be ordered, this is best represented as a mapping of int to PDFObject. The specification also provides for "generation numbers" which can be used to track successive revisions of the same object. In practice very few PDFs actually do this, so the most recent generation of an object is accessible simply by its object ID. To find other objects, use the objects property.

Since PDF documents can be very large and complex, merely creating a Document does very little aside from verifying that the password is correct and getting a minimal amount of metadata. In general, PLAYA will try to open just about anything as a PDF, so you should not expect the constructor to fail here if you give it nonsense (something else may fail later on).

Some metadata, such as the structure tree and page tree, will be loaded lazily and cached. We do not handle modification of PDFs.

Parameters:

Name	Type	Description	Default
`fp`	`Union[BinaryIO, bytes]`	File-like object in binary mode, or a buffer with binary data. Files will be read using `mmap` if possible. They do not need to be seekable: if `mmap` fails, the entire file will simply be read into memory (so a pipe or socket ought to work).	required
`password`	`str`	Password for decryption, if needed.	`''`
`space`	`DeviceSpace`	the device space to use for interpreting content ("screen" or "page")	`'screen'`

Raises:

Type	Description
`TypeError`	if `fp` is a file opened in text mode (don't do that!)
`PDFEncryptionError`	if the PDF has an unsupported encryption scheme
`PDFPasswordIncorrect`	if the password is incorrect
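
The `TypeError` case can be reproduced in isolation. The helper `require_binary` below is a hypothetical name that mirrors the kind of `io.TextIOBase` check the constructor performs; it uses only the standard library and does not open a real PDF.

```python
import io
from typing import Any


def require_binary(fp: Any) -> None:
    """Reject text-mode file objects, which cannot supply raw PDF bytes."""
    if isinstance(fp, io.TextIOBase):
        raise TypeError("fp is not a binary file")


# A binary buffer passes the check.
require_binary(io.BytesIO(b"%PDF-1.7"))

# A text-mode object (like the result of open(path, "r")) is rejected.
rejected = False
try:
    require_binary(io.StringIO("%PDF-1.7"))
except TypeError:
    rejected = True
```

In practice this means opening files with mode `"rb"` before handing them over.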

Source code in playa/document.py

class Document(Mapping[int, PDFObject]):
    """Representation of a PDF document.

    PDF documents, at a basic level, are collections of indirect
    objects with numeric IDs.  Since these IDs are sparse, and do not
    need to be ordered, this is best represented as a mapping of `int`
    to `PDFObject`.  The specification also provides for "generation
    numbers" which can be used to track successive revisions of the
    same object.  In practice very few PDFs actually do this, so the
    most recent generation of an object is accessible simply by its
    object ID.  To find other objects, use the `objects` property.

    Since PDF documents can be very large and complex, merely creating
    a `Document` does very little aside from verifying that the
    password is correct and getting a minimal amount of metadata.  In
    general, PLAYA will try to open just about anything as a PDF, so
    you should not expect the constructor to fail here if you give it
    nonsense (something else may fail later on).

    Some metadata, such as the structure tree and page tree, will be
    loaded lazily and cached.  We do not handle modification of PDFs.

    Args:
      fp: File-like object in binary mode, or a buffer with binary data.
          Files will be read using `mmap` if possible.  They do not need
          to be seekable: if `mmap` fails, the entire file will simply
          be read into memory (so a pipe or socket ought to work).
      password: Password for decryption, if needed.
      space: the device space to use for interpreting content ("screen"
          or "page")

    Raises:
      TypeError: if `fp` is a file opened in text mode (don't do that!)
      PDFEncryptionError: if the PDF has an unsupported encryption scheme
      PDFPasswordIncorrect: if the password is incorrect

    """

    trailer: Dict[str, PDFObject]
    info: Dict[str, PDFObject]
    buffer: Union[bytes, mmap.mmap]  # FIXME: abstract type for this?
    space: DeviceSpace
    encryption: Union[Tuple[Tuple[bytes, bytes], Dict], None] = None
    decipher: Union[DecipherCallable, None] = None

    _fp: Union[BinaryIO, None] = None
    _pages: Union["PageList", None] = None
    _pool: Union[Executor, None] = None
    _catalog: Union[Dict[str, PDFObject], None] = None
    _outline: Union["Outlines", None] = None
    _destinations: Union["Destinations", None] = None
    _structure: Union["Tree", None] = None
    _fontmap: Union[Mapping[str, Font], None] = None
    _parser: Union[IndirectObjectParser, None] = None
    _xrefs: Union[List[XRef], None] = None
    _trailer_pos = -1
    _startxref_pos = -1

    def __enter__(self) -> "Document":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        self.close()

    def close(self) -> None:
        # If we were opened from a file then close it
        if self._fp:
            self._fp.close()
            self._fp = None
        # Shutdown process pool
        if self._pool:
            self._pool.shutdown()
            self._pool = None

    def __init__(
        self,
        fp: Union[BinaryIO, bytes],
        password: str = "",
        space: DeviceSpace = "screen",
        _boss_id: int = 0,
    ) -> None:
        # Get this out of the way, eh
        if isinstance(fp, io.TextIOBase):
            raise TypeError("fp is not a binary file")

        if _boss_id:
            # Set this **right away** because it is needed to get
            # indirect object references right.
            _set_document(self, _boss_id)
            assert in_worker()

        # Initialize mutable properties
        self.space = space
        self.info = {}
        self._cached_objs: Dict[int, PDFObject] = {}
        self._parsed_objs: Dict[int, Tuple[List[PDFObject], int]] = {}
        self._cached_fonts: Dict[int, Font] = {}
        self._cached_inline_images: Dict[
            Tuple[int, int], Tuple[int, Union[InlineImage, None]]
        ] = {}
        self._pdf_version, self._offset, self.buffer = _open_input(fp)
        # These are always True unless "encryption" (lol) is present
        self.is_printable = self.is_modifiable = self.is_extractable = True
        # We are Lazy, only find and read the trailer.
        self.trailer = self._read_trailer()
        # If there is encryption, then we need to read xref tables.
        # Otherwise we will defer this to the first object lookup.
        if "Encrypt" in self.trailer:
            self._xrefs = self._read_xrefs()
            try:
                ids = list_value(self.trailer["ID"])
                id_value = (bytes(ids[0]), bytes(ids[1]))
            except (KeyError, TypeError):
                # Some documents may not have a /ID, use two empty
                # byte strings instead. Solves
                # https://github.com/pdfminer/pdfminer.six/issues/594
                id_value = (b"", b"")
            encrypt = dict_value(self.trailer["Encrypt"])
            self.encryption = (id_value, encrypt)
            self._initialize_password(password)

    def _read_trailer(self) -> Dict[str, Any]:
        # To read the trailer, we must first find the trailer, which
        # is supposed to be at the end of the file, immediately before
        # the "startxref" keyword and after a "trailer" keyword.  This,
        # like so many other things in the PDF standard, is a cruel
        # lie, because:
        #
        # 1. If the file only contains cross-reference streams, there
        #    is no "trailer" keyword, and the trailer is the stream
        #    dictionary (which could be anywhere in the file)
        # 2. If the file is a linearized PDF, there *is* a trailer at
        #    the end of the file, but it's a bogus one that only
        #    contains /Size. The real trailer is after the main xref
        #    table (or stream) pointed to by the "startxref" value.
        # 3. Of course nobody understood these rules and so you might
        #    find the trailer in various other places.  Also, the
        #    "startxref" value is probably wrong.
        end = len(self.buffer)
        indobj = -1
        for pos in range(len(self.buffer) - 1, -2, -1):
            if pos == -1 or self.buffer[pos] in NOTKEYWORD:
                token = self.buffer[pos + 1 : end]
                if token == b"startxref":
                    try:
                        _, val = next(ObjectParser(self.buffer, pos=end))
                    except StopIteration:
                        continue
                    self._startxref_pos = int_value(val)
                    self._startxref_pos += self._offset
                    # If this is an xref stream, then its dictionary
                    # is the trailer.
                    if m := INDOBJR.match(self.buffer, self._startxref_pos):
                        self._trailer_pos = m.end(0)
                        break
                    # If this is a normal xref table, then look for a
                    # trailer after it, which will be the correct one
                    # to use (because linearization)
                    if m := XREFR.match(self.buffer, self._startxref_pos):
                        self._trailer_pos = self.buffer.find(
                            b"trailer", self._startxref_pos
                        )
                        if self._trailer_pos != -1:
                            self._trailer_pos += 7
                            break
                if token == b"trailer":
                    # We continued to scan backwards and found a trailer
                    self._trailer_pos = end
                if token == b"obj":
                    # We continued to scan backwards and found an
                    # indirect object, which may or may not be the
                    # cross-reference stream.
                    if self._trailer_pos == -1:
                        self._trailer_pos = end
                    indobj = 0
                if token == b"xref":
                    # We continued to scan backwards and found an xref table
                    self._startxref_pos = pos + 1
                    break
                if indobj != -1 and ord("0") <= token[0] <= ord("9"):
                    # We are in the abovementioned indirect object
                    indobj += 1
                    if indobj == 2:
                        self._startxref_pos = pos + 1
                        break
                end = pos
        self._trailer_pos, trailer = next(
            ObjectParser(self.buffer, pos=self._trailer_pos, doc=self)
        )
        if not isinstance(trailer, dict):
            raise PDFSyntaxError(f"Trailer is not a dict: {trailer!r}")
        if indobj == 2 and trailer.get("Type") != LITERAL_XREF:
            # Either it's just the trailer (no problem) and there's no
            # xref table, or it's some other random indirect object.
            self._startxref_pos = -1
        return trailer

    def _initialize_password(self, password: str = "") -> None:
        """Initialize the decryption handler with a given password, if any.

        Internal function, requires the Encrypt dictionary to have
        been read from the trailer into self.encryption.
        """
        assert self.encryption is not None
        (docid, param) = self.encryption
        if literal_name(param.get("Filter")) != "Standard":
            raise PDFEncryptionError("Unknown filter: param=%r" % param)
        v = int_value(param.get("V", 0))
        # 3 (PDF 1.4) An unpublished algorithm that permits encryption
        # key lengths ranging from 40 to 128 bits. This value shall
        # not appear in a conforming PDF file.
        if v == 3:
            raise PDFEncryptionError("Unpublished algorithm 3 not supported")
        factory = SECURITY_HANDLERS.get(v)
        # 0 An algorithm that is undocumented. This value shall not be used.
        if factory is None:
            raise PDFEncryptionError("Unknown algorithm: param=%r" % param)
        handler = factory(docid, param, password)
        self.decipher = handler.decrypt
        self.is_printable = handler.is_printable
        self.is_modifiable = handler.is_modifiable
        self.is_extractable = handler.is_extractable
        # Ensure that no extra data leaks into encrypted streams
        self.parser.strict = True
        self.parser.decipher = self.decipher

    @property
    def parser(self) -> IndirectObjectParser:
        if self._parser is not None:
            return self._parser
        self._parser = IndirectObjectParser(self.buffer, doc=self)
        self._parser.seek(self._offset)
        return self._parser

    @property
    def xrefs(self) -> List[XRef]:
        if self._xrefs is not None:
            return self._xrefs
        self._xrefs = self._read_xrefs()
        size = sum(len(x) for x in self._xrefs)
        self.trailer["Size"] = size
        log.debug("Updated /Size in trailer to %d", size)
        return self._xrefs

    def _read_xrefs(self) -> List[XRef]:
        if self._startxref_pos == -1:
            log.warning("startxref was not found, falling back to object parser")
            return [XRefFallback(self)]
        self._xrefpos: Set[int] = set()
        xrefs: List[XRef] = []
        try:
            self._read_xrefs_into(self._startxref_pos, xrefs)
            return xrefs
        except (ValueError, IndexError, StopIteration, PDFSyntaxError) as e:
            log.warning("xref parsing failed, falling back to object parser: %s", e)
            return [XRefFallback(self)]

    def _read_xrefs_into(
        self,
        start: int,
        xrefs: List[XRef],
    ) -> None:
        """Reads XRefs from the given location."""
        if start in self._xrefpos:
            log.warning("Detected circular xref chain at %d", start)
            return
        # Look for an XRefStream first, then an XRefTable
        if INDOBJR.match(self.buffer, start):
            log.debug("Reading xref stream at %d", start)
            # XRefStream: PDF-1.5
            xref: XRef = XRefStream(self, pos=start, offset=self._offset)
        elif m := XREFR.match(self.buffer, start):
            log.debug("Reading xref table at %d", m.start(1))
            xref = XRefTable(self, pos=m.start(1), offset=self._offset)
        else:
            # Well, maybe it's an XRef table without "xref" (but
            # probably not)
            xref = XRefTable(self, pos=start, offset=self._offset)
        self._xrefpos.add(start)
        xrefs.append(xref)
        trailer = xref.trailer
        # For hybrid-reference files, an additional set of xrefs as a
        # stream.
        if "XRefStm" in trailer:
            pos = int_value(trailer["XRefStm"])
            self._read_xrefs_into(pos + self._offset, xrefs)
        # Recurse into any previous xref tables or streams
        if "Prev" in trailer:
            # find previous xref
            pos = int_value(trailer["Prev"])
            self._read_xrefs_into(pos + self._offset, xrefs)

    @property
    def catalog(self) -> Dict[str, Any]:
        if self._catalog is not None:
            return self._catalog
        self._catalog = {}
        if "Root" in self.trailer:
            # Every PDF file must have exactly one /Root dictionary.
            try:
                self._catalog = dict_value(self.trailer["Root"])
            except TypeError:
                log.warning("Root is a broken reference (incorrect xref table?)")
        else:
            log.warning("No /Root object! - Is this really a PDF?")
        if "Type" in self._catalog and self._catalog["Type"] is not LITERAL_CATALOG:
            log.warning(f"Catalog doesn't seem to be a catalog: {self._catalog!r}")
        return self._catalog

    @property
    def is_tagged(self) -> bool:
        markinfo = resolve1(self.catalog.get("MarkInfo"))
        if isinstance(markinfo, dict):
            return bool(markinfo.get("Marked"))
        return False

    @property
    def pdf_version(self) -> str:
        if "Version" in self.catalog:
            log.debug(
                "Using PDF version %r from catalog instead of %r from header",
                self.catalog["Version"],
                self._pdf_version,
            )
            return literal_name(self.catalog["Version"])
        return self._pdf_version

    def __len__(self) -> int:
        """Return the number of indirect objects in this PDF.

        Danger: This number is unreliable and ephemeral.
            In a conforming PDF, the number of objects is declared by
            the `/Size` key in the document trailer, but conforming
            PDFs do not exist in the real world.  Upon opening a PDF,
            the trailer value will be returned here, but once the
            cross-reference tables have been loaded, a *different*
            number will be returned, and in the case where the
            cross-reference tables are invalid and must be
            regenerated, this value will be updated yet again.  This
            is the price of laziness (not to be confused with the
            wages of sin).
        """
        size = self.trailer.get("Size", None)
        if isinstance(size, int):
            return size
        size = sum(len(x) for x in self.xrefs)
        self.trailer["Size"] = size
        log.debug("Updated /Size in trailer to %d", size)
        return size

    def __iter__(self) -> Iterator[int]:
        """Iterate over object IDs"""
        return itertools.chain.from_iterable(self.xrefs)

    @property
    def objects(self) -> Iterator[IndirectObject]:
        """Iterate over all indirect objects (including, then expanding object
        streams)"""
        for _, obj in IndirectObjectParser(
            self.buffer, self, pos=self._offset, strict=self.parser.strict
        ):
            yield obj
            if (
                isinstance(obj.obj, ContentStream)
                and obj.obj.get("Type") is LITERAL_OBJSTM
            ):
                parser = ObjectStreamParser(obj.obj, self)
                for _, sobj in parser:
                    yield sobj

    @property
    def tokens(self) -> Iterator[Token]:
        """Iterate over tokens."""
        return (tok for pos, tok in Lexer(self.buffer))

    @property
    def structure(self) -> Union[Tree, None]:
        """Logical structure of this document, if any.

        In the case where no logical structure tree exists, this will
        be `None`.  Otherwise you may iterate over it, search it, etc.

        We do this instead of simply returning an empty structure tree
        because the vast majority of PDFs have no logical structure.
        Also, because the structure is a lazy object (the type
        signature here may change to `Iterable[Element]` at some
        point) there is no way to know if it's empty without iterating
        over it.

        """
        if self._structure is not None:
            return self._structure
        try:
            self._structure = Tree(self)
        except (TypeError, KeyError):
            self._structure = None
        return self._structure

    def _getobj_objstm(
        self, stream: ContentStream, index: int, objid: int
    ) -> PDFObject:
        if stream.objid in self._parsed_objs:
            (objs, n) = self._parsed_objs[stream.objid]
        else:
            (objs, n) = self._get_objects(stream)
            assert stream.objid is not None
            self._parsed_objs[stream.objid] = (objs, n)
        i = n * 2 + index
        try:
            obj = objs[i]
        except IndexError as e:
            raise PDFSyntaxError(
                "index %d + %d too big for stream of %d objects"
                % (n * 2, index, len(objs))
            ) from e
        return obj

    def _get_objects(self, stream: ContentStream) -> Tuple[List[PDFObject], int]:
        if stream.get("Type") is not LITERAL_OBJSTM:
            log.warning("Content stream Type is not /ObjStm: %r" % stream)
        try:
            n = int_value(stream["N"])
        except KeyError:
            log.warning("N is not defined in content stream: %r" % stream)
            n = 0
        except TypeError:
            log.warning("N is invalid in content stream: %r" % stream)
            n = 0
        parser = ObjectParser(stream.buffer, self)
        objs: List[PDFObject] = [obj for _, obj in parser]
        return (objs, n)

    def _getobj_parse(self, pos: int, objid: int) -> PDFObject:
        self.parser.seek(pos)
        try:
            m = INDOBJR.match(self.buffer, pos)
            if m is None:
                raise PDFSyntaxError(
                    f"Not an indirect object at position {pos}: "
                    f"{self.buffer[pos : pos + 8]!r}"
                )
            _, obj = next(self.parser)
            if obj.objid != objid:
                raise PDFSyntaxError(f"objid mismatch: {obj.objid!r}={objid!r}")
        except (ValueError, IndexError, PDFSyntaxError) as e:
            raise PDFSyntaxError(
                "Indirect object %d not found at position %d"
                % (
                    objid,
                    pos,
                )
            ) from e
        if obj.objid != objid:
            raise PDFSyntaxError(f"objid mismatch: {obj.objid!r}={objid!r}")
        return obj.obj

    def __getitem__(self, objid: int) -> PDFObject:
        """Get an indirect object from the PDF.

        Note that the behaviour in the case of a non-existent object
        (raising `KeyError`), while Pythonic, is not PDFic, as PDF
        1.7 sec 7.3.10 states:

        > An indirect reference to an undefined object shall not be
        considered an error by a conforming reader; it shall be
        treated as a reference to the null object.

        Raises:
          ValueError: if Document cannot be initialized
          KeyError: if objid does not exist in PDF

        """
        if objid == 0:
            raise KeyError("PDF object id cannot be 0.")

        if objid in self._cached_objs:
            if self._cached_objs[objid] is None:
                raise KeyError(f"Object with ID {objid} not found")
            return self._cached_objs[objid]
        obj = None

        for xref in self.xrefs:
            try:
                (strmid, index, genno) = xref[objid]
            except KeyError:
                continue
            try:
                if strmid is not None:
                    stream = stream_value(self[strmid])
                    obj = self._getobj_objstm(stream, index, objid)
                else:
                    try:
                        obj = self._getobj_parse(index, objid)
                    except PDFSyntaxError as e:
                        log.warning(
                            "Indirect object %d not found at position %d: %r",
                            objid,
                            index,
                            e,
                        )
                        # xref tables are clearly borked, so
                        # rebuild them and try again
                        log.warning("Rebuilding xref table from object parser")
                        fallback = XRefFallback(self)
                        self.trailer["Size"] = len(fallback)
                        log.debug(
                            "Updated /Size in trailer to %d", self.trailer["Size"]
                        )
                        self._xrefs = [fallback]
                        try:
                            (strmid, index, genno) = self._xrefs[0][objid]
                            obj = self._getobj_parse(index, objid)
                        except (KeyError, PDFSyntaxError) as e:
                            log.warning(
                                "Indirect object %d STILL not found at position %d: %r",
                                objid,
                                index,
                                e,
                            )
                break
            # FIXME: We might not actually want to catch these...
            except StopIteration:
                log.debug("EOF when searching for object %d", objid)
                continue
            except PDFSyntaxError as e:
                log.debug("Syntax error when searching for object %d: %s", objid, e)
                continue

        # Store it anyway as None if we can't find it to avoid costly searching
        self._cached_objs[objid] = obj
        if obj is None:
            raise KeyError(f"Object with ID {objid} not found")
        return self._cached_objs[objid]

    def get_font(
        self, objid: int = 0, spec: Union[Dict[str, PDFObject], None] = None
    ) -> Font:
        if objid and objid in self._cached_fonts:
            return self._cached_fonts[objid]
        if spec is None:
            return Font({}, {})
        # Create a Font object, hopefully
        font: Union[Font, None] = None
        if spec.get("Type") is not LITERAL_FONT:
            log.warning("Font Type is not /Font: %r", spec)
        subtype = spec.get("Subtype")
        if subtype in (LITERAL_TYPE1, LITERAL_MMTYPE1):
            font = Type1Font(spec)
        elif subtype is LITERAL_TRUETYPE:
            font = TrueTypeFont(spec)
        elif subtype == LITERAL_TYPE3:
            font = Type3Font(spec)
        elif subtype == LITERAL_TYPE0:
            if "DescendantFonts" not in spec:
                log.warning("Type0 font has no DescendantFonts: %r", spec)
            else:
                dfonts = list_value(spec["DescendantFonts"])
                if len(dfonts) != 1:
                    log.debug(
                        "Type 0 font should have 1 descendant, has more: %r", dfonts
                    )
                subspec = resolve1(dfonts[0])
                if not isinstance(subspec, dict):
                    log.warning("Invalid descendant font: %r", subspec)
                else:
                    subspec = subspec.copy()
                    # Merge the root and descendant font dictionaries
                    for k in ("Encoding", "ToUnicode"):
                        if k in spec:
                            subspec[k] = resolve1(spec[k])
                    font = CIDFont(subspec)
        else:
            log.warning("Unknown Subtype in font: %r" % spec)
        if font is None:
            # We need a dummy font object to be able to do *something*
            # (even if it's the wrong thing) with text objects.
            font = Font({}, {})
        if objid:
            self._cached_fonts[objid] = font
        return font

    @property
    def fonts(self) -> Mapping[str, Font]:
        """Get the mapping of font names to fonts for this document.

        Note that this can be quite slow the first time it's accessed
        as it must scan every single page in the document.

        Note: Font names may collide.
            Font names are generally understood to be globally unique
            <del>in the neighbourhood</del> in the document, but there's no
            guarantee that this is the case.  In keeping with the
            "incremental update" philosophy dear to PDF, you get the
            last font defined with a given name.
        """
        if self._fontmap is not None:
            return self._fontmap
        self._fontmap: Mapping[str, Font] = FontMapping(self)
        return self._fontmap

    @property
    def outline(self) -> Union[Outlines, None]:
        """Document outline, if any."""
        if "Outlines" not in self.catalog:
            return None
        if self._outline is not None:
            return self._outline
        try:
            self._outline = Outlines(self)
        except TypeError:
            log.warning(
                "Invalid Outlines entry in catalog: %r", self.catalog["Outlines"]
            )
            return None
        return self._outline

    @property
    def page_labels(self) -> Union[Iterator[str], None]:
        """Iterate over page label strings for the PDF document.

        If the document includes page labels, this generates strings,
        one per page, otherwise it is None.

        Warning: Unbounded iterator
            This iterator is unbounded, because the page label tree
            has no relation to the actual page tree, so it is
            recommended to use `pages` instead.
        """
        if self.catalog is None:
            return None
        label_tree_obj = self.catalog.get("PageLabels")
        if label_tree_obj is None:
            return None
        label_tree = NumberTree(label_tree_obj)
        return _iter_labels(label_tree)

    @property
    def pages(self) -> "PageList":
        """Pages of the document as an iterable/addressable `PageList` object."""
        if self._pages is None:
            self._pages = PageList(self)
        return self._pages

    @property
    def names(self) -> Dict[str, Any]:
        """PDF name dictionary (PDF 1.7 sec 7.7.4).

        Raises:
          KeyError: if nonexistent.
        """
        return dict_value(self.catalog["Names"])

    @property
    def destinations(self) -> "Destinations":
        """Named destinations as an iterable/addressable `Destinations` object."""
        if self._destinations is None:
            self._destinations = Destinations(self)
        return self._destinations

`destinations` `property`

Named destinations as an iterable/addressable Destinations object.

`fonts` `property`

Get the mapping of font names to fonts for this document.

Note that this can be quite slow the first time it's accessed as it must scan every single page in the document.

Font names may collide.

Font names are generally understood to be globally unique ~~in the neighbourhood~~ in the document, but there's no guarantee that this is the case. In keeping with the "incremental update" philosophy dear to PDF, you get the last font defined with a given name.
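The "last definition wins" behaviour can be pictured with a plain dictionary. This is only a sketch of the idea, not how the mapping is actually built: `build_font_map` and the sample data are hypothetical, and fonts are stood in for by strings.

```python
from typing import Dict, Iterable, Tuple


def build_font_map(definitions: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each font name to the last definition seen for it."""
    fontmap: Dict[str, str] = {}
    for name, font in definitions:
        # A later definition silently replaces an earlier one with the same name.
        fontmap[name] = font
    return fontmap


# Hypothetical (name, font) pairs in the order the pages are scanned.
fonts_in_page_order = [
    ("F1", "Helvetica (page 1)"),
    ("F2", "Times (page 1)"),
    ("F1", "Courier (page 7)"),
]
fontmap = build_font_map(fonts_in_page_order)
print(fontmap["F1"])
```

If you need every font rather than one per name, iterating over pages and inspecting each page's own fonts is likely the safer route.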

`names` `property`

PDF name dictionary (PDF 1.7 sec 7.7.4).

Raises:

Type	Description
`KeyError`	if nonexistent.

`objects` `property`

Iterate over all indirect objects, yielding each object stream and then the objects it contains.

`outline` `property`

Document outline, if any.

`page_labels` `property`

Iterate over page label strings for the PDF document.

If the document includes page labels, this generates strings, one per page; otherwise it is None.

Unbounded iterator

This iterator is unbounded, because the page label tree has no relation to the actual page tree, so it is recommended to use pages instead.
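If you do consume an unbounded label iterator directly, you have to bound it yourself. A minimal sketch, assuming a hypothetical `endless_labels` generator as a stand-in for the real label iterator and a plain list in place of the page list:

```python
import itertools
from typing import Iterator, List, Tuple


def endless_labels(prefix: str = "", start: int = 1) -> Iterator[str]:
    """Hypothetical stand-in for an unbounded page label iterator."""
    for number in itertools.count(start):
        yield f"{prefix}{number}"


def label_pages(pages: List[str], labels: Iterator[str]) -> List[Tuple[str, str]]:
    """Pair labels with pages; zip stops as soon as the pages run out."""
    return list(zip(labels, pages))


pages = ["cover", "contents", "chapter one"]
labelled = label_pages(pages, endless_labels(prefix="A-"))

# Alternatively, take a fixed number of labels with islice.
first_two = list(itertools.islice(endless_labels(), 2))
```

Calling `list()` on the label iterator alone would never return, which is why pairing it with something finite matters.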

`pages` `property`

Pages of the document as an iterable/addressable PageList object.

`structure` `property`

Logical structure of this document, if any.

In the case where no logical structure tree exists, this will be None. Otherwise you may iterate over it, search it, etc.

We do this instead of simply returning an empty structure tree because the vast majority of PDFs have no logical structure. Also, because the structure is a lazy object (the type signature here may change to Iterable[Element] at some point) there is no way to know if it's empty without iterating over it.
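The reasoning about lazy objects can be seen with an ordinary generator, which is always truthy even when it yields nothing. The sketch below uses hypothetical `lazy_elements` and `structure_or_none` functions to show why `None` is the clearer signal; it does not reproduce the real structure tree.

```python
from typing import Iterator, List, Optional


def lazy_elements(items: List[str]) -> Iterator[str]:
    """A lazy sequence: nothing is computed until it is iterated."""
    for item in items:
        yield item


def structure_or_none(has_tree: bool, items: List[str]) -> Optional[Iterator[str]]:
    """Return None when there is no tree at all, else a lazy iterator."""
    if not has_tree:
        return None
    return lazy_elements(items)


missing = structure_or_none(False, [])
empty = structure_or_none(True, [])

# A lazy object cannot report emptiness up front: it is truthy regardless.
empty_is_truthy = bool(empty)
empty_contents = list(empty) if empty is not None else []
```

So the check to write is `is None`, not a truthiness test.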

`tokens` `property`

Iterate over tokens.

`__getitem__(objid)`

Get an indirect object from the PDF.

Note that the behaviour in the case of a non-existent object (raising KeyError), while Pythonic, is not PDFic, as PDF 1.7 sec 7.3.10 states:

An indirect reference to an undefined object shall not be considered an error by a conforming reader; it shall be treated as a reference to the null object.

Raises:

Type	Description
`ValueError`	if Document cannot be initialized
`KeyError`	if objid does not exist in PDF
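Callers who prefer the behaviour the specification describes can turn the `KeyError` into a null object themselves. Here is a minimal sketch of that, together with the idea of caching failed lookups so they are not repeated. `CachingStore` and `lookup_or_null` are hypothetical names for illustration and are not part of PLAYA.

```python
from typing import Callable, Dict, Optional


class CachingStore:
    """Toy object store that remembers both hits and misses."""

    def __init__(self, loader: Callable[[int], Optional[str]]) -> None:
        self._loader = loader
        self._cache: Dict[int, Optional[str]] = {}
        self.loads = 0  # how many times the (possibly costly) loader ran

    def __getitem__(self, objid: int) -> str:
        if objid == 0:
            raise KeyError("object id cannot be 0")
        if objid not in self._cache:
            self.loads += 1
            # A miss is stored as None so the search is not repeated.
            self._cache[objid] = self._loader(objid)
        obj = self._cache[objid]
        if obj is None:
            raise KeyError(f"Object with ID {objid} not found")
        return obj


def lookup_or_null(store: CachingStore, objid: int) -> Optional[str]:
    """Treat an undefined object as the null object instead of an error."""
    try:
        return store[objid]
    except KeyError:
        return None


store = CachingStore({1: "catalog"}.get)
found = store[1]
first_miss = lookup_or_null(store, 99)
second_miss = lookup_or_null(store, 99)
```

The second miss is answered from the cache, so the loader runs only once per object ID.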

Source code in playa/document.py

def __getitem__(self, objid: int) -> PDFObject:
    """Get an indirect object from the PDF.

    Note that the behaviour in the case of a non-existent object
    (raising `KeyError`), while Pythonic, is not PDFic, as PDF
    1.7 sec 7.3.10 states:

    > An indirect reference to an undefined object shall not be
    considered an error by a conforming reader; it shall be
    treated as a reference to the null object.

    Raises:
      ValueError: if Document cannot be initialized
      KeyError: if objid does not exist in PDF

    """
    if objid == 0:
        raise KeyError("PDF object id cannot be 0.")

    if objid in self._cached_objs:
        if self._cached_objs[objid] is None:
            raise KeyError(f"Object with ID {objid} not found")
        return self._cached_objs[objid]
    obj = None

    for xref in self.xrefs:
        try:
            (strmid, index, genno) = xref[objid]
        except KeyError:
            continue
        try:
            if strmid is not None:
                stream = stream_value(self[strmid])
                obj = self._getobj_objstm(stream, index, objid)
            else:
                try:
                    obj = self._getobj_parse(index, objid)
                except PDFSyntaxError as e:
                    log.warning(
                        "Indirect object %d not found at position %d: %r",
                        objid,
                        index,
                        e,
                    )
                    # xref tables are clearly borked, so
                    # rebuild them and try again
                    log.warning("Rebuilding xref table from object parser")
                    fallback = XRefFallback(self)
                    self.trailer["Size"] = len(fallback)
                    log.debug(
                        "Updated /Size in trailer to %d", self.trailer["Size"]
                    )
                    self._xrefs = [fallback]
                    try:
                        (strmid, index, genno) = self._xrefs[0][objid]
                        obj = self._getobj_parse(index, objid)
                    except (KeyError, PDFSyntaxError) as e:
                        log.warning(
                            "Indirect object %d STILL not found at position %d: %r",
                            objid,
                            index,
                            e,
                        )
            break
        # FIXME: We might not actually want to catch these...
        except StopIteration:
            log.debug("EOF when searching for object %d", objid)
            continue
        except PDFSyntaxError as e:
            log.debug("Syntax error when searching for object %d: %s", objid, e)
            continue

    # Store it anyway as None if we can't find it to avoid costly searching
    self._cached_objs[objid] = obj
    if obj is None:
        raise KeyError(f"Object with ID {objid} not found")
    return self._cached_objs[objid]

`__iter__()`

Iterate over object IDs
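Since the IDs come from chaining every cross-reference section together, an ID listed in more than one section (as can happen after an incremental update) would presumably be yielded more than once. A small sketch with made-up sections, using plain dictionaries in place of real cross-reference objects:

```python
import itertools

# Hypothetical cross-reference sections mapping object ID to file offset.
xref_sections = [
    {3: 900, 4: 1200},       # a later update that redefines object 3
    {1: 15, 2: 80, 3: 300},  # the original section
]

# Iterating a dict yields its keys, so this chains the IDs of every section.
all_ids = list(itertools.chain.from_iterable(xref_sections))

# dict.fromkeys keeps the first occurrence of each ID and preserves order.
unique_ids = list(dict.fromkeys(all_ids))
```

Deduplicating on the caller's side is cheap if you need each ID exactly once.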

Source code in playa/document.py

def __iter__(self) -> Iterator[int]:
    """Iterate over object IDs"""
    return itertools.chain.from_iterable(self.xrefs)

`__len__()`

Return the number of indirect objects in this PDF.

This number is unreliable and ephemeral.

In a conforming PDF, the number of objects is declared by the /Size key in the document trailer, but conforming PDFs do not exist in the real world. Upon opening a PDF, the trailer value will be returned here, but once the cross-reference tables have been loaded, a different number will be returned, and in the case where the cross-reference tables are invalid and must be regenerated, this value will be updated yet again. This is the price of laziness (not to be confused with the wages of sin).
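The way the reported length can change is easier to see in a toy model. The `LazyDoc` class below is a hypothetical simplification: it trusts the declared size until its tables are loaded, then replaces it with the count actually found.

```python
from typing import Dict, List


class LazyDoc:
    """Toy document whose length depends on what has been loaded so far."""

    def __init__(self, declared_size: int, tables: List[Dict[int, int]]) -> None:
        self.trailer = {"Size": declared_size}
        self._tables = tables
        self._loaded = False

    def load_xrefs(self) -> List[Dict[int, int]]:
        if not self._loaded:
            self._loaded = True
            # Overwrite the declared size with the number of entries found.
            self.trailer["Size"] = sum(len(t) for t in self._tables)
        return self._tables

    def __len__(self) -> int:
        return self.trailer["Size"]


doc = LazyDoc(declared_size=10, tables=[{1: 0, 2: 0}, {3: 0}])
before = len(doc)
doc.load_xrefs()
after = len(doc)
```

In practice this means `len()` is best treated as a hint rather than cached as a fact.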

Source code in playa/document.py

def __len__(self) -> int:
    """Return the number of indirect objects in this PDF.

    Danger: This number is unreliable and ephemeral.
        In a conforming PDF, the number of objects is declared by
        the `/Size` key in the document trailer, but conforming
        PDFs do not exist in the real world.  Upon opening a PDF,
        the trailer value will be returned here, but once the
        cross-reference tables have been loaded, a *different*
        number will be returned, and in the case where the
        cross-reference tables are invalid and must be
        regenerated, this value will be updated yet again.  This
        is the price of laziness (not to be confused with the
        wages of sin).
    """
    size = self.trailer.get("Size", None)
    if isinstance(size, int):
        return size
    size = sum(len(x) for x in self.xrefs)
    self.trailer["Size"] = size
    log.debug("Updated /Size in trailer to %d", size)
    return size

`PDFObjRef`

Source code in playa/pdftypes.py

class ObjRef:
    def __init__(
        self,
        doc: Union[DocumentRef, None] = None,
        objid: int = 0,
    ) -> None:
        """Reference to a PDF object.

        :param doc: The PDF document.
        :param objid: The object number.
        """
        self.doc = doc
        self.objid = objid

    def __eq__(self, other: Any) -> bool:
        if not isinstance(other, ObjRef):
            raise NotImplementedError("Unimplemented comparison with non-ObjRef")
        if self.doc is None and other.doc is None:
            return self.objid == other.objid
        elif self.doc is None or other.doc is None:
            return False
        else:
            selfdoc = _deref_document(self.doc)
            otherdoc = _deref_document(other.doc)
            return selfdoc is otherdoc and self.objid == other.objid

    def __hash__(self) -> int:
        return self.objid

    def __repr__(self) -> str:
        return "<ObjRef:%d>" % (self.objid)

    def resolve(self, default: PDFObject = None) -> PDFObject:
        if self.doc is None:
            return default
        doc = _deref_document(self.doc)
        try:
            return doc[self.objid]
        except KeyError:
            return default

`__init__(doc=None, objid=0)`

Reference to a PDF object.

Parameters:

Name	Type	Description	Default
`doc`	`Union[DocumentRef, None]`	The PDF document.	`None`
`objid`	`int`	The object number.	`0`

Source code in playa/pdftypes.py

def __init__(
    self,
    doc: Union[DocumentRef, None] = None,
    objid: int = 0,
) -> None:
    """Reference to a PDF object.

    :param doc: The PDF document.
    :param objid: The object number.
    """
    self.doc = doc
    self.objid = objid

`PDFPage`

Bases: Iterable[ContentObject]

An object that holds the information about a page.

Parameters:

Name	Type	Description	Default
`doc`	`Document`	a Document object.	required
`pageid`	`int`	the integer PDF object ID associated with the page in the page tree.	required
`attrs`	`Dict`	a dictionary of page attributes.	required
`label`	`Union[str, None]`	page label string.	required
`page_idx`	`int`	0-based index of the page in the document.	`0`
`space`	`DeviceSpace`	the device space to use for interpreting content	`'screen'`

Attributes:

Name	Type	Description
`pageid`		the integer object ID associated with the page in the page tree
`attrs`		a dictionary of page attributes.
`resources`	`Dict[str, PDFObject]`	a dictionary of resources used by the page.
`mediabox`		the physical size of the page.
`cropbox`		the crop rectangle of the page.
`rotate`		the page rotation (in degrees).
`label`		the page's label (typically, the logical page number).
`page_idx`		0-based index of the page in the document.
`ctm`		coordinate transformation matrix from default user space to page's device space
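
To make the relationship between `mediabox`, `rotate`, `space` and `ctm` more concrete, here is a minimal standalone sketch that follows the arithmetic of `set_initial_ctm` in the source listing below. The helpers `compose`, `apply_matrix` and `initial_ctm` are written for this example only and are not PLAYA functions; matrices are the usual six-element PDF tuples `(a, b, c, d, e, f)`.

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]
Rect = Tuple[float, float, float, float]
IDENTITY: Matrix = (1, 0, 0, 1, 0, 0)


def compose(first: Matrix, then: Matrix) -> Matrix:
    """Return the matrix that applies `first`, then `then`."""
    a1, b1, c1, d1, e1, f1 = first
    a0, b0, c0, d0, e0, f0 = then
    return (
        a0 * a1 + c0 * b1,
        b0 * a1 + d0 * b1,
        a0 * c1 + c0 * d1,
        b0 * c1 + d0 * d1,
        a0 * e1 + c0 * f1 + e0,
        b0 * e1 + d0 * f1 + f0,
    )


def apply_matrix(m: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Transform a point: x' = a*x + c*y + e, y' = b*x + d*y + f."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def initial_ctm(mediabox: Rect, rotate: int, space: str) -> Matrix:
    x0, y0, x1, y1 = mediabox
    width, height = x1 - x0, y1 - y0
    # Rotation matrices for the multiples of 90 degrees.
    rotations = {
        90: (0, -1, 1, 0, 0, width),
        180: (-1, 0, 0, -1, width, height),
        270: (0, 1, -1, 0, height, 0),
    }
    ctm = rotations.get(rotate % 360, IDENTITY)
    # Bounding box of the rotated MediaBox.
    corners = [apply_matrix(ctm, x, y) for x in (x0, x1) for y in (y0, y1)]
    bx0 = min(x for x, _ in corners)
    by0 = min(y for _, y in corners)
    by1 = max(y for _, y in corners)
    if space == "screen":  # origin at top left, y grows downwards
        return compose(ctm, (1, 0, 0, -1, -bx0, by1))
    if space == "page":  # origin at bottom left, y grows upwards
        return compose(ctm, (1, 0, 0, 1, -bx0, -by0))
    return IDENTITY  # "default": untransformed user space


letter = (0, 0, 612, 792)
screen_ctm = initial_ctm(letter, 0, "screen")
top_left = apply_matrix(screen_ctm, 0, 792)
```

In `"screen"` space the top-left corner of an unrotated US Letter page, `(0, 792)` in default user space, ends up at the origin.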

Source code in playa/page.py

class Page(Iterable[ContentObject]):
    """An object that holds the information about a page.

    Args:
      doc: a Document object.
      pageid: the integer PDF object ID associated with the page in the page tree.
      attrs: a dictionary of page attributes.
      label: page label string.
      page_idx: 0-based index of the page in the document.
      space: the device space to use for interpreting content

    Attributes:
      pageid: the integer object ID associated with the page in the page tree
      attrs: a dictionary of page attributes.
      resources: a dictionary of resources used by the page.
      mediabox: the physical size of the page.
      cropbox: the crop rectangle of the page.
      rotate: the page rotation (in degrees).
      label: the page's label (typically, the logical page number).
      page_idx: 0-based index of the page in the document.
      ctm: coordinate transformation matrix from default user space to
           page's device space
    """

    _structmap: Union[PageStructure, None] = None
    _marked_contents: Union[ContentSequence, None] = None
    _fontmap: Union[Mapping[str, "Font"], None] = None
    _textmap: Union[List[List[str]], None] = None

    def __init__(
        self,
        doc: "Document",
        pageid: int,
        attrs: Dict,
        label: Union[str, None],
        page_idx: int = 0,
        space: DeviceSpace = "screen",
    ) -> None:
        self.docref = _ref_document(doc)
        self.pageid = pageid
        self.attrs = attrs
        self.label = label
        self.page_idx = page_idx
        self.space = space
        self.pageref = _ref_page(self)
        self.lastmod = resolve1(self.attrs.get("LastModified"))
        try:
            self.resources: Dict[str, PDFObject] = dict_value(
                self.attrs.get("Resources")
            )
        except TypeError:
            log.warning("Resources missing or invalid from Page id %d", pageid)
            self.resources = {}
        try:
            self.mediabox = normalize_rect(rect_value(self.attrs["MediaBox"]))
        except KeyError:
            log.warning(
                "MediaBox missing from Page id %d (and not inherited),"
                " defaulting to US Letter (612x792)",
                pageid,
            )
            self.mediabox = (0, 0, 612, 792)
        except (ValueError, PDFSyntaxError):
            log.warning(
                "MediaBox %r invalid in Page id %d, defaulting to US Letter (612x792)",
                self.attrs["MediaBox"],
                pageid,
            )
            self.mediabox = (0, 0, 612, 792)
        self.cropbox = self.mediabox
        if "CropBox" in self.attrs:
            try:
                self.cropbox = normalize_rect(rect_value(self.attrs["CropBox"]))
            except (ValueError, PDFSyntaxError):
                log.warning(
                    "Invalid CropBox %r in /Page, defaulting to MediaBox",
                    self.attrs["CropBox"],
                )

        # This is supposed to be an int, but be robust to bogus PDFs where it isn't
        rotate = int(num_value(self.attrs.get("Rotate", 0)))
        self.set_initial_ctm(space, rotate)

        contents = resolve1(self.attrs.get("Contents"))
        if contents is None:
            self._contents = []
        else:
            if isinstance(contents, list):
                self._contents = contents
            else:
                self._contents = [contents]

    def set_initial_ctm(self, space: DeviceSpace, rotate: int) -> Matrix:
        """
        Set or update initial coordinate transform matrix.

        PDF 1.7 section 8.4.1: Initial value: a matrix that
        transforms default user coordinates to device coordinates.

        We keep this as `self.ctm` in order to transform layout
        attributes in tagged PDFs which are specified in default
        user space (PDF 1.7 section 14.8.5.4.3, table 344)

        If you wish to modify the rotation or the device space of the
        page, then you can do it with this method (the initial values
        are in the `rotate` and `space` properties).
        """
        # Normalize the rotation value
        rotate = (rotate + 360) % 360
        x0, y0, x1, y1 = self.mediabox
        width = x1 - x0
        height = y1 - y0
        self.ctm = MATRIX_IDENTITY
        if rotate == 90:
            # x' = y
            # y' = width - x
            self.ctm = (0, -1, 1, 0, 0, width)
        elif rotate == 180:
            # x' = width - x
            # y' = height - y
            self.ctm = (-1, 0, 0, -1, width, height)
        elif rotate == 270:
            # x' = height - y
            # y' = x
            self.ctm = (0, 1, -1, 0, height, 0)
        elif rotate != 0:
            log.warning(
                "Invalid rotation value %r (only multiples of 90 accepted)", rotate
            )
        # Apply this to the mediabox to determine device space
        (x0, y0, x1, y1) = transform_bbox(self.ctm, self.mediabox)
        width = x1 - x0
        height = y1 - y0
        # "screen" device space: origin is top left of MediaBox
        if space == "screen":
            self.ctm = mult_matrix(self.ctm, (1, 0, 0, -1, -x0, y1))
        # "page" device space: origin is bottom left of MediaBox
        elif space == "page":
            self.ctm = mult_matrix(self.ctm, (1, 0, 0, 1, -x0, -y0))
        # "default" device space: no transformation or rotation
        else:
            if space != "default":
                log.warning("Unknown device space: %r", space)
            self.ctm = MATRIX_IDENTITY
            width = height = 0
        self.space = space
        self.rotate = rotate
        return self.ctm

    @property
    def annotations(self) -> Iterator["Annotation"]:
        """Lazily iterate over page annotations."""
        alist = resolve1(self.attrs.get("Annots"))
        if alist is None:
            return
        if not isinstance(alist, list):
            log.warning("Invalid Annots list: %r", alist)
            return
        for obj in alist:
            try:
                yield Annotation.from_dict(obj, self)
            except (TypeError, ValueError, PDFSyntaxError) as e:
                log.warning("Invalid object %r in Annots: %s", obj, e)
                continue

    @property
    def doc(self) -> "Document":
        """Get associated document if it exists."""
        return _deref_document(self.docref)

    @property
    def streams(self) -> Iterator[ContentStream]:
        """Return resolved content streams."""
        for obj in self._contents:
            try:
                yield stream_value(obj)
            except TypeError:
                log.warning("Found non-stream in contents: %r", obj)

    @property
    def width(self) -> float:
        """Width of the page in default user space units."""
        x0, _, x1, _ = self.mediabox
        return x1 - x0

    @property
    def height(self) -> float:
        """Height of the page in default user space units."""
        _, y0, _, y1 = self.mediabox
        return y1 - y0

    @property
    def contents(self) -> Iterator[PDFObject]:
        """Iterator over PDF objects in the content streams."""
        for _, obj in ContentParser(self._contents, self.doc):
            yield obj

    def __iter__(self) -> Iterator["ContentObject"]:
        """Iterator over lazy layout objects."""
        return self.interp()

    def interp(
        self,
        filter_classes: Union[Collection[Type[ContentObject]], None] = None,
        restrict_ops: Union[Collection[PSKeyword], None] = None,
    ) -> Iterator[ContentObject]:
        return LazyInterpreter(
            self,
            self._contents,
            filter_classes=filter_classes,
            restrict_ops=restrict_ops,
        )

    @property
    def paths(self) -> Iterator["PathObject"]:
        """Iterator over lazy path objects."""
        return self.flatten(PathObject)

    @property
    def images(self) -> Iterator["ImageObject"]:
        """Iterator over lazy image objects."""
        return self.flatten(ImageObject)

    @property
    def texts(self) -> Iterator["TextObject"]:
        """Iterator over lazy text objects."""
        return self.flatten(TextObject)

    @property
    def glyphs(self) -> Iterator["GlyphObject"]:
        """Iterator over lazy glyph objects."""
        for text in self.flatten(TextObject):
            yield from text

    @property
    def xobjects(self) -> Iterator["XObjectObject"]:
        """Return resolved and rendered Form XObjects.

        This does *not* return any image or PostScript XObjects.  You
        can get images via the `images` property.  Apparently you
        aren't supposed to use PostScript XObjects for anything, ever.

        Note that these are the XObjects as rendered on the page, so
        you may see the same named XObject multiple times.  If you
        need to access their actual definitions you'll have to look at
        `page.resources`.

        This will also return Form XObjects within Form XObjects,
        except in the case of circular reference chains.
        """

        from typing import Set

        def xobjects_one(
            itor: Iterable["ContentObject"], parents: Set[int]
        ) -> Iterator["XObjectObject"]:
            for obj in itor:
                if isinstance(obj, XObjectObject):
                    stream_id = 0 if obj.stream.objid is None else obj.stream.objid
                    if stream_id not in parents:
                        yield obj
                        yield from xobjects_one(obj, parents | {stream_id})

        for obj in xobjects_one(self, set()):
            if isinstance(obj, XObjectObject):
                yield obj

    @property
    def tokens(self) -> Iterator[Token]:
        """Iterator over tokens in the content streams."""
        for stream in self._contents:
            for _, tok in Lexer(stream.buffer):
                yield tok

    @property
    def parent_key(self) -> Union[int, None]:
        """Parent tree key for this page, if any."""
        if "StructParents" in self.attrs:
            return int_value(self.attrs["StructParents"])
        return None

    @property
    def structure(self) -> PageStructure:
        """Mapping of marked content IDs to logical structure elements.

        This is a sequence of logical structure elements, or `None`
        for unused marked content IDs.  Note that because structure
        elements may contain multiple marked content sections, the
        same element may occur multiple times in this list.

        It also has `find` and `find_all` methods which allow you to
        access enclosing structural elements (you can also use the
        `parent` method of elements for that)

        Note: This is not the same as `playa.Document.structure`.
            PDF documents have logical structure, but PDF pages **do
            not**, and it is dishonest to pretend otherwise (as some
            code I once wrote unfortunately does).  What they do have
            is marked content sections which correspond to content
            items in the logical structure tree.

        """
        if self._structmap is not None:
            return self._structmap
        self._structmap = PageStructure(self.pageref, [])
        if self.doc.structure is None:
            return self._structmap
        parent_key = self.parent_key
        if parent_key is None:
            return self._structmap
        try:
            self._structmap = PageStructure(
                self.pageref, self.doc.structure.parent_tree[parent_key]
            )
        except (IndexError, TypeError) as e:
            log.warning("Invalid StructParents: %r (%s)", parent_key, e)
        return self._structmap

    @property
    def marked_content(self) -> ContentSequence:
        """A [`ContentSequence`][playa.content.ContentSequence] containing
        content objects associated with the structural elements in
        [`structure`][playa.page.Page.structure].  They consist of a
        sequence with the same indices (these are the marked content
        IDs) as the structure so they can be zipped:

            for element, contents in zip(page.structure,
                                         page.marked_content):
                for obj in contents:
                    ...  # do something with it

        Or you can also access the contents of a single element:

            for obj in page.marked_content[mcid]:
                ... # do something with it

        """
        if self._marked_contents is not None:
            return self._marked_contents
        self._marked_contents = ContentSequence(self)
        return self._marked_contents

    @property
    def fonts(self) -> Mapping[str, "Font"]:
        """Mapping of resource names to fonts for this page.

        Note: This is not the same as `playa.Document.fonts`.
            The resource names (e.g. `F1`, `F42`, `FooBar`) here are
            specific to a page (or Form XObject) resource dictionary
            and have no relation to the font name as commonly
            understood (e.g. `Helvetica`,
            `WQERQE+Arial-SuperBold-HJRE-UTF-8`).  Since font names are
            generally considered to be globally unique, it may be
            possible to access fonts by them in the future.

        Note: This does not include fonts specific to Form XObjects.
            Since it is possible for the resource names to collide,
            this will only return the fonts for a page and not for any
            Form XObjects invoked on it.  You may use
            `XObjectObject.fonts` to access these.

        """
        if self._fontmap is not None:
            return self._fontmap
        self._fontmap = FontMapping(self.resources.get("Font"), self.doc)
        return self._fontmap

    def __repr__(self) -> str:
        return f"<Page: Resources={self.resources!r}, MediaBox={self.mediabox!r}>"

    @overload
    def flatten(self) -> Iterator["ContentObject"]: ...

    @overload
    def flatten(self, filter_class: Type[CO]) -> Iterator[CO]: ...

    @overload
    def flatten(
        self, filter_class: Type[CO], restrict_ops: Collection[PSKeyword]
    ) -> Iterator[CO]: ...

    def flatten(
        self,
        filter_class: Union[None, Type[CO]] = None,
        restrict_ops: Union[Collection[PSKeyword], None] = None,
    ) -> Iterator[Union[CO, "ContentObject"]]:
        """Iterate over content objects, recursing into form XObjects."""

        from typing import Set

        if filter_class is not None:
            filter_classes: Union[Set[Type[ContentObject]], None] = {
                XObjectObject,
                filter_class,
            }
        else:
            filter_classes = None

        def flatten_one(
            itor: Iterable["ContentObject"], parents: Set[int]
        ) -> Iterator["ContentObject"]:
            for obj in itor:
                if isinstance(obj, XObjectObject):
                    stream_id = 0 if obj.stream.objid is None else obj.stream.objid
                    if stream_id not in parents:
                        yield from flatten_one(
                            obj.interp(
                                filter_classes=filter_classes, restrict_ops=restrict_ops
                            ),
                            parents | {stream_id},
                        )
                else:
                    yield obj

        for obj in flatten_one(
            self.interp(filter_classes=filter_classes, restrict_ops=restrict_ops),
            set(),
        ):
            yield obj

    def extract_text(self, *, bbox: Union[Rect, None] = None) -> str:
        """Do some best-effort text extraction.

        This necessarily involves a few heuristics, so don't get your
        hopes up.  It will attempt to use marked content information
        for a tagged PDF, otherwise it will fall back on the character
        displacement and line matrix to determine word and line breaks.

        Args:
          bbox: If not `None`, only text objects which intersect this
                rectangle will be extracted.

        Note: `bbox` operates on text objects, not glyphs.
          You may be surprised to get a lot more text than you
          expected if you call this with a `bbox`.  This is because,
          for efficiency reasons, the intersection check is performed
          on text objects (strings in the content stream) rather than
          glyphs (individual characters).  For many tasks like table
          extraction, this is actually what you want.  Otherwise, you
          will need to use `playa.utils.intersect_rects` on the
          individual glyph bboxes, which is, of course, slow.
        """
        if self.doc.is_tagged:
            return self.extract_text_tagged(bbox=bbox)
        else:
            return self.extract_text_untagged(bbox=bbox)

    def extract_text_untagged(self, *, bbox: Union[Rect, None] = None) -> str:
        """Get text from a page of an untagged PDF."""

        def _extract_text_from_obj(
            obj: "TextObject", vertical: bool, prev_end: float
        ) -> Tuple[str, float]:
            """Try to get text from a text object."""
            chars: List[str] = []
            for glyph in obj:
                x, y = glyph.origin
                off = y if vertical else x
                # 0.5 here is a heuristic!!!
                if prev_end and off - prev_end > 0.5:
                    if chars and chars[-1] != " ":
                        chars.append(" ")
                if glyph.text is not None:
                    chars.append(glyph.text)
                dx, dy = glyph.displacement
                prev_end = off + (dy if vertical else dx)
            return "".join(chars), prev_end

        prev_end = 0.0
        prev_origin: Union[Point, None] = None
        lines = []
        strings: List[str] = []
        itor = self.flatten(filter_class=TextObject, restrict_ops=TEXT_OPERATORS)
        for text in itor:
            if text.gstate.font is None:
                continue
            vertical = text.gstate.font.vertical
            # Track changes to the translation component of text
            # rendering matrix to (yes, heuristically) detect newlines
            # and spaces between text objects
            dx, dy = text.origin
            off = dy if vertical else dx
            if strings and self._next_line(text, prev_origin):
                lines.append("".join(strings))
                strings.clear()
            # 0.5 here is a heuristic!!!
            if strings and off - prev_end > 0.5 and not strings[-1].endswith(" "):
                strings.append(" ")
            textstr, prev_end = _extract_text_from_obj(text, vertical, off)
            if bbox is None:
                strings.append(textstr)
            else:
                ex, ey = (dx, prev_end) if vertical else (prev_end, dy)
                if _crosses_bbox((dx, dy), (ex, ey), bbox):
                    strings.append(textstr)
            prev_origin = dx, dy
        if strings:
            lines.append("".join(strings))
        return "\n".join(lines)

    def _next_line(
        self, text: Union[TextObject, None], prev_offset: Union[Point, None]
    ) -> bool:
        if text is None:
            return False
        if text.gstate.font is None:
            return False
        if prev_offset is None:
            return False
        offset = text.origin

        # Vertical text (usually) means right-to-left lines
        if text.gstate.font.vertical:
            line_offset = offset[0] - prev_offset[0]
        else:
            # The CTM isn't useful here because we actually do care
            # about the final device space, and we just want to know
            # which way is up and which way is down.
            dy = offset[1] - prev_offset[1]
            if self.space == "screen":
                line_offset = -dy
            else:
                line_offset = dy
        return line_offset < 0

    def extract_text_tagged(self, *, bbox: Union[Rect, None] = None) -> str:
        """Get text from a page of a tagged PDF."""
        lines: List[str] = []
        strings: List[str] = []
        prev_mcid: Union[int, None] = None
        prev_origin: Union[Point, None] = None
        # TODO: Iteration over marked content sections and getting
        # their text, origin, and displacement, will be refactored
        itor = self.flatten(filter_class=TextObject, restrict_ops=TEXT_OPERATORS)
        for mcs, texts in itertools.groupby(itor, operator.attrgetter("mcs")):
            text: Union[TextObject, None] = None
            # TODO: Artifact can also be a structure element, but
            # also, any content outside the structure tree is
            # considered an artifact
            if mcs is None or mcs.tag == "Artifact":
                for text in texts:
                    prev_origin = text.origin
                continue
            actual_text = mcs.props.get("ActualText")
            if actual_text is None:
                reversed = mcs.tag == "ReversedChars"
                c = []
                for text in texts:  # noqa: B031
                    if bbox is None:
                        c.append(text.chars[::-1] if reversed else text.chars)
                    else:
                        x, y = text.origin
                        dx, dy = text.displacement
                        ex, ey = (x + dx, y + dy)
                        if _crosses_bbox((x, y), (ex, ey), bbox):
                            c.append(text.chars[::-1] if reversed else text.chars)
                chars = "".join(c)
            else:
                assert isinstance(actual_text, bytes)
                # It's a text string so decode_text it
                chars = decode_text(actual_text)
                # Consume all text objects to ensure correct graphicstate
                for _ in texts:  # noqa: B031
                    pass

            # Remove soft hyphens
            chars = chars.replace("\xad", "")
            # There *might* be a line break, determine based on origin
            if mcs.mcid != prev_mcid:
                if self._next_line(text, prev_origin):
                    lines.extend(textwrap.wrap("".join(strings)))
                    strings.clear()
                prev_mcid = mcs.mcid
            strings.append(chars)
            if text is not None:
                prev_origin = text.origin
        if strings:
            lines.extend(textwrap.wrap("".join(strings)))
        return "\n".join(lines)

`annotations` `property`

Lazily iterate over page annotations.

`contents` `property`

Iterator over PDF objects in the content streams.

`doc` `property`

Get associated document if it exists.

`fonts` `property`

Mapping of resource names to fonts for this page.

Note: This is not the same as `playa.Document.fonts`.

The resource names (e.g. `F1`, `F42`, `FooBar`) here are specific to a page (or Form XObject) resource dictionary and have no relation to the font name as commonly understood (e.g. `Helvetica`, `WQERQE+Arial-SuperBold-HJRE-UTF-8`). Since font names are generally considered to be globally unique, it may be possible to access fonts by them in the future.

Note: This does not include fonts specific to Form XObjects.

Since it is possible for the resource names to collide, this will only return the fonts for a page and not for any Form XObjects invoked on it. You may use `XObjectObject.fonts` to access these.

`glyphs` `property`

Iterator over lazy glyph objects.

`height` `property`

Height of the page in default user space units.

`images` `property`

Iterator over lazy image objects.

`marked_content` `property`

A `ContentSequence` containing content objects associated with the structural elements in `structure`. They form a sequence with the same indices (these are the marked content IDs) as the structure, so the two can be zipped:

for element, contents in zip(page.structure,
                             page.marked_content):
    for obj in contents:
        ...  # do something with it

Or you can also access the contents of a single element:

for obj in page.marked_content[mcid]:
    ... # do something with it

`parent_key` `property`

Parent tree key for this page, if any.

`paths` `property`

Iterator over lazy path objects.

`streams` `property`

Return resolved content streams.

`structure` `property`

Mapping of marked content IDs to logical structure elements.

This is a sequence of logical structure elements, or `None` for unused marked content IDs. Note that because structure elements may contain multiple marked content sections, the same element may occur multiple times in this list.

It also has `find` and `find_all` methods which allow you to access enclosing structural elements (you can also use the `parent` method of elements for that).

Note: This is not the same as `playa.Document.structure`.

PDF documents have logical structure, but PDF pages do not, and it is dishonest to pretend otherwise (as some code I once wrote unfortunately does). What they do have is marked content sections which correspond to content items in the logical structure tree.

`texts` `property`

Iterator over lazy text objects.

`tokens` `property`

Iterator over tokens in the content streams.

`width` `property`

Width of the page in default user space units.

`xobjects` `property`

Return resolved and rendered Form XObjects.

This does not return any image or PostScript XObjects. You can get images via the `images` property. Apparently you aren't supposed to use PostScript XObjects for anything, ever.

Note that these are the XObjects as rendered on the page, so you may see the same named XObject multiple times. If you need to access their actual definitions you'll have to look at `page.resources`.

This will also return Form XObjects within Form XObjects, except in the case of circular reference chains.
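
The way circular reference chains are cut off can be sketched on toy data. In the example below, forms are plain integers and `forms` maps each one to the forms it invokes; `walk_forms` is a hypothetical helper written for this illustration, modelled on the `parents` set used in the source listing above, and is not a PLAYA function.

```python
from typing import Dict, FrozenSet, Iterator, List


def walk_forms(
    forms: Dict[int, List[int]],
    invoked: List[int],
    parents: FrozenSet[int] = frozenset(),
) -> Iterator[int]:
    """Yield every form as rendered, skipping forms already on the ancestor chain."""
    for form_id in invoked:
        if form_id in parents:
            continue  # circular reference: this form is one of its own ancestors
        yield form_id
        # Recurse with the current form added to the chain of ancestors.
        yield from walk_forms(forms, forms.get(form_id, []), parents | {form_id})


# Form 1 invokes form 2, which (incorrectly) invokes form 1 again.
forms = {1: [2], 2: [1], 3: []}
# The page draws form 1, then form 3, then form 1 a second time.
rendered = list(walk_forms(forms, [1, 3, 1]))
```

Because only the ancestor chain is tracked, a form drawn twice on the page is reported twice, while the cycle between forms 1 and 2 is visited only once per invocation.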

`__iter__()`

Iterator over lazy layout objects.

Source code in playa/page.py

def __iter__(self) -> Iterator["ContentObject"]:
    """Iterator over lazy layout objects."""
    return self.interp()

`extract_text(*, bbox=None)`

Do some best-effort text extraction.

This necessarily involves a few heuristics, so don't get your hopes up. It will attempt to use marked content information for a tagged PDF, otherwise it will fall back on the character displacement and line matrix to determine word and line breaks.

Parameters:

Name	Type	Description	Default
`bbox`	`Union[Rect, None]`	If not `None`, only text objects which intersect this rectangle will be extracted.	`None`

Note: `bbox` operates on text objects, not glyphs.

You may be surprised to get a lot more text than you expected if you call this with a `bbox`. This is because, for efficiency reasons, the intersection check is performed on text objects (strings in the content stream) rather than glyphs (individual characters). For many tasks like table extraction, this is actually what you want. Otherwise, you will need to use `playa.utils.intersect_rects` on the individual glyph bboxes, which is, of course, slow.
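
A glyph-level filter could look roughly like the sketch below. It works on toy `(text, bbox)` pairs rather than real glyph objects, and it defines its own `rects_overlap` helper instead of assuming the signature of `playa.utils.intersect_rects`; both `rects_overlap` and `text_in_region` are names invented for this example.

```python
from typing import Iterable, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def rects_overlap(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share some area."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def text_in_region(glyphs: Iterable[Tuple[str, Rect]], region: Rect) -> str:
    """Keep only the glyphs whose own bounding box overlaps the region."""
    return "".join(text for text, bbox in glyphs if rects_overlap(bbox, region))


# One text object's worth of glyphs: "Hi" on the left, "!" far to the right.
glyphs = [
    ("H", (0, 0, 5, 10)),
    ("i", (5, 0, 8, 10)),
    ("!", (50, 0, 53, 10)),
]
inside = text_in_region(glyphs, (0, 0, 10, 10))
```

A check at the text-object level would keep all three characters here, since the string as a whole crosses the region; the per-glyph check drops the `!`.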

Source code in playa/page.py

def extract_text(self, *, bbox: Union[Rect, None] = None) -> str:
    """Do some best-effort text extraction.

    This necessarily involves a few heuristics, so don't get your
    hopes up.  It will attempt to use marked content information
    for a tagged PDF, otherwise it will fall back on the character
    displacement and line matrix to determine word and line breaks.

    Args:
      bbox: If not `None`, only text objects which intersect this
            rectangle will be extracted.

    Note: `bbox` operates on text objects, not glyphs.
      You may be surprised to get a lot more text than you
      expected if you call this with a `bbox`.  This is because,
      for efficiency reasons, the intersection check is performed
      on text objects (strings in the content stream) rather than
      glyphs (individual characters).  For many tasks like table
      extraction, this is actually what you want.  Otherwise, you
      will need to use `playa.utils.intersect_rects` on the
      individual glyph bboxes, which is, of course, slow.
    """
    if self.doc.is_tagged:
        return self.extract_text_tagged(bbox=bbox)
    else:
        return self.extract_text_untagged(bbox=bbox)

`extract_text_tagged(*, bbox=None)`

Get text from a page of a tagged PDF.
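
The grouping logic can be illustrated on toy data. The sketch below is a simplification of the source that follows, under stated assumptions: `Section` and `Text` are hypothetical stand-ins for marked content sections and text objects, `ActualText` is already a decoded string, and the line-break detection and `bbox` filtering are left out.

```python
import itertools
from typing import Iterable, NamedTuple, Optional


class Section(NamedTuple):
    """Toy marked content section."""
    mcid: int
    tag: str
    actual_text: Optional[str] = None


class Text(NamedTuple):
    """Toy text object with the section that encloses it (if any)."""
    mcs: Optional[Section]
    chars: str


def tagged_text(texts: Iterable[Text]) -> str:
    pieces = []
    # Consecutive text objects in the same section are handled together.
    for mcs, group in itertools.groupby(texts, key=lambda t: t.mcs):
        members = list(group)
        # Untagged content and artifacts (page numbers, headers) are skipped.
        if mcs is None or mcs.tag == "Artifact":
            continue
        if mcs.actual_text is not None:
            chars = mcs.actual_text  # replacement text wins over the glyphs
        elif mcs.tag == "ReversedChars":
            chars = "".join(t.chars[::-1] for t in members)
        else:
            chars = "".join(t.chars for t in members)
        pieces.append(chars.replace("\xad", ""))  # drop soft hyphens
    return "".join(pieces)


para = Section(0, "P")
sample = [
    Text(para, "Hel"),
    Text(para, "lo "),
    Text(Section(1, "Artifact"), "Page 1"),
    Text(Section(2, "Span", actual_text="world"), "w0rld"),
    Text(None, "stray"),
]
result = tagged_text(sample)
```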

Source code in playa/page.py

def extract_text_tagged(self, *, bbox: Union[Rect, None] = None) -> str:
    """Get text from a page of a tagged PDF."""
    lines: List[str] = []
    strings: List[str] = []
    prev_mcid: Union[int, None] = None
    prev_origin: Union[Point, None] = None
    # TODO: Iteration over marked content sections and getting
    # their text, origin, and displacement, will be refactored
    itor = self.flatten(filter_class=TextObject, restrict_ops=TEXT_OPERATORS)
    for mcs, texts in itertools.groupby(itor, operator.attrgetter("mcs")):
        text: Union[TextObject, None] = None
        # TODO: Artifact can also be a structure element, but
        # also, any content outside the structure tree is
        # considered an artifact
        if mcs is None or mcs.tag == "Artifact":
            for text in texts:
                prev_origin = text.origin
            continue
        actual_text = mcs.props.get("ActualText")
        if actual_text is None:
            reversed = mcs.tag == "ReversedChars"
            c = []
            for text in texts:  # noqa: B031
                if bbox is None:
                    c.append(text.chars[::-1] if reversed else text.chars)
                else:
                    x, y = text.origin
                    dx, dy = text.displacement
                    ex, ey = (x + dx, y + dy)
                    if _crosses_bbox((x, y), (ex, ey), bbox):
                        c.append(text.chars[::-1] if reversed else text.chars)
            chars = "".join(c)
        else:
            assert isinstance(actual_text, bytes)
            # It's a text string so decode_text it
            chars = decode_text(actual_text)
            # Consume all text objects to ensure correct graphicstate
            for _ in texts:  # noqa: B031
                pass

        # Remove soft hyphens
        chars = chars.replace("\xad", "")
        # There *might* be a line break, determine based on origin
        if mcs.mcid != prev_mcid:
            if self._next_line(text, prev_origin):
                lines.extend(textwrap.wrap("".join(strings)))
                strings.clear()
            prev_mcid = mcs.mcid
        strings.append(chars)
        if text is not None:
            prev_origin = text.origin
    if strings:
        lines.extend(textwrap.wrap("".join(strings)))
    return "\n".join(lines)

`extract_text_untagged(*, bbox=None)`

Get text from a page of an untagged PDF.
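
The word-break heuristic in the source below boils down to comparing each glyph's origin with the point where the previous glyph ended. Here is a minimal sketch for horizontal text only, using toy `(text, x, advance)` triples; `join_with_gaps` is a name invented for this example, and it tracks the first glyph with `None` where the real code uses a falsy `0.0`.

```python
from typing import Iterable, List, Optional, Tuple


def join_with_gaps(
    glyphs: Iterable[Tuple[str, float, float]], threshold: float = 0.5
) -> str:
    """Join glyph texts, inserting a space wherever a horizontal gap appears."""
    chars: List[str] = []
    prev_end: Optional[float] = None
    for text, x, advance in glyphs:
        # A gap wider than the (heuristic) threshold is treated as a space.
        if prev_end is not None and x - prev_end > threshold:
            if chars and chars[-1] != " ":
                chars.append(" ")
        chars.append(text)
        prev_end = x + advance  # where this glyph's displacement leaves us
    return "".join(chars)


# "a" and "b" touch; "c" starts 2 units after "b" ends.
joined = join_with_gaps([("a", 0.0, 5.0), ("b", 5.0, 5.0), ("c", 12.0, 5.0)])
```

The threshold of 0.5 is taken from the source, where it is explicitly marked as a heuristic, so expect it to misfire on tightly or loosely spaced text.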

Source code in playa/page.py

def extract_text_untagged(self, *, bbox: Union[Rect, None] = None) -> str:
    """Get text from a page of an untagged PDF."""

    def _extract_text_from_obj(
        obj: "TextObject", vertical: bool, prev_end: float
    ) -> Tuple[str, float]:
        """Try to get text from a text object."""
        chars: List[str] = []
        for glyph in obj:
            x, y = glyph.origin
            off = y if vertical else x
            # 0.5 here is a heuristic!!!
            if prev_end and off - prev_end > 0.5:
                if chars and chars[-1] != " ":
                    chars.append(" ")
            if glyph.text is not None:
                chars.append(glyph.text)
            dx, dy = glyph.displacement
            prev_end = off + (dy if vertical else dx)
        return "".join(chars), prev_end

    prev_end = 0.0
    prev_origin: Union[Point, None] = None
    lines = []
    strings: List[str] = []
    itor = self.flatten(filter_class=TextObject, restrict_ops=TEXT_OPERATORS)
    for text in itor:
        if text.gstate.font is None:
            continue
        vertical = text.gstate.font.vertical
        # Track changes to the translation component of text
        # rendering matrix to (yes, heuristically) detect newlines
        # and spaces between text objects
        dx, dy = text.origin
        off = dy if vertical else dx
        if strings and self._next_line(text, prev_origin):
            lines.append("".join(strings))
            strings.clear()
        # 0.5 here is a heuristic!!!
        if strings and off - prev_end > 0.5 and not strings[-1].endswith(" "):
            strings.append(" ")
        textstr, prev_end = _extract_text_from_obj(text, vertical, off)
        if bbox is None:
            strings.append(textstr)
        else:
            ex, ey = (dx, prev_end) if vertical else (prev_end, dy)
            if _crosses_bbox((dx, dy), (ex, ey), bbox):
                strings.append(textstr)
        prev_origin = dx, dy
    if strings:
        lines.append("".join(strings))
    return "\n".join(lines)
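
The space-insertion heuristic above can be tried out in isolation. What follows is a minimal sketch, not PLAYA code: glyphs are reduced to hypothetical `(text, x, dx)` tuples (character, horizontal origin, horizontal displacement), `join_glyphs` is an illustrative name, and only the horizontal case is handled. The `0.5` threshold is the same heuristic value used in the source.

```python
from typing import List, Tuple

# Hypothetical stand-in for a glyph: (text, origin x, horizontal displacement)
Glyph = Tuple[str, float, float]


def join_glyphs(glyphs: List[Glyph], threshold: float = 0.5) -> str:
    """Join glyph texts, inserting a space wherever the gap between the
    end of the previous glyph and the origin of the next exceeds threshold."""
    chars: List[str] = []
    prev_end = 0.0
    for text, x, dx in glyphs:
        # Same test as in the source: only compare once we have a previous end
        if prev_end and x - prev_end > threshold:
            if chars and chars[-1] != " ":
                chars.append(" ")
        chars.append(text)
        prev_end = x + dx
    return "".join(chars)


# "H" and "i" touch each other; "y" starts 3 units after "i" ends
glyphs = [("H", 10.0, 6.0), ("i", 16.0, 3.0), ("y", 22.0, 5.0), ("o", 27.0, 5.0)]
joined = join_glyphs(glyphs)
print(joined)
```

Because the threshold is absolute rather than relative to the font size, very small or very large text may be split differently than you expect.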

`flatten(filter_class=None, restrict_ops=None)`

flatten() -> Iterator[ContentObject]

flatten(filter_class: Type[CO]) -> Iterator[CO]

flatten(filter_class: Type[CO], restrict_ops: Collection[PSKeyword]) -> Iterator[CO]

Iterate over content objects, recursing into form XObjects.

Source code in playa/page.py

def flatten(
    self,
    filter_class: Union[None, Type[CO]] = None,
    restrict_ops: Union[Collection[PSKeyword], None] = None,
) -> Iterator[Union[CO, "ContentObject"]]:
    """Iterate over content objects, recursing into form XObjects."""

    from typing import Set

    if filter_class is not None:
        filter_classes: Union[Set[Type[ContentObject]], None] = {
            XObjectObject,
            filter_class,
        }
    else:
        filter_classes = None

    def flatten_one(
        itor: Iterable["ContentObject"], parents: Set[int]
    ) -> Iterator["ContentObject"]:
        for obj in itor:
            if isinstance(obj, XObjectObject):
                stream_id = 0 if obj.stream.objid is None else obj.stream.objid
                if stream_id not in parents:
                    yield from flatten_one(
                        obj.interp(
                            filter_classes=filter_classes, restrict_ops=restrict_ops
                        ),
                        parents | {stream_id},
                    )
            else:
                yield obj

    for obj in flatten_one(
        self.interp(filter_classes=filter_classes, restrict_ops=restrict_ops),
        set(),
    ):
        yield obj
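
The `parents` set is what keeps `flatten` from recursing forever when a form XObject (directly or indirectly) draws itself. Here is a toy sketch of the same guard, assuming a made-up representation where a form is a list whose items are either strings (leaf objects) or integers (references to other forms by ID). `FORMS` and `flatten_items` are hypothetical names, not part of PLAYA.

```python
from typing import Dict, FrozenSet, Iterator, List, Union

# Hypothetical model: str = leaf content object, int = reference to a form by ID
Item = Union[str, int]

FORMS: Dict[int, List[Item]] = {
    1: ["a", 2, "b"],
    2: ["c", 1],  # refers back to form 1, which would loop forever unguarded
}


def flatten_items(
    items: List[Item], parents: FrozenSet[int] = frozenset()
) -> Iterator[str]:
    """Yield leaves depth-first, skipping any form already on the current path."""
    for item in items:
        if isinstance(item, int):
            if item not in parents:
                # Only ancestors are tracked, so a form may still be used
                # twice side by side, just not inside itself
                yield from flatten_items(FORMS[item], parents | {item})
        else:
            yield item


flat = list(flatten_items(["x", 1]))
print(flat)
```

Tracking ancestors rather than every form seen so far matters: a logo form placed twice on the same page should be yielded twice.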

`set_initial_ctm(space, rotate)`

Set or update initial coordinate transform matrix.

PDF 1.7 section 8.4.1: Initial value: a matrix that transforms default user coordinates to device coordinates.

We keep this as self.ctm in order to transform layout attributes in tagged PDFs which are specified in default user space (PDF 1.7 section 14.8.5.4.3, table 344).

If you wish to modify the rotation or the device space of the page, then you can do it with this method (the initial values are in the rotate and space properties).

Source code in playa/page.py

def set_initial_ctm(self, space: DeviceSpace, rotate: int) -> Matrix:
    """
    Set or update initial coordinate transform matrix.

    PDF 1.7 section 8.4.1: Initial value: a matrix that
    transforms default user coordinates to device coordinates.

    We keep this as `self.ctm` in order to transform layout
    attributes in tagged PDFs which are specified in default
    user space (PDF 1.7 section 14.8.5.4.3, table 344)

    If you wish to modify the rotation or the device space of the
    page, then you can do it with this method (the initial values
    are in the `rotate` and `space` properties).
    """
    # Normalize the rotation value
    rotate = (rotate + 360) % 360
    x0, y0, x1, y1 = self.mediabox
    width = x1 - x0
    height = y1 - y0
    self.ctm = MATRIX_IDENTITY
    if rotate == 90:
        # x' = y
        # y' = width - x
        self.ctm = (0, -1, 1, 0, 0, width)
    elif rotate == 180:
        # x' = width - x
        # y' = height - y
        self.ctm = (-1, 0, 0, -1, width, height)
    elif rotate == 270:
        # x' = height - y
        # y' = x
        self.ctm = (0, 1, -1, 0, height, 0)
    elif rotate != 0:
        log.warning(
            "Invalid rotation value %r (only multiples of 90 accepted)", rotate
        )
    # Apply this to the mediabox to determine device space
    (x0, y0, x1, y1) = transform_bbox(self.ctm, self.mediabox)
    width = x1 - x0
    height = y1 - y0
    # "screen" device space: origin is top left of MediaBox
    if space == "screen":
        self.ctm = mult_matrix(self.ctm, (1, 0, 0, -1, -x0, y1))
    # "page" device space: origin is bottom left of MediaBox
    elif space == "page":
        self.ctm = mult_matrix(self.ctm, (1, 0, 0, 1, -x0, -y0))
    # "default" device space: no transformation or rotation
    else:
        if space != "default":
            log.warning("Unknown device space: %r", space)
        self.ctm = MATRIX_IDENTITY
        width = height = 0
    self.space = space
    self.rotate = rotate
    return self.ctm
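
To check the three rotation matrices by hand, recall that a PDF matrix `(a, b, c, d, e, f)` maps a point to `x' = a*x + c*y + e`, `y' = b*x + d*y + f`. The sketch below applies the matrices from the source to a sample point on a US Letter page. `apply_pt` is a local helper written for this example (an assumption, not an import from PLAYA), and the page size is arbitrary.

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]
Point = Tuple[float, float]


def apply_pt(m: Matrix, pt: Point) -> Point:
    """Apply a PDF-style affine matrix (a, b, c, d, e, f) to a point."""
    a, b, c, d, e, f = m
    x, y = pt
    return (a * x + c * y + e, b * x + d * y + f)


width, height = 612, 792  # arbitrary example MediaBox size

# The same matrices that set_initial_ctm builds for each rotation
rot90: Matrix = (0, -1, 1, 0, 0, width)
rot180: Matrix = (-1, 0, 0, -1, width, height)
rot270: Matrix = (0, 1, -1, 0, height, 0)

pt = (100, 200)
print(apply_pt(rot90, pt))   # x' = y,          y' = width - x
print(apply_pt(rot180, pt))  # x' = width - x,  y' = height - y
print(apply_pt(rot270, pt))  # x' = height - y, y' = x
```

The device space adjustment for `"screen"` or `"page"` is then multiplied onto whichever of these was chosen.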

`PDFTypeError`

Bases: PDFException

TypeError, but for PDFs (not a subclass of TypeError, unlike in pdfminer.six)

Source code in playa/miner.py

class PDFTypeError(PDFException):
    """
    TypeError, but for PDFs (not a subclass of TypeError, unlike in pdfminer.six)
    """

    pass

`PDFValueError`

Bases: PDFException

ValueError, but for PDFs (not a subclass of ValueError, unlike in pdfminer.six)

Source code in playa/miner.py

class PDFValueError(PDFException):
    """
    ValueError, but for PDFs (not a subclass of ValueError, unlike in pdfminer.six)
    """

    pass

`PSLiteral`

A class that represents a PostScript literal.

PostScript literals are used as identifiers, such as variable names, property names and dictionary keys. Literals are case sensitive and denoted by a preceding slash sign (e.g. "/Name"). They are globally unique objects stored in PSLiteralTable.

Source code in playa/pdftypes.py

class PSLiteral:
    """A class that represents a PostScript literal.

    PostScript literals are used as identifiers, such as variable
    names, property names and dictionary keys.  Literals are case
    sensitive and denoted by a preceding slash sign (e.g. "/Name").
    They are globally unique objects stored in PSLiteralTable.
    """

    name: str

    def __new__(cls, name: str) -> "PSLiteral":
        if name not in PSLiteralTable:
            PSLiteralTable[name] = object.__new__(cls)
            PSLiteralTable[name].name = name
        return PSLiteralTable[name]

    def __getnewargs__(self) -> Tuple:
        return (self.name,)

    def __repr__(self) -> str:
        return "/%r" % self.name

`Plane`

Bases: Generic[LTComponentT]

A set-like data structure for objects placed on a plane.

Can efficiently find objects in a certain rectangular area. Objects are bucketed into a uniform grid of `gridsize`-unit cells, so a query only inspects the cells that its bounding box overlaps.

Source code in playa/miner.py

class Plane(Generic[LTComponentT]):
    """A set-like data structure for objects placed on a plane.

    Can efficiently find objects in a certain rectangular area.
    Objects are bucketed into a uniform grid of `gridsize`-unit
    cells, so a query only inspects the cells that its bounding
    box overlaps.
    """

    def __init__(self, bbox: Rect, gridsize: int = 50) -> None:
        self._seq: List[LTComponentT] = []  # preserve the object order.
        self._objs: Dict[int, LTComponentT] = {}  # store unique objects
        self._grid: Dict[Point, List[LTComponentT]] = {}
        self.gridsize = gridsize
        (self.x0, self.y0, self.x1, self.y1) = bbox

    def __repr__(self) -> str:
        return "<Plane objs=%r>" % list(self)

    def __iter__(self) -> Iterator[LTComponentT]:
        for obj in self._seq:
            if id(obj) in self._objs:
                yield obj

    def __len__(self) -> int:
        return len(self._objs)

    def __contains__(self, obj: LTComponentT) -> bool:
        return id(obj) in self._objs

    def _getrange(self, bbox: Rect) -> Iterator[Point]:
        (x0, y0, x1, y1) = bbox
        if x1 <= self.x0 or self.x1 <= x0 or y1 <= self.y0 or self.y1 <= y0:
            return
        x0 = max(self.x0, x0)
        y0 = max(self.y0, y0)
        x1 = min(self.x1, x1)
        y1 = min(self.y1, y1)
        for grid_y in drange(y0, y1, self.gridsize):
            for grid_x in drange(x0, x1, self.gridsize):
                yield (grid_x, grid_y)

    def extend(self, objs: Iterable[LTComponentT]) -> None:
        for obj in objs:
            self.add(obj)

    def add(self, obj: LTComponentT) -> None:
        """Place an object."""
        for k in self._getrange((obj.x0, obj.y0, obj.x1, obj.y1)):
            if k not in self._grid:
                r: List[LTComponentT] = []
                self._grid[k] = r
            else:
                r = self._grid[k]
            r.append(obj)
        self._seq.append(obj)
        self._objs[id(obj)] = obj

    def remove(self, obj: LTComponentT) -> None:
        """Displace an object."""
        for k in self._getrange((obj.x0, obj.y0, obj.x1, obj.y1)):
            try:
                self._grid[k].remove(obj)
            except (KeyError, ValueError):
                pass
        del self._objs[id(obj)]

    def find(self, bbox: Rect) -> Iterator[LTComponentT]:
        """Finds objects that are in a certain area."""
        (x0, y0, x1, y1) = bbox
        done: Set[int] = set()
        for k in self._getrange(bbox):
            if k not in self._grid:
                continue
            for obj in self._grid[k]:
                if id(obj) in done:
                    continue
                done.add(id(obj))
                if obj.x1 <= x0 or x1 <= obj.x0 or obj.y1 <= y0 or y1 <= obj.y0:
                    continue
                yield obj

`add(obj)`

Place an object.

Source code in playa/miner.py

def add(self, obj: LTComponentT) -> None:
    """Place an object."""
    for k in self._getrange((obj.x0, obj.y0, obj.x1, obj.y1)):
        if k not in self._grid:
            r: List[LTComponentT] = []
            self._grid[k] = r
        else:
            r = self._grid[k]
        r.append(obj)
    self._seq.append(obj)
    self._objs[id(obj)] = obj

`find(bbox)`

Finds objects that are in a certain area.

Source code in playa/miner.py

def find(self, bbox: Rect) -> Iterator[LTComponentT]:
    """Finds objects that are in a certain area."""
    (x0, y0, x1, y1) = bbox
    done: Set[int] = set()
    for k in self._getrange(bbox):
        if k not in self._grid:
            continue
        for obj in self._grid[k]:
            if id(obj) in done:
                continue
            done.add(id(obj))
            if obj.x1 <= x0 or x1 <= obj.x0 or obj.y1 <= y0 or y1 <= obj.y0:
                continue
            yield obj

`remove(obj)`

Displace an object.

Source code in playa/miner.py

def remove(self, obj: LTComponentT) -> None:
    """Displace an object."""
    for k in self._getrange((obj.x0, obj.y0, obj.x1, obj.y1)):
        try:
            self._grid[k].remove(obj)
        except (KeyError, ValueError):
            pass
    del self._objs[id(obj)]
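
The grid lookup that `Plane` performs can be sketched on its own. This is a simplified stand-in, not the class above: `GridIndex` is a hypothetical name, it stores bare `(x0, y0, x1, y1)` tuples instead of `LTComponent` objects, and it skips the clipping to the page bounding box that `_getrange` does.

```python
from typing import Dict, Iterator, List, Set, Tuple

Rect = Tuple[float, float, float, float]
Cell = Tuple[int, int]


class GridIndex:
    """Toy spatial index: each box is listed in every grid cell it touches."""

    def __init__(self, gridsize: int = 50) -> None:
        self.gridsize = gridsize
        self._grid: Dict[Cell, List[Rect]] = {}

    def _cells(self, bbox: Rect) -> Iterator[Cell]:
        x0, y0, x1, y1 = bbox
        d = self.gridsize
        # Same arithmetic as drange(): cell indices covering [v0, v1]
        for gy in range(int(y0) // d, int(y1 + d) // d):
            for gx in range(int(x0) // d, int(x1 + d) // d):
                yield (gx, gy)

    def add(self, box: Rect) -> None:
        for cell in self._cells(box):
            self._grid.setdefault(cell, []).append(box)

    def find(self, bbox: Rect) -> Iterator[Rect]:
        x0, y0, x1, y1 = bbox
        done: Set[int] = set()
        for cell in self._cells(bbox):
            for box in self._grid.get(cell, []):
                if id(box) in done:
                    continue  # already seen via another cell
                done.add(id(box))
                bx0, by0, bx1, by1 = box
                # Sharing a cell is not enough: check for real overlap
                if bx1 <= x0 or x1 <= bx0 or by1 <= y0 or y1 <= by0:
                    continue
                yield box


index = GridIndex()
a = (10, 10, 40, 40)
b = (200, 200, 260, 260)
c = (30, 30, 120, 60)
for box in (a, b, c):
    index.add(box)

hits = list(index.find((0, 0, 50, 50)))
print(hits)
```

Only the cells overlapping the query are visited, so `b` is never even looked at in the query above.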

`decode_text(s)`

Decodes a text string (see PDF 1.7 section 7.9.2.2 - it could be PDFDocEncoding or UTF-16BE) to a str.

Source code in playa/utils.py

def decode_text(s: bytes) -> str:
    """Decodes a text string (see PDF 1.7 section 7.9.2.2 - it could
    be PDFDocEncoding or UTF-16BE) to a `str`.
    """
    # Sure, it could be UTF-16LE... \/\/hatever...
    if isinstance(s, bytes) and (
        s.startswith(b"\xfe\xff") or s.startswith(b"\xff\xfe")
    ):
        try:
            return s.decode("UTF-16")
        except UnicodeDecodeError:
            # Sure, it could have a BOM and not actually be UTF-16, \/\/TF...
            s = s[2:]
    try:
        return "".join(PDFDocEncoding[c] for c in s)
    except IndexError:
        # This is obviously wrong, but a reasonable fallback
        return s.decode("iso-8859-1")
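
The decision logic is simply "BOM means UTF-16, anything else is a single-byte encoding". A small sketch of that logic is below. Note the loud assumption: it uses ISO-8859-1 as a stand-in for the real `PDFDocEncoding` table (the two differ for some byte values), and `decode_text_string` is a hypothetical name for this example.

```python
def decode_text_string(s: bytes) -> str:
    """Decode a PDF text string: UTF-16 if it starts with a BOM,
    otherwise a single-byte encoding."""
    if s.startswith(b"\xfe\xff") or s.startswith(b"\xff\xfe"):
        try:
            # The generic "utf-16" codec reads the BOM to pick the byte order
            return s.decode("utf-16")
        except UnicodeDecodeError:
            # BOM was a lie: drop it and fall through to the single-byte path
            s = s[2:]
    # ASSUMPTION: Latin-1 stands in for PDFDocEncoding in this sketch
    return s.decode("iso-8859-1")


big_endian = decode_text_string(b"\xfe\xff\x00H\x00i")
little_endian = decode_text_string(b"\xff\xfeH\x00i\x00")
single_byte = decode_text_string(b"Caf\xe9")
print(big_endian, little_endian, single_byte)
```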

`drange(v0, v1, d)`

Returns a discrete range.

Source code in playa/miner.py

def drange(v0: float, v1: float, d: int) -> range:
    """Returns a discrete range."""
    return range(int(v0) // d, int(v1 + d) // d)
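
A few concrete values make the rounding behaviour clearer. The function body below is copied from the source above so that the example runs on its own; the inputs are arbitrary.

```python
def drange(v0: float, v1: float, d: int) -> range:
    """Indices of the d-sized cells that cover the interval [v0, v1]."""
    return range(int(v0) // d, int(v1 + d) // d)


# An interval from 0 to 120 with 50-unit cells touches cells 0, 1 and 2
wide = list(drange(0, 120, 50))

# An interval entirely inside the first cell touches only cell 0
narrow = list(drange(10, 20, 50))

# An interval ending exactly on a cell boundary also includes the next cell
boundary = list(drange(49.9, 50.0, 50))

print(wide, narrow, boundary)
```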

`extract(path, laparams=None, max_workers=1, mp_context=None)`

Extract LTPages from a document.

Source code in playa/miner.py

def extract(
    path: Path,
    laparams: Union[LAParams, None] = None,
    max_workers: Union[int, None] = 1,
    mp_context: Union[BaseContext, None] = None,
) -> Iterator[LTPage]:
    """Extract LTPages from a document."""
    if max_workers is None:
        max_workers = multiprocessing.cpu_count()
    with playa.open(
        path,
        space="page",
        max_workers=max_workers,
        mp_context=mp_context,
    ) as pdf:
        if max_workers == 1:
            for page in pdf.pages:
                yield extract_page(page, laparams)
        else:
            yield from pdf.pages.map(partial(extract_page, laparams=laparams))

`extract_page(page, laparams=None)`

Extract an LTPage from a Page, and possibly do some layout analysis.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `page` | `Page` | a Page as returned by PLAYA (please create this with space="page" if you want pdfminer.six compatibility). | required |
| `laparams` | `Union[LAParams, None]` | if None, no layout analysis is done. Otherwise do some kind of heuristic magic that all "Artificial Intelligence" depends on but nobody actually understands. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `LTPage` | An analysis of the page as `pdfminer.six` would give you. |

Source code in playa/miner.py

def extract_page(page: Page, laparams: Union[LAParams, None] = None) -> LTPage:
    """Extract an LTPage from a Page, and possibly do some layout analysis.

    Args:
        page: a Page as returned by PLAYA (please create this with
              space="page" if you want pdfminer.six compatibility).
        laparams: if None, no layout analysis is done. Otherwise do
                  some kind of heuristic magic that all "Artificial
                  Intelligence" depends on but nobody actually
                  understands.

    Returns:
        An analysis of the page as `pdfminer.six` would give you.
    """
    # This is the mediabox in device space rather than default user
    # space, which is the source of some confusion
    (x0, y0, x1, y1) = page.mediabox
    # Note that a page can never be rotated by a non-multiple of 90
    # degrees (pi / 2 for nerds) so that's why we only care about two
    # of its corners
    (x0, y0) = apply_matrix_pt(page.ctm, (x0, y0))
    (x1, y1) = apply_matrix_pt(page.ctm, (x1, y1))
    # FIXME: The translation of the mediabox here is useless due to
    # the above transformation (but this should be verified against
    # pdfminer.six)
    mediabox = (0, 0, abs(x0 - x1), abs(y0 - y1))
    ltpage = LTPage(page.page_idx + 1, mediabox)

    # Emulating PDFLayoutAnalyzer is fairly simple and maps almost
    # directly onto PLAYA's lazy API.  XObjects and inline images
    # produce an LTFigure, characters produce an LTChar, everything
    # else produces an LTLine, LTRect, or LTCurve.
    for obj in page:
        # Put this in some functions to avoid isinstance abuse
        for item in process_object(obj):
            ltpage.add(item)

    if laparams is not None:
        ltpage.analyze(laparams)

    return ltpage

`fsplit(pred, objs)`

Split a list into two classes according to the predicate.

Source code in playa/miner.py

def fsplit(pred: Callable[[_T], bool], objs: Iterable[_T]) -> Tuple[List[_T], List[_T]]:
    """Split a list into two classes according to the predicate."""
    t = []
    f = []
    for obj in objs:
        if pred(obj):
            t.append(obj)
        else:
            f.append(obj)
    return t, f

`make_path_segment(op, points)`

Create a type-safe PathSegment, unlike pdfminer.six.

Source code in playa/miner.py

def make_path_segment(op: PathOperator, points: List[Point]) -> PathSegment:
    """Create a type-safe PathSegment, unlike pdfminer.six."""
    if len(points) == 0:
        if op != "h":
            raise ValueError(f"Incorrect arguments for {op!r}: {points!r}")
        return (str(op),)
    if len(points) == 1:
        if op not in "ml":
            raise ValueError(f"Incorrect arguments for {op!r}: {points!r}")
        return (str(op), points[0])
    if len(points) == 2:
        if op not in "vy":
            raise ValueError(f"Incorrect arguments for {op!r}: {points!r}")
        return (str(op), points[0], points[1])
    if len(points) == 3:
        if op != "c":
            raise ValueError(f"Incorrect arguments for {op!r}: {points!r}")
        return (str(op), points[0], points[1], points[2])
    raise ValueError(f"Path segment has unknown number of points: {op!r} {points!r}")

`process_object(obj)`

Handle `obj` according to its type.

Source code in playa/miner.py

@singledispatch
def process_object(obj: ContentObject) -> Iterator[LTComponent]:
    """Handle obj according to its type"""
    yield from ()
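
The base implementation yields nothing; the real work happens in handlers registered for specific object types, which is how the "avoid isinstance abuse" comment in `extract_page` is honoured. Here is a self-contained sketch of the `functools.singledispatch` pattern. The `Glyph`, `Image` and `Other` classes and the `describe` function are invented for the example and do not correspond to PLAYA's content object classes.

```python
from dataclasses import dataclass
from functools import singledispatch
from typing import Iterator


@dataclass
class Glyph:
    text: str


@dataclass
class Image:
    name: str


@dataclass
class Other:
    pass


@singledispatch
def describe(obj: object) -> Iterator[str]:
    """Fallback: unknown object types produce no items."""
    yield from ()


@describe.register
def _(obj: Glyph) -> Iterator[str]:
    # Chosen because the first argument is annotated as Glyph
    yield f"char:{obj.text}"


@describe.register
def _(obj: Image) -> Iterator[str]:
    yield f"figure:{obj.name}"


items = [item for obj in (Glyph("a"), Other(), Image("Im1")) for item in describe(obj)]
print(items)
```

Since every handler is a generator, a handler is free to yield zero, one or several items per input object.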

`resolve1(x, default=None)`

Resolves an object.

If this is an array or dictionary, it may still contain some indirect objects inside.

Source code in playa/pdftypes.py

def resolve1(x: PDFObject, default: PDFObject = None) -> PDFObject:
    """Resolves an object.

    If this is an array or dictionary, it may still contain
    some indirect objects inside.
    """
    while isinstance(x, ObjRef):
        x = x.resolve(default=default)
    return x

`resolve_all(x, default=None)`

Resolves all indirect object references inside the given object.

This creates new copies of any lists or dictionaries, so the original object is not modified. However, it will ultimately create circular references if they exist, so beware.

Source code in playa/pdftypes.py

def resolve_all(x: PDFObject, default: PDFObject = None) -> PDFObject:
    """Resolves all indirect object references inside the given object.

    This creates new copies of any lists or dictionaries, so the
    original object is not modified.  However, it will ultimately
    create circular references if they exist, so beware.
    """

    def resolver(
        x: PDFObject, default: PDFObject, seen: Dict[int, PDFObject]
    ) -> PDFObject:
        if isinstance(x, ObjRef):
            ref = x
            while isinstance(x, ObjRef):
                if x.objid in seen:
                    return seen[x.objid]
                x = x.resolve(default=default)
            seen[ref.objid] = x
        if isinstance(x, list):
            return [resolver(v, default, seen) for v in x]
        elif isinstance(x, dict):
            return {k: resolver(v, default, seen) for k, v in x.items()}
        return x

    return resolver(x, default, {})
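
The difference between `resolve1` and `resolve_all` is easiest to see on a nested object. The sketch below re-creates both against a toy object store. `Ref`, `STORE` and the simplified signatures (no `default` argument) are assumptions made for this example, not PLAYA's `ObjRef` API.

```python
from typing import Any, Dict, Optional

# Hypothetical object store: object ID -> object (which may contain more refs)
STORE: Dict[int, Any] = {}


class Ref:
    """Toy indirect reference."""

    def __init__(self, objid: int) -> None:
        self.objid = objid

    def resolve(self) -> Any:
        return STORE.get(self.objid)


def resolve1(x: Any) -> Any:
    """Follow references until we reach something that is not a reference."""
    while isinstance(x, Ref):
        x = x.resolve()
    return x


def resolve_all(x: Any, seen: Optional[Dict[int, Any]] = None) -> Any:
    """Like resolve1, but also descend into lists and dictionaries."""
    seen = {} if seen is None else seen
    if isinstance(x, Ref):
        ref = x
        while isinstance(x, Ref):
            if x.objid in seen:
                return seen[x.objid]  # already visited: stop here
            x = x.resolve()
        seen[ref.objid] = x
    if isinstance(x, list):
        return [resolve_all(v, seen) for v in x]
    if isinstance(x, dict):
        return {k: resolve_all(v, seen) for k, v in x.items()}
    return x


STORE.update({1: Ref(2), 2: {"Kids": [Ref(3)]}, 3: "leaf"})

shallow = resolve1(Ref(1))   # the dictionary, with a Ref still inside
deep = resolve_all(Ref(1))   # a new dictionary with everything resolved
print(shallow, deep)
```

Note that the stored dictionary is left untouched, since lists and dictionaries are rebuilt rather than modified in place.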

`subpaths(path)`

Iterate over "subpaths".

Note: subpaths inherit the values of fill and evenodd from the parent path, but these values are no longer meaningful since the winding rules must be applied to the composite path as a whole (this is not a bug, just don't rely on them to know which regions are filled or not).

Source code in playa/miner.py

def subpaths(path: PathObject) -> Iterator[PathObject]:
    """Iterate over "subpaths".

    Note: subpaths inherit the values of `fill` and `evenodd` from
    the parent path, but these values are no longer meaningful
    since the winding rules must be applied to the composite path
    as a whole (this is not a bug, just don't rely on them to know
    which regions are filled or not).

    """
    # FIXME: Is there an itertool or a more_itertool for this?
    segs: List[PLAYAPathSegment] = []
    for seg in path.raw_segments:
        if seg.operator == "m" and segs:
            yield PathObject(
                _pageref=path._pageref,
                _parentkey=path._parentkey,
                gstate=path.gstate,
                ctm=path.ctm,
                mcstack=path.mcstack,
                raw_segments=segs,
                stroke=path.stroke,
                fill=path.fill,
                evenodd=path.evenodd,
            )
            segs = []
        segs.append(seg)
    if segs:
        yield PathObject(
            _pageref=path._pageref,
            _parentkey=path._parentkey,
            gstate=path.gstate,
            ctm=path.ctm,
            mcstack=path.mcstack,
            raw_segments=segs,
            stroke=path.stroke,
            fill=path.fill,
            evenodd=path.evenodd,
        )
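
Stripped of the `PathObject` bookkeeping, the splitting rule is: every `m` (moveto) operator after the first starts a new subpath. A toy sketch with segments as hypothetical `(operator, points)` tuples follows; `split_subpaths` is an illustrative name, not a PLAYA function.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical segment: (operator, flat tuple of coordinates)
Segment = Tuple[str, Tuple[float, ...]]


def split_subpaths(segments: Iterable[Segment]) -> Iterator[List[Segment]]:
    """Group segments into subpaths, starting a new group at each moveto."""
    segs: List[Segment] = []
    for seg in segments:
        if seg[0] == "m" and segs:
            yield segs
            segs = []
        segs.append(seg)
    if segs:
        yield segs


path: List[Segment] = [
    ("m", (0, 0)),
    ("l", (10, 0)),
    ("h", ()),
    ("m", (20, 20)),
    ("l", (30, 30)),
]
parts = list(split_subpaths(path))
print([[op for op, _ in part] for part in parts])
```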

`uniq(objs)`

Eliminates duplicated elements.

Source code in playa/miner.py

def uniq(objs: Iterable[_T]) -> Iterator[_T]:
    """Eliminates duplicated elements."""
    # Duplicated here means the same object (this horrible code was
    # horribly written without any notion of hashable or non-hashable
    # types, SMH)
    done: Set[int] = set()
    for obj in objs:
        if id(obj) in done:
            continue
        done.add(id(obj))
        yield obj
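
Because `uniq` keys on `id()`, it removes repeated references to the same object but keeps distinct objects that merely compare equal. The example below copies the function body from the source above so that it runs standalone; the sample lists are arbitrary.

```python
from typing import Iterable, Iterator, Set, TypeVar

_T = TypeVar("_T")


def uniq(objs: Iterable[_T]) -> Iterator[_T]:
    """Eliminates duplicated elements (by identity, not equality)."""
    done: Set[int] = set()
    for obj in objs:
        if id(obj) in done:
            continue
        done.add(id(obj))
        yield obj


a = [1]
b = [1]  # equal to a, but a different object

result = list(uniq([a, b, a]))
print(len(result))
```

This also means it works on unhashable items such as lists, which `set()` would reject.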
