Working in the PDF mine

pdfminer.six is widely used for text extraction and layout analysis due to its liberal licensing terms. Unfortunately it is quite slow and contains many bugs. Now you can use PLAYA instead:

from playa.miner import extract, LAParams

laparams = LAParams()
# `path` is the path to your PDF file
for page in extract(path, laparams):
    ...  # do something with each page

This is generally faster than pdfminer.six. You can often make it even faster on large documents by running in parallel with the max_workers argument, which has the same meaning as the one in concurrent.futures.ProcessPoolExecutor. If you pass None it will use all your CPUs, but due to some unavoidable overhead, it usually doesn't help to use more than 2-4 workers:

for page in extract(path, laparams, max_workers=2):
    ...  # do something with each page
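
As a rough illustration of the advice above, a worker count could be capped before it is passed as max_workers. The helper name `pick_workers` and the default cap of 4 are assumptions made for this sketch; they are not part of PLAYA:

```python
import os


def pick_workers(cap: int = 4) -> int:
    """Return a worker count of at least 1, bounded by `cap` and the CPU count."""
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(cap, cpus))


# The result could then be used as extract(path, laparams, max_workers=...)
print(pick_workers())
```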

There are a few differences with pdfminer.six (some might call them bug fixes):

  • By default, if you do not pass the laparams argument to extract, no layout analysis at all is done. This is different from extract_pages in pdfminer.six which will set some default parameters for you. If you don't see any LTTextBox items in your LTPage then this is why!
  • Rectangles are recognized correctly in some cases where pdfminer.six thought they were "curves".
  • Colours and colour spaces are the PLAYA versions, which do not correspond to what pdfminer.six gives you, because what pdfminer.six gives you is not useful and often wrong.
  • You have access to the list of enclosing marked content sections in every LTComponent, as the mcstack attribute.
  • Bounding boxes of rotated glyphs are the actual bounding box.

Probably more... but you didn't use any of that stuff anyway, you just wanted to get LTTextBoxes to feed to your hallucination factories.
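
To make the point about rotated glyphs concrete, the sketch below computes the axis-aligned bounding box of a rectangle after applying a PDF-style matrix (a, b, c, d, e, f), by transforming all four corners. The function names are assumptions made for illustration, and this is not PLAYA's own code:

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]
Rect = Tuple[float, float, float, float]


def apply_matrix(m: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Apply a PDF transformation matrix to a point."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def bbox_after_transform(m: Matrix, rect: Rect) -> Rect:
    """Axis-aligned bounding box of `rect` after transforming its corners."""
    x0, y0, x1, y1 = rect
    corners = [apply_matrix(m, x, y) for x in (x0, x1) for y in (y0, y1)]
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# A box 2 units wide and 1 unit tall, rotated 90 degrees counterclockwise.
rot90: Matrix = (0, 1, -1, 0, 0, 0)
print(bbox_after_transform(rot90, (0, 0, 2, 1)))
```

After the rotation the box is 1 unit wide and 2 units tall, which is what an "actual" bounding box should report.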

Reference

playa.miner

Reimplementation of pdfminer.six layout analysis on top of PLAYA.

GraphicState dataclass

PDF graphics state (PDF 1.7 section 8.4) including text state (PDF 1.7 section 9.3.1), but excluding coordinate transformations.

Contrary to the pretensions of pdfminer.six, the text state is for the most part not at all separate from the graphics state, and can be updated outside the confines of BT and ET operators, thus there is no advantage and only confusion that comes from treating it separately.

The only state that does not persist outside BT / ET pairs is the text coordinate space (line matrix and text rendering matrix), and it is also the only part that is updated during iteration over a TextObject.

For historical reasons the main coordinate transformation matrix, though it is also part of the graphics state, is also stored separately.

Attributes:

  • clipping_path (None): The current clipping path (sec. 8.5.4)
  • linewidth (float): Line width in user space units (sec. 8.4.3.2)
  • linecap (int): Line cap style (sec. 8.4.3.3)
  • linejoin (int): Line join style (sec. 8.4.3.4)
  • miterlimit (float): Maximum length of mitered line joins (sec. 8.4.3.5)
  • dash (DashPattern): Dash pattern for stroking (sec. 8.4.3.6)
  • intent (PSLiteral): Rendering intent (sec. 8.6.5.8)
  • stroke_adjustment (bool): A flag specifying whether to compensate for possible rasterization effects when stroking a path with a line width that is small relative to the pixel resolution of the output device (sec. 10.7.5)
  • blend_mode (Union[PSLiteral, List[PSLiteral]]): The current blend mode that shall be used in the transparent imaging model (sec. 11.3.5)
  • smask (Union[None, Dict[str, PDFObject]]): A soft-mask dictionary (sec. 11.6.5.1) or None
  • salpha (float): The constant shape or constant opacity value used for stroking operations (sec. 11.3.7.2 & 11.6.4.4)
  • nalpha (float): The constant shape or constant opacity value used for non-stroking operations
  • alpha_source (bool): A flag specifying whether the current soft mask and alpha constant parameters shall be interpreted as shape values (true) or opacity values (false). This flag also governs the interpretation of the SMask entry, if any, in an image dictionary
  • black_pt_comp (PSLiteral): The black point compensation algorithm that shall be used when converting CIE-based colours (sec. 8.6.5.9)
  • flatness (float): The precision with which curves shall be rendered on the output device (sec. 10.6.2)
  • scolor (Color): Colour used for stroking operations
  • scs (ColorSpace): Colour space used for stroking operations
  • ncolor (Color): Colour used for non-stroking operations
  • ncs (ColorSpace): Colour space used for non-stroking operations
  • font (Union[Font, None]): The current font.
  • fontsize (float): The "font size" parameter, which is not the font size in points as you might understand it, but rather a scaling factor applied to text space (so, it affects not only text size but position as well). Since most reasonable people find that behaviour rather confusing, this is often just 1.0, and PDFs rely on the text matrix to set the size of text.
  • charspace (float): Extra spacing to add after each glyph, expressed in unscaled text space units, meaning it is not affected by fontsize. BUT it will be modified by scaling for horizontal writing mode (so, most of the time).
  • wordspace (float): Extra spacing to add after a space glyph, defined very specifically as the glyph encoded by the single-byte character code 32 (SPOILER: it is probably a space). Also expressed in unscaled text space units, but modified by scaling.
  • scaling (float): The horizontal scaling as a percentage, as set by the Tz operator (default 100). Divide by 100 to obtain the scaling factor defined by the PDF standard.
  • leading (float): The leading as defined by the PDF standard, in unscaled text space units.
  • render_mode (int): The PDF rendering mode. The really important one here is 3, which means "don't render the text". You might want to use this to detect invisible text.
  • rise (float): The text rise (superscript or subscript position), in unscaled text space units.
  • knockout (bool): The text knockout flag, which determines the behaviour of overlapping glyphs within a text object in the transparent imaging model (sec. 9.3.8)
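
The interaction between fontsize, charspace, wordspace and scaling is easier to see with numbers. The sketch below follows the horizontal displacement formula from PDF 1.7 section 9.4.4, ignoring TJ position adjustments. It assumes that scaling holds a percentage (the default in the source code is 100); the function name and arguments are made up for this example:

```python
def horizontal_advance(
    glyph_width: float,  # glyph width in text space units (font units / 1000)
    fontsize: float,
    charspace: float,
    wordspace: float,
    scaling: float,  # percentage, e.g. 100 for no horizontal scaling
    is_space: bool = False,
) -> float:
    """Horizontal displacement after showing one glyph."""
    th = scaling / 100
    # Word spacing only applies to the single-byte character code 32.
    tw = wordspace if is_space else 0.0
    # fontsize scales the glyph width but not the extra spacing;
    # horizontal scaling applies to everything.
    return (glyph_width * fontsize + charspace + tw) * th


# A 0.5-wide glyph at fontsize 12 with charspace 1, scaled to 50%:
# (0.5 * 12 + 1) * 0.5
print(horizontal_advance(0.5, 12, 1, 0, 50))
```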

Source code in playa/content.py
@dataclass
class GraphicState:
    """PDF graphics state (PDF 1.7 section 8.4) including text state
    (PDF 1.7 section 9.3.1), but excluding coordinate transformations.

    Contrary to the pretensions of pdfminer.six, the text state is for
    the most part not at all separate from the graphics state, and can
    be updated outside the confines of `BT` and `ET` operators, thus
    there is no advantage and only confusion that comes from treating
    it separately.

    The only state that does not persist outside `BT` / `ET` pairs is
    the text coordinate space (line matrix and text rendering matrix),
    and it is also the only part that is updated during iteration over
    a `TextObject`.

    For historical reasons the main coordinate transformation matrix,
    though it is also part of the graphics state, is also stored
    separately.

    Attributes:
      clipping_path: The current clipping path (sec. 8.5.4)
      linewidth: Line width in user space units (sec. 8.4.3.2)
      linecap: Line cap style (sec. 8.4.3.3)
      linejoin: Line join style (sec. 8.4.3.4)
      miterlimit: Maximum length of mitered line joins (sec. 8.4.3.5)
      dash: Dash pattern for stroking (sec. 8.4.3.6)
      intent: Rendering intent (sec. 8.6.5.8)
      stroke_adjustment: A flag specifying whether to compensate for
        possible rasterization effects when stroking a path with a line
        width that is small relative to the pixel resolution of the output
        device (sec. 10.7.5)
      blend_mode: The current blend mode that shall be used in the
        transparent imaging model (sec. 11.3.5)
      smask: A soft-mask dictionary (sec. 11.6.5.1) or None
      salpha: The constant shape or constant opacity value used for
        stroking operations (sec. 11.3.7.2 & 11.6.4.4)
      nalpha: The constant shape or constant opacity value used for
        non-stroking operations
      alpha_source: A flag specifying whether the current soft mask and
        alpha constant parameters shall be interpreted as shape values
        (true) or opacity values (false). This flag also governs the
        interpretation of the SMask entry, if any, in an image dictionary
      black_pt_comp: The black point compensation algorithm that shall be
        used when converting CIE-based colours (sec. 8.6.5.9)
      flatness: The precision with which curves shall be rendered on
        the output device (sec. 10.6.2)
      scolor: Colour used for stroking operations
      scs: Colour space used for stroking operations
      ncolor: Colour used for non-stroking operations
      ncs: Colour space used for non-stroking operations
      font: The current font.
      fontsize: The "font size" parameter, which is **not** the font
        size in points as you might understand it, but rather a
        scaling factor applied to text space (so, it affects not only
        text size but position as well).  Since most reasonable people
        find that behaviour rather confusing, this is often just 1.0,
        and PDFs rely on the text matrix to set the size of text.
      charspace: Extra spacing to add after each glyph, expressed in
        unscaled text space units, meaning it is not affected by
        `fontsize`.  **BUT** it will be modified by `scaling` for
        horizontal writing mode (so, most of the time).
      wordspace: Extra spacing to add after a space glyph, defined
        very specifically as the glyph encoded by the single-byte
        character code 32 (SPOILER: it is probably a space).  Also
        expressed in unscaled text space units, but modified by
        `scaling`.
      scaling: The horizontal scaling as a percentage, as set by the
        `Tz` operator (default 100).  Divide by 100 to obtain the
        scaling factor defined by the PDF standard.
      leading: The leading as defined by the PDF standard, in unscaled
        text space units.
      render_mode: The PDF rendering mode.  The really important one
        here is 3, which means "don't render the text".  You might
        want to use this to detect invisible text.
      rise: The text rise (superscript or subscript position), in
        unscaled text space units.
      knockout: The text knockout flag, which determines the behaviour of
        overlapping glyphs within a text object in the transparent imaging
        model (sec. 9.3.8)

    """

    clipping_path: None = None  # TODO
    linewidth: float = 1
    linecap: int = 0
    linejoin: int = 0
    miterlimit: float = 10
    dash: DashPattern = SOLID_LINE
    intent: PSLiteral = LITERAL_RELATIVE_COLORIMETRIC
    stroke_adjustment: bool = False
    blend_mode: Union[PSLiteral, List[PSLiteral]] = LITERAL_NORMAL
    smask: Union[None, Dict[str, PDFObject]] = None
    salpha: float = 1
    nalpha: float = 1
    alpha_source: bool = False
    black_pt_comp: PSLiteral = LITERAL_DEFAULT
    flatness: float = 1
    scolor: Color = BASIC_BLACK
    scs: ColorSpace = PREDEFINED_COLORSPACE["DeviceGray"]
    ncolor: Color = BASIC_BLACK
    ncs: ColorSpace = PREDEFINED_COLORSPACE["DeviceGray"]
    font: Union[Font, None] = None
    fontsize: float = 0
    charspace: float = 0
    wordspace: float = 0
    scaling: float = 100
    leading: float = 0
    render_mode: int = 0
    rise: float = 0
    knockout: bool = True
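
As the description of render_mode suggests, invisible text can be filtered out by checking for mode 3. The sketch below uses a hypothetical stand-in class, `FakeChar`, rather than real PLAYA objects, so it only illustrates the idea:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FakeChar:
    """Hypothetical stand-in for an object carrying text and a render mode."""

    text: str
    render_mode: int = 0


def visible_text(chars: Iterable[FakeChar]) -> str:
    """Join the text of all characters except those with render mode 3."""
    return "".join(c.text for c in chars if c.render_mode != 3)


chars = [FakeChar("H"), FakeChar("i"), FakeChar("!", render_mode=3)]
print(visible_text(chars))
```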

LAParams

Parameters for layout analysis

Parameters:

  • line_overlap (float, default 0.5): If two characters have more overlap than this they are considered to be on the same line. The overlap is specified relative to the minimum height of both characters.
  • char_margin (float, default 2.0): If two characters are closer together than this margin they are considered part of the same line. The margin is specified relative to the width of the character.
  • word_margin (float, default 0.1): If two characters on the same line are further apart than this margin then they are considered to be two separate words, and an intermediate space will be added for readability. The margin is specified relative to the width of the character.
  • line_margin (float, default 0.5): If two lines are closer together than this margin they are considered to be part of the same paragraph. The margin is specified relative to the height of a line.
  • boxes_flow (Optional[float], default 0.5): Specifies how much the horizontal and vertical position of a text matters when determining the order of text boxes. The value should be within the range of -1.0 (only horizontal position matters) to +1.0 (only vertical position matters). You can also pass None to disable advanced layout analysis, and instead return text based on the position of the bottom left corner of the text box.
  • detect_vertical (bool, default False): If vertical text should be considered during layout analysis.
  • all_texts (bool, default False): If layout analysis should be performed on text in figures.
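
To show how line_overlap and char_margin work together, here is a standalone sketch of the horizontal-alignment test that appears in group_objects in the source code of playa/miner.py. It operates on plain (x0, y0, x1, y1) tuples, and the function name `on_same_line` is an assumption made for this example:

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def on_same_line(
    r0: Rect, r1: Rect, line_overlap: float = 0.5, char_margin: float = 2.0
) -> bool:
    """Standalone version of the horizontal-alignment test for two boxes."""
    ax0, ay0, ax1, ay1 = r0
    bx0, by0, bx1, by1 = r1
    # The boxes must overlap vertically at all.
    if not (by0 <= ay1 and ay0 <= by1):
        return False
    voverlap = min(abs(ay0 - by1), abs(ay1 - by0))
    # Horizontal gap between the boxes (0 if they overlap horizontally).
    if bx0 <= ax1 and ax0 <= bx1:
        hdistance = 0.0
    else:
        hdistance = min(abs(ax0 - bx1), abs(ax1 - bx0))
    min_height = min(ay1 - ay0, by1 - by0)
    max_width = max(ax1 - ax0, bx1 - bx0)
    return (
        min_height * line_overlap < voverlap
        and hdistance < max_width * char_margin
    )


first = (0, 0, 5, 10)
print(on_same_line(first, (6, 1, 11, 11)))   # close neighbour, mostly overlapping
print(on_same_line(first, (20, 0, 25, 10)))  # gap of 15 exceeds 5 * char_margin
print(on_same_line(first, (6, 8, 11, 18)))   # too little vertical overlap
```
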
Source code in playa/miner.py
class LAParams:
    """Parameters for layout analysis

    :param line_overlap: If two characters have more overlap than this they
        are considered to be on the same line. The overlap is specified
        relative to the minimum height of both characters.
    :param char_margin: If two characters are closer together than this
        margin they are considered part of the same line. The margin is
        specified relative to the width of the character.
    :param word_margin: If two characters on the same line are further apart
        than this margin then they are considered to be two separate words, and
        an intermediate space will be added for readability. The margin is
        specified relative to the width of the character.
    :param line_margin: If two lines are closer together than this margin
        they are considered to be part of the same paragraph. The margin is
        specified relative to the height of a line.
    :param boxes_flow: Specifies how much a horizontal and vertical position
        of a text matters when determining the order of text boxes. The value
        should be within the range of -1.0 (only horizontal position
        matters) to +1.0 (only vertical position matters). You can also pass
        `None` to disable advanced layout analysis, and instead return text
        based on the position of the bottom left corner of the text box.
    :param detect_vertical: If vertical text should be considered during
        layout analysis
    :param all_texts: If layout analysis should be performed on text in
        figures.
    """

    def __init__(
        self,
        line_overlap: float = 0.5,
        char_margin: float = 2.0,
        line_margin: float = 0.5,
        word_margin: float = 0.1,
        boxes_flow: Optional[float] = 0.5,
        detect_vertical: bool = False,
        all_texts: bool = False,
    ) -> None:
        self.line_overlap = line_overlap
        self.char_margin = char_margin
        self.line_margin = line_margin
        self.word_margin = word_margin
        self.boxes_flow = boxes_flow
        self.detect_vertical = detect_vertical
        self.all_texts = all_texts

        self._validate()

    def _validate(self) -> None:
        if self.boxes_flow is not None:
            boxes_flow_err_msg = (
                "LAParam boxes_flow should be None, or a number between -1 and +1"
            )
            if not isinstance(self.boxes_flow, (int, float)):
                raise PDFTypeError(boxes_flow_err_msg)
            if not -1 <= self.boxes_flow <= 1:
                raise PDFValueError(boxes_flow_err_msg)

    def __repr__(self) -> str:
        return (
            "<LAParams: char_margin=%.1f, line_margin=%.1f, "
            "word_margin=%.1f all_texts=%r>"
            % (self.char_margin, self.line_margin, self.word_margin, self.all_texts)
        )

LTAnno

Bases: LTItem, LTText

Actual letter in the text as a Unicode string.

Note that, while an LTChar object has actual boundaries, LTAnno objects do not, as these are "virtual" characters, inserted by a layout analyzer according to the relationship between two characters (e.g. a space).

Source code in playa/miner.py
class LTAnno(LTItem, LTText):
    """Actual letter in the text as a Unicode string.

    Note that, while an LTChar object has actual boundaries, LTAnno objects do
    not, as these are "virtual" characters, inserted by a layout analyzer
    according to the relationship between two characters (e.g. a space).
    """

    def __init__(self, text: Union[str, None] = None) -> None:
        if text is None:
            # No initialization, for pickling purposes
            return
        self._text = text

    def get_text(self) -> str:
        return self._text

LTChar

Bases: LTComponent, LTText

Actual letter in the text as a Unicode string.

Source code in playa/miner.py
class LTChar(LTComponent, LTText):
    """Actual letter in the text as a Unicode string."""

    def __init__(
        self,
        glyph: Union[GlyphObject, None] = None,
    ) -> None:
        super().__init__()
        if glyph is None:
            # No initialization, for pickling purposes
            return
        gstate = glyph.gstate
        matrix = glyph.matrix
        font = glyph.font
        if glyph.text is None:
            logger.debug("undefined: %r, %r", font, glyph.cid)
            # Horrible awful pdfminer.six behaviour
            self._text = "(cid:%d)" % glyph.cid
        else:
            self._text = glyph.text
        self.mcstack = glyph.mcstack
        self.fontname = font.fontname
        self.graphicstate = gstate
        self.render_mode = gstate.render_mode
        self.stroking_color = gstate.scolor
        self.non_stroking_color = gstate.ncolor
        self.scs = gstate.scs
        self.ncs = gstate.ncs
        scaling = gstate.scaling * 0.01
        fontsize = gstate.fontsize
        (a, b, c, d, e, f) = matrix
        # FIXME: Still really not sure what this means
        self.upright = a * d * scaling > 0 and b * c <= 0
        # Unscale the matrix to match pdfminer.six
        xscale = 1 / (fontsize * scaling)
        yscale = 1 / fontsize
        self.matrix = (a * xscale, b * yscale, c * xscale, d * yscale, e, f)
        # Recreate pdfminer.six's bogus bboxes
        if font.vertical:
            vdisp = font.vdisp(glyph.cid)
            self.adv = vdisp * fontsize
            vx, vy = font.position(glyph.cid)
            textbox = (-vx, vy + vdisp, -vx + 1, vy)
        else:
            textwidth = font.hdisp(glyph.cid)
            self.adv = textwidth * fontsize * scaling
            descent = font.descent * font.matrix[3]
            textbox = (0, descent, textwidth, descent + 1)
        miner_box = transform_bbox(matrix, textbox)
        super().__init__(miner_box, glyph.mcstack)
        # FIXME: This is quite wrong for rotated glyphs, but so is pdfminer.six
        if font.vertical:
            self.size = self.width
        else:
            self.size = self.height

    def __repr__(self) -> str:
        return (
            f"<{self.__class__.__name__} {bbox2str(self.bbox)} "
            f"matrix={matrix2str(self.matrix)} font={self.fontname!r} "
            f"adv={self.adv} text={self.get_text()!r}>"
        )

    def get_text(self) -> str:
        return self._text
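
The upright flag computed in the constructor above depends only on the signs of a few matrix entries. The sketch below evaluates the same expression on some sample matrices to show what it reports; `is_upright` is a name made up for this example, and the interpretation in the comments is only a reading of the expression, not a statement of intent:

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]


def is_upright(matrix: Matrix, scaling_percent: float = 100.0) -> bool:
    """Evaluate the same expression that LTChar uses for `upright`."""
    a, b, c, d, _e, _f = matrix
    scaling = scaling_percent * 0.01
    return a * d * scaling > 0 and b * c <= 0


print(is_upright((12, 0, 0, 12, 0, 0)))   # plain scaling
print(is_upright((0, 12, -12, 0, 0, 0)))  # rotated by 90 degrees
print(is_upright((-12, 0, 0, 12, 0, 0)))  # mirrored horizontally
```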

LTComponent

Bases: LTItem

Object with a bounding box

Source code in playa/miner.py
class LTComponent(LTItem):
    """Object with a bounding box"""

    def __init__(
        self, bbox: Union[Rect, None] = None, mcstack: Tuple[MarkedContent, ...] = ()
    ) -> None:
        if bbox is None:
            # No initialization, for pickling purposes (see
            # https://mypyc.readthedocs.io/en/latest/differences_from_python.html#pickling-and-copying-objects)
            return
        self.set_bbox(bbox)
        self.mcstack = mcstack

    def __repr__(self) -> str:
        return f"<{self.__class__.__name__} {bbox2str(self.bbox)}>"

    def set_bbox(self, bbox: Rect) -> None:
        (x0, y0, x1, y1) = bbox
        self.x0 = x0
        self.y0 = y0
        self.x1 = x1
        self.y1 = y1
        self.width = x1 - x0
        self.height = y1 - y0
        self.bbox = bbox

    def is_empty(self) -> bool:
        return self.width <= 0 or self.height <= 0

    def is_hoverlap(self, obj: "LTComponent") -> bool:
        return obj.x0 <= self.x1 and self.x0 <= obj.x1

    def hdistance(self, obj: "LTComponent") -> float:
        if self.is_hoverlap(obj):
            return 0
        else:
            return min(abs(self.x0 - obj.x1), abs(self.x1 - obj.x0))

    def hoverlap(self, obj: "LTComponent") -> float:
        if self.is_hoverlap(obj):
            return min(abs(self.x0 - obj.x1), abs(self.x1 - obj.x0))
        else:
            return 0

    def is_voverlap(self, obj: "LTComponent") -> bool:
        return obj.y0 <= self.y1 and self.y0 <= obj.y1

    def vdistance(self, obj: "LTComponent") -> float:
        if self.is_voverlap(obj):
            return 0
        else:
            return min(abs(self.y0 - obj.y1), abs(self.y1 - obj.y0))

    def voverlap(self, obj: "LTComponent") -> float:
        if self.is_voverlap(obj):
            return min(abs(self.y0 - obj.y1), abs(self.y1 - obj.y0))
        else:
            return 0
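
The horizontal overlap and distance methods above can be tried out without PLAYA. The `Box` class below is a minimal stand-in written for this example; it copies only the three horizontal methods and is not the real LTComponent:

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Minimal stand-in with the same horizontal overlap logic as above."""

    x0: float
    y0: float
    x1: float
    y1: float

    def is_hoverlap(self, obj: "Box") -> bool:
        return obj.x0 <= self.x1 and self.x0 <= obj.x1

    def hdistance(self, obj: "Box") -> float:
        if self.is_hoverlap(obj):
            return 0
        return min(abs(self.x0 - obj.x1), abs(self.x1 - obj.x0))

    def hoverlap(self, obj: "Box") -> float:
        if self.is_hoverlap(obj):
            return min(abs(self.x0 - obj.x1), abs(self.x1 - obj.x0))
        return 0


a = Box(0, 0, 10, 10)
b = Box(6, 0, 20, 10)   # overlaps a on the right
c = Box(15, 0, 20, 10)  # separated from a by a gap
print(a.hoverlap(b), a.hdistance(b))
print(a.hoverlap(c), a.hdistance(c))
```

Note that when one box lies entirely inside another, this formula returns the smaller edge-to-edge distance rather than the width of the intersection: for a and Box(2, 0, 4, 10) it gives 4, although the intersection is only 2 units wide.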

LTContainer

Bases: LTComponent, Generic[LTItemT]

Object that can be extended and analyzed

Source code in playa/miner.py
class LTContainer(LTComponent, Generic[LTItemT]):
    """Object that can be extended and analyzed"""

    def __init__(
        self, bbox: Union[Rect, None] = None, mcstack: Tuple[MarkedContent, ...] = ()
    ) -> None:
        if bbox is None:
            # No initialization, for pickling purposes
            return
        super().__init__(bbox, mcstack)
        self._objs: List[LTItemT] = []

    def __iter__(self) -> Iterator[LTItemT]:
        return iter(self._objs)

    def __len__(self) -> int:
        return len(self._objs)

    def add(self, obj: LTItemT) -> None:
        self._objs.append(obj)

    def extend(self, objs: Iterable[LTItemT]) -> None:
        for obj in objs:
            self.add(obj)

    def analyze(self, laparams: LAParams) -> None:
        for obj in self._objs:
            obj.analyze(laparams)

LTCurve

Bases: LTComponent

A generic Bezier curve

The parameter original_path contains the original pathing information from the pdf (e.g. for reconstructing Bezier Curves).

dashing_style contains the Dashing information if any.

Source code in playa/miner.py
class LTCurve(LTComponent):
    """A generic Bezier curve

    The parameter `original_path` contains the original
    pathing information from the pdf (e.g. for reconstructing Bezier Curves).

    `dashing_style` contains the Dashing information if any.
    """

    def __init__(
        self,
        path: Union[PathObject, None] = None,
        pts: List[Point] = [],  # These are actually immutable so not a problem
        transformed_path: List[PathSegment] = [],
    ) -> None:
        if path is None:
            # No initialization, for pickling purposes
            return
        super().__init__(get_bound(pts), path.mcstack)
        self.pts = pts
        self.linewidth = path.gstate.linewidth
        self.stroke = path.stroke
        self.fill = path.fill
        self.evenodd = path.evenodd
        gstate = path.gstate
        self.graphicstate = gstate
        self.stroking_color = gstate.scolor
        self.non_stroking_color = gstate.ncolor
        self.scs = gstate.scs
        self.ncs = gstate.ncs
        self.original_path = transformed_path
        self.dashing_style = gstate.dash

    def get_pts(self) -> str:
        return ",".join("%.3f,%.3f" % p for p in self.pts)

LTFigure

Bases: LTLayoutContainer

Represents an area used by PDF Form objects.

PDF Forms can be used to present figures or pictures by embedding yet another PDF document within a page. Note that LTFigure objects can appear recursively.

Source code in playa/miner.py
class LTFigure(LTLayoutContainer):
    """Represents an area used by PDF Form objects.

    PDF Forms can be used to present figures or pictures by embedding yet
    another PDF document within a page. Note that LTFigure objects can appear
    recursively.
    """

    def __init__(self, obj: Union[ImageObject, XObjectObject, None] = None) -> None:
        if obj is None:
            # No initialization, for pickling purposes
            return
        if obj.xobjid is None:
            self.name = str(id(obj))
        else:
            self.name = obj.xobjid
        self.matrix = obj.ctm
        super().__init__(obj.bbox, obj.mcstack)

    def __repr__(self) -> str:
        return (
            f"<{self.__class__.__name__}({self.name}) "
            f"{bbox2str(self.bbox)} matrix={matrix2str(self.matrix)}>"
        )

    def analyze(self, laparams: LAParams) -> None:
        if not laparams.all_texts:
            return
        LTLayoutContainer.analyze(self, laparams)

LTImage

Bases: LTComponent

An image object.

Embedded images can be in JPEG, Bitmap or JBIG2.

Source code in playa/miner.py
class LTImage(LTComponent):
    """An image object.

    Embedded images can be in JPEG, Bitmap or JBIG2.
    """

    def __init__(self, obj: Union[ImageObject, None] = None) -> None:
        if obj is None:
            # No initialization, for pickling purposes
            return
        super().__init__(obj.bbox, obj.mcstack)
        # Inline images don't actually have an xobjid, so we make shit
        # up like pdfminer.six does.
        if obj.xobjid is None:
            self.name = str(id(obj))
        else:
            self.name = obj.xobjid
        self.stream = obj.stream
        self.srcsize = obj.srcsize
        self.imagemask = obj.imagemask
        self.bits = obj.bits
        self.colorspace = obj.colorspace

    def __repr__(self) -> str:
        return (
            f"<{self.__class__.__name__}({self.name})"
            f" {bbox2str(self.bbox)} {self.srcsize!r}>"
        )

LTItem

Interface for things that can be analyzed

Source code in playa/miner.py
@trait
class LTItem:
    """Interface for things that can be analyzed"""

    def analyze(self, laparams: LAParams) -> None:
        """Perform the layout analysis."""

analyze(laparams)

Perform the layout analysis.

Source code in playa/miner.py
def analyze(self, laparams: LAParams) -> None:
    """Perform the layout analysis."""

LTLayoutContainer

Bases: LTContainer[LTComponent]

Source code in playa/miner.py
class LTLayoutContainer(LTContainer[LTComponent]):
    def __init__(
        self, bbox: Union[Rect, None] = None, mcstack: Tuple[MarkedContent, ...] = ()
    ) -> None:
        if bbox is None:
            # No initialization, for pickling purposes
            return
        super().__init__(bbox, mcstack)
        self.groups: Optional[List[LTTextGroup]] = None

    # group_objects: group text object to textlines.
    def group_objects(
        self,
        laparams: LAParams,
        objs: Iterable[LTComponent],
    ) -> Iterator[LTTextLine]:
        obj0: Any = None
        line: Any = None
        for obj1 in objs:
            if obj0 is not None:
                # halign: obj0 and obj1 are horizontally aligned.
                #
                #   +------+ - - -
                #   | obj0 | - - +------+   -
                #   |      |     | obj1 |   | (line_overlap)
                #   +------+ - - |      |   -
                #          - - - +------+
                #
                #          |<--->|
                #        (char_margin)
                halign = (
                    obj0.is_voverlap(obj1)
                    and min(obj0.height, obj1.height) * laparams.line_overlap
                    < obj0.voverlap(obj1)
                    and obj0.hdistance(obj1)
                    < max(obj0.width, obj1.width) * laparams.char_margin
                )

                # valign: obj0 and obj1 are vertically aligned.
                #
                #   +------+
                #   | obj0 |
                #   |      |
                #   +------+ - - -
                #     |    |     | (char_margin)
                #     +------+ - -
                #     | obj1 |
                #     |      |
                #     +------+
                #
                #     |<-->|
                #   (line_overlap)
                valign = (
                    laparams.detect_vertical
                    and obj0.is_hoverlap(obj1)
                    and min(obj0.width, obj1.width) * laparams.line_overlap
                    < obj0.hoverlap(obj1)
                    and obj0.vdistance(obj1)
                    < max(obj0.height, obj1.height) * laparams.char_margin
                )

                if (halign and isinstance(line, LTTextLineHorizontal)) or (
                    valign and isinstance(line, LTTextLineVertical)
                ):
                    line.add(obj1)
                elif line is not None:
                    yield line
                    line = None
                elif valign and not halign:
                    line = LTTextLineVertical(laparams.word_margin)
                    line.add(obj0)
                    line.add(obj1)
                elif halign and not valign:
                    line = LTTextLineHorizontal(laparams.word_margin)
                    line.add(obj0)
                    line.add(obj1)
                else:
                    line = LTTextLineHorizontal(laparams.word_margin)
                    line.add(obj0)
                    yield line
                    line = None
            obj0 = obj1
        if line is None:
            line = LTTextLineHorizontal(laparams.word_margin)
            assert obj0 is not None
            line.add(obj0)
        yield line

    def group_textlines(
        self,
        laparams: LAParams,
        lines: Iterable[LTTextLine],
    ) -> Iterator[LTTextBox]:
        """Group neighboring lines to textboxes"""
        plane: Plane[LTTextLine] = Plane(self.bbox)
        plane.extend(lines)
        boxes: Dict[int, LTTextBox] = {}
        for line in lines:
            neighbors = line.find_neighbors(plane, laparams.line_margin)
            members = [line]
            for obj1 in neighbors:
                members.append(obj1)
                if id(obj1) in boxes:
                    members.extend(boxes[id(obj1)])
                    del boxes[id(obj1)]
            if isinstance(line, LTTextLineHorizontal):
                box: LTTextBox = LTTextBoxHorizontal()
            else:
                box = LTTextBoxVertical()
            for obj in uniq(members):
                box.add(obj)
                boxes[id(obj)] = box
        done: Set[int] = set()
        for line in lines:
            if id(line) not in boxes:
                continue
            box = boxes[id(line)]
            if id(box) in done:
                continue
            done.add(id(box))
            if not box.is_empty():
                yield box

    def group_textboxes(
        self,
        laparams: LAParams,
        boxes: Sequence[LTTextBox],
    ) -> List[LTTextGroup]:
        """Group textboxes hierarchically.

        Get pair-wise distances, via dist func defined below, and then merge
        from the closest textbox pair. Once obj1 and obj2 are merged /
        grouped, the resulting group is considered as a new object, and its
        distances to other objects & groups are added to the process queue.

        For performance reasons, pair-wise distances and object pair info are
        maintained in a heap of (skip_isany, dist, id(obj1), id(obj2), obj1,
        obj2) tuples. It ensures quick access to the smallest element. Note that
        since comparison operators, e.g., __lt__, are disabled for
        LTComponent, id(obj) has to appear before obj in element tuples.

        :param laparams: LAParams object.
        :param boxes: All textbox objects to be grouped.
        :return: a list that has only one element, the final top level group.
        """
        ElementT = Union[LTTextBox, LTTextGroup]
        plane: Plane[ElementT] = Plane(self.bbox)

        def dist(obj1: LTComponent, obj2: LTComponent) -> float:
            """A distance function between two TextBoxes.

            Consider the bounding rectangle for obj1 and obj2.
            Return its area less the areas of obj1 and obj2,
            shown as 'www' below. This value may be negative.
                    +------+..........+ (x1, y1)
                    | obj1 |wwwwwwwwww:
                    +------+www+------+
                    :wwwwwwwwww| obj2 |
            (x0, y0) +..........+------+
            """
            x0 = min(obj1.x0, obj2.x0)
            y0 = min(obj1.y0, obj2.y0)
            x1 = max(obj1.x1, obj2.x1)
            y1 = max(obj1.y1, obj2.y1)
            return (
                (x1 - x0) * (y1 - y0)
                - obj1.width * obj1.height
                - obj2.width * obj2.height
            )

        def isany(obj1: ElementT, obj2: ElementT) -> bool:
            """Check if there's any other object between obj1 and obj2."""
            x0 = min(obj1.x0, obj2.x0)
            y0 = min(obj1.y0, obj2.y0)
            x1 = max(obj1.x1, obj2.x1)
            y1 = max(obj1.y1, obj2.y1)
            for obj in plane.find((x0, y0, x1, y1)):
                if obj not in (obj1, obj2):
                    break
            else:
                return False
            return True

        # If there's only one box, no grouping need be done, but we
        # should still always return a group!
        if len(boxes) == 1:
            return [LTTextGroup(boxes)]

        dists: List[Tuple[bool, float, int, int, ElementT, ElementT]] = []
        for i in range(len(boxes)):
            box1 = boxes[i]
            for j in range(i + 1, len(boxes)):
                box2 = boxes[j]
                dists.append((False, dist(box1, box2), id(box1), id(box2), box1, box2))
        heapq.heapify(dists)

        plane.extend(boxes)
        done: Set[int] = set()
        while len(dists) > 0:
            (skip_isany, d, id1, id2, obj1, obj2) = heapq.heappop(dists)
            # Skip objects that are already merged
            if (id1 in done) or (id2 in done):
                continue
            if not skip_isany and isany(obj1, obj2):
                heapq.heappush(dists, (True, d, id1, id2, obj1, obj2))
                continue
            if isinstance(obj1, (LTTextBoxVertical, LTTextGroupTBRL)) or isinstance(
                obj2,
                (LTTextBoxVertical, LTTextGroupTBRL),
            ):
                group: LTTextGroup = LTTextGroupTBRL([obj1, obj2])
            else:
                group = LTTextGroupLRTB([obj1, obj2])
            plane.remove(obj1)
            done.add(id1)
            plane.remove(obj2)
            done.add(id2)

            for other in plane:
                heapq.heappush(
                    dists,
                    (False, dist(group, other), id(group), id(other), group, other),
                )
            plane.add(group)
        # The plane should now only contain groups, otherwise it's a bug
        groups: List[LTTextGroup] = []
        for g in plane:
            assert isinstance(g, LTTextGroup)
            groups.append(g)
        return groups

    def analyze(self, laparams: LAParams) -> None:
        # textobjs is a list of LTChar objects, i.e.
        # it has all the individual characters in the page.
        (textobjs, otherobjs) = fsplit(lambda obj: isinstance(obj, LTChar), self)
        for obj in otherobjs:
            obj.analyze(laparams)
        if not textobjs:
            return
        textlines = list(self.group_objects(laparams, textobjs))
        (empties, textlines) = fsplit(lambda obj: obj.is_empty(), textlines)
        for obj in empties:
            obj.analyze(laparams)
        textboxes = list(self.group_textlines(laparams, textlines))
        if laparams.boxes_flow is None:
            for textbox in textboxes:
                textbox.analyze(laparams)

            def getkey(box: LTTextBox) -> Tuple[int, float, float]:
                if isinstance(box, LTTextBoxVertical):
                    return (0, -box.x1, -box.y0)
                else:
                    return (1, -box.y0, box.x0)

            textboxes.sort(key=getkey)
        else:
            self.groups = self.group_textboxes(laparams, textboxes)
            assigner = IndexAssigner()
            for group in self.groups:
                group.analyze(laparams)
                assigner.run(group)
            textboxes.sort(key=lambda box: box.index)
        self._objs = (
            cast(List[LTComponent], textboxes)
            + otherobjs
            + cast(List[LTComponent], empties)
        )
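
The horizontal alignment test used when grouping characters into lines in the listing above can be tried in isolation. The following is a minimal sketch, not PLAYA's code: `Box`, `voverlap`, `hdistance` and `is_halign` are hypothetical helpers, the overlap measure is a plain interval intersection (an assumption), and the threshold values in the usage lines are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for a glyph bounding box (not a PLAYA class)."""

    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def voverlap(a: Box, b: Box) -> float:
    """Length of the shared vertical extent, or 0 if there is none."""
    return max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))


def hdistance(a: Box, b: Box) -> float:
    """Horizontal gap between the boxes, or 0 if they overlap horizontally."""
    return max(0.0, max(a.x0, b.x0) - min(a.x1, b.x1))


def is_halign(a: Box, b: Box, line_overlap: float, char_margin: float) -> bool:
    """True if the boxes share enough height and sit close enough side by side."""
    return (
        min(a.height, b.height) * line_overlap < voverlap(a, b)
        and hdistance(a, b) < max(a.width, b.width) * char_margin
    )


first = Box(0, 0, 10, 12)
same_line = is_halign(first, Box(11, 1, 21, 13), line_overlap=0.5, char_margin=2.0)
too_far = is_halign(first, Box(100, 0, 110, 12), line_overlap=0.5, char_margin=2.0)
```

Here `same_line` is true because the two boxes share most of their height and are one unit apart, while `too_far` is false because the gap exceeds `char_margin` times the wider box.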

group_textboxes(laparams, boxes)

Group textboxes hierarchically.

Get pair-wise distances, via dist func defined below, and then merge from the closest textbox pair. Once obj1 and obj2 are merged / grouped, the resulting group is considered as a new object, and its distances to other objects & groups are added to the process queue.

For performance reasons, pair-wise distances and object pair info are maintained in a heap of (skip_isany, dist, id(obj1), id(obj2), obj1, obj2) tuples, which gives quick access to the smallest element. Because comparison operators such as __lt__ are disabled for LTComponent, id(obj) has to appear before obj in the element tuples.

Parameters:

  • laparams (LAParams, required): LAParams object.
  • boxes (Sequence[LTTextBox], required): All textbox objects to be grouped.

Returns:

  • List[LTTextGroup]: a list that has only one element, the final top level group.
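
Why the ids have to come before the objects in each heap entry can be seen with a small standalone sketch. `Opaque` is a hypothetical class with no ordering, standing in for a layout component; nothing here is PLAYA's own code.

```python
import heapq


class Opaque:
    """Hypothetical object with no ordering defined."""

    def __init__(self, name: str) -> None:
        self.name = name


a, b, c = Opaque("a"), Opaque("b"), Opaque("c")

# Comparing tuples that tie on the distance falls through to the objects,
# which cannot be ordered, so this raises TypeError.
try:
    (5.0, a) < (5.0, b)
    compared_without_ids = True
except TypeError:
    compared_without_ids = False

# With the integer ids placed before the objects, a tie on the flag and the
# distance is settled by the ids and the objects are never compared.
heap = [
    (False, 5.0, id(a), id(b), a, b),
    (False, 5.0, id(a), id(c), a, c),
    (False, 1.0, id(b), id(c), b, c),
]
heapq.heapify(heap)
_, smallest, _, _, first, second = heapq.heappop(heap)
closest_pair = (first.name, second.name)
```

The popped entry is the pair with the smallest distance, and the two entries that tie at 5.0 never trigger a comparison between `Opaque` instances.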

Source code in playa/miner.py
def group_textboxes(
    self,
    laparams: LAParams,
    boxes: Sequence[LTTextBox],
) -> List[LTTextGroup]:
    """Group textboxes hierarchically.

    Get pair-wise distances, via dist func defined below, and then merge
    from the closest textbox pair. Once obj1 and obj2 are merged /
    grouped, the resulting group is considered as a new object, and its
    distances to other objects & groups are added to the process queue.

    For performance reasons, pair-wise distances and object pair info are
    maintained in a heap of (skip_isany, dist, id(obj1), id(obj2), obj1,
    obj2) tuples. It ensures quick access to the smallest element. Note that
    since comparison operators, e.g., __lt__, are disabled for
    LTComponent, id(obj) has to appear before obj in element tuples.

    :param laparams: LAParams object.
    :param boxes: All textbox objects to be grouped.
    :return: a list that has only one element, the final top level group.
    """
    ElementT = Union[LTTextBox, LTTextGroup]
    plane: Plane[ElementT] = Plane(self.bbox)

    def dist(obj1: LTComponent, obj2: LTComponent) -> float:
        """A distance function between two TextBoxes.

        Consider the bounding rectangle for obj1 and obj2.
        Return its area less the areas of obj1 and obj2,
        shown as 'www' below. This value may be negative.
                +------+..........+ (x1, y1)
                | obj1 |wwwwwwwwww:
                +------+www+------+
                :wwwwwwwwww| obj2 |
        (x0, y0) +..........+------+
        """
        x0 = min(obj1.x0, obj2.x0)
        y0 = min(obj1.y0, obj2.y0)
        x1 = max(obj1.x1, obj2.x1)
        y1 = max(obj1.y1, obj2.y1)
        return (
            (x1 - x0) * (y1 - y0)
            - obj1.width * obj1.height
            - obj2.width * obj2.height
        )

    def isany(obj1: ElementT, obj2: ElementT) -> bool:
        """Check if there's any other object between obj1 and obj2."""
        x0 = min(obj1.x0, obj2.x0)
        y0 = min(obj1.y0, obj2.y0)
        x1 = max(obj1.x1, obj2.x1)
        y1 = max(obj1.y1, obj2.y1)
        for obj in plane.find((x0, y0, x1, y1)):
            if obj not in (obj1, obj2):
                break
        else:
            return False
        return True

    # If there's only one box, no grouping need be done, but we
    # should still always return a group!
    if len(boxes) == 1:
        return [LTTextGroup(boxes)]

    dists: List[Tuple[bool, float, int, int, ElementT, ElementT]] = []
    for i in range(len(boxes)):
        box1 = boxes[i]
        for j in range(i + 1, len(boxes)):
            box2 = boxes[j]
            dists.append((False, dist(box1, box2), id(box1), id(box2), box1, box2))
    heapq.heapify(dists)

    plane.extend(boxes)
    done: Set[int] = set()
    while len(dists) > 0:
        (skip_isany, d, id1, id2, obj1, obj2) = heapq.heappop(dists)
        # Skip objects that are already merged
        if (id1 in done) or (id2 in done):
            continue
        if not skip_isany and isany(obj1, obj2):
            heapq.heappush(dists, (True, d, id1, id2, obj1, obj2))
            continue
        if isinstance(obj1, (LTTextBoxVertical, LTTextGroupTBRL)) or isinstance(
            obj2,
            (LTTextBoxVertical, LTTextGroupTBRL),
        ):
            group: LTTextGroup = LTTextGroupTBRL([obj1, obj2])
        else:
            group = LTTextGroupLRTB([obj1, obj2])
        plane.remove(obj1)
        done.add(id1)
        plane.remove(obj2)
        done.add(id2)

        for other in plane:
            heapq.heappush(
                dists,
                (False, dist(group, other), id(group), id(other), group, other),
            )
        plane.add(group)
    # The plane should now only contain groups, otherwise it's a bug
    groups: List[LTTextGroup] = []
    for g in plane:
        assert isinstance(g, LTTextGroup)
        groups.append(g)
    return groups
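
The distance measure described in the `dist` docstring above can be checked by hand on a few rectangles. This is a minimal sketch using a hypothetical `Rect` tuple rather than PLAYA's layout classes; the coordinates are made up for illustration.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Hypothetical axis-aligned rectangle (x0, y0) to (x1, y1)."""

    x0: float
    y0: float
    x1: float
    y1: float


def area(r: Rect) -> float:
    return (r.x1 - r.x0) * (r.y1 - r.y0)


def dist(a: Rect, b: Rect) -> float:
    """Area of the joint bounding rectangle minus the two rectangle areas."""
    x0 = min(a.x0, b.x0)
    y0 = min(a.y0, b.y0)
    x1 = max(a.x1, b.x1)
    y1 = max(a.y1, b.y1)
    return (x1 - x0) * (y1 - y0) - area(a) - area(b)


apart = dist(Rect(0, 0, 2, 2), Rect(3, 3, 5, 5))        # 25 - 4 - 4
touching = dist(Rect(0, 0, 2, 2), Rect(2, 0, 4, 2))     # 8 - 4 - 4
overlapping = dist(Rect(0, 0, 4, 4), Rect(1, 1, 5, 5))  # 25 - 16 - 16
```

The overlapping case comes out negative, which is why the docstring warns that the value may be below zero: the overlap is subtracted twice.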

group_textlines(laparams, lines)

Group neighboring lines to textboxes

Source code in playa/miner.py
def group_textlines(
    self,
    laparams: LAParams,
    lines: Iterable[LTTextLine],
) -> Iterator[LTTextBox]:
    """Group neighboring lines to textboxes"""
    plane: Plane[LTTextLine] = Plane(self.bbox)
    plane.extend(lines)
    boxes: Dict[int, LTTextBox] = {}
    for line in lines:
        neighbors = line.find_neighbors(plane, laparams.line_margin)
        members = [line]
        for obj1 in neighbors:
            members.append(obj1)
            if id(obj1) in boxes:
                members.extend(boxes[id(obj1)])
                del boxes[id(obj1)]
        if isinstance(line, LTTextLineHorizontal):
            box: LTTextBox = LTTextBoxHorizontal()
        else:
            box = LTTextBoxVertical()
        for obj in uniq(members):
            box.add(obj)
            boxes[id(obj)] = box
    done: Set[int] = set()
    for line in lines:
        if id(line) not in boxes:
            continue
        box = boxes[id(line)]
        if id(box) in done:
            continue
        done.add(id(box))
        if not box.is_empty():
            yield box
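
The merging strategy in the listing above (each line pulls in its neighbors plus whatever box those neighbors already belong to) can be sketched without any layout classes. In this sketch, strings stand in for text lines and a plain dictionary stands in for the spatial neighbor search; `merge_neighbors` is a hypothetical name, not part of PLAYA.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Set


def merge_neighbors(
    items: Sequence[str],
    neighbors: Callable[[str], Iterable[str]],
) -> List[List[str]]:
    """Group items so that neighbors, transitively, end up in the same box."""
    box_of: Dict[str, List[str]] = {}
    for item in items:
        members = [item]
        for other in neighbors(item):
            members.append(other)
            if other in box_of:
                # Absorb the box the neighbor was already in.
                members.extend(box_of.pop(other))
        # Remove duplicates while preserving order, then re-point every
        # member at the new box.
        box = list(dict.fromkeys(members))
        for member in box:
            box_of[member] = box
    seen: Set[int] = set()
    result: List[List[str]] = []
    for item in items:
        box = box_of[item]
        if id(box) not in seen:
            seen.add(id(box))
            result.append(box)
    return result


adjacent = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "x": []}
boxes = merge_neighbors(["a", "b", "c", "x"], lambda item: adjacent[item])
```

Although "a" and "c" are not direct neighbors, they end up in the same box because "b" links them, while "x" stays on its own.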

LTLine

Bases: LTCurve

A single straight line.

Could be used for separating text or figures.

Source code in playa/miner.py
class LTLine(LTCurve):
    """A single straight line.

    Could be used for separating text or figures.
    """

    def __init__(
        self,
        path: Union[PathObject, None] = None,
        p0: Point = (0, 0),
        p1: Point = (0, 0),
        transformed_path: List[PathSegment] = [],
    ) -> None:
        if path is None:
            # No initialization, for pickling purposes
            return
        LTCurve.__init__(
            self,
            path,
            [p0, p1],
            transformed_path,
        )

LTPage

Bases: LTLayoutContainer

Represents an entire page.

Like any other LTLayoutContainer, an LTPage can be iterated to obtain child objects like LTTextBox, LTFigure, LTImage, LTRect, LTCurve and LTLine.

Source code in playa/miner.py
class LTPage(LTLayoutContainer):
    """Represents an entire page.

    Like any other LTLayoutContainer, an LTPage can be iterated to obtain child
    objects like LTTextBox, LTFigure, LTImage, LTRect, LTCurve and LTLine.
    """

    def __init__(self, pageid: int, bbox: Rect, rotate: float = 0) -> None:
        super().__init__(bbox, ())
        self.pageid = pageid
        self.rotate = rotate

    def __repr__(self) -> str:
        return (
            f"<{self.__class__.__name__}({self.pageid!r}) "
            f"{bbox2str(self.bbox)} rotate={self.rotate!r}>"
        )

LTRect

Bases: LTCurve

A rectangle.

Could be used for framing other pictures or figures.

Source code in playa/miner.py
class LTRect(LTCurve):
    """A rectangle.

    Could be used for framing other pictures or figures.
    """

    def __init__(
        self,
        path: Union[PathObject, None] = None,
        bbox: Rect = (0, 0, 0, 0),
        transformed_path: List[PathSegment] = [],
    ) -> None:
        if path is None:
            # No initialization, for pickling purposes
            return
        (x0, y0, x1, y1) = bbox
        LTCurve.__init__(
            self,
            path,
            [(x0, y0), (x1, y0), (x1, y1), (x0, y1)],
            transformed_path,
        )

LTText

Interface for things that have text

Source code in playa/miner.py
@trait
class LTText:
    """Interface for things that have text"""

    def __repr__(self) -> str:
        return f"<{self.__class__.__name__} {self.get_text()!r}>"

    def get_text(self) -> str:
        """Text contained in this object"""
        raise NotImplementedError

get_text()

Text contained in this object

Source code in playa/miner.py
def get_text(self) -> str:
    """Text contained in this object"""
    raise NotImplementedError

LTTextBox

Bases: LTTextContainer[LTTextLine]

Represents a group of text chunks in a rectangular area.

Note that this box is created by geometric analysis and does not necessarily represent a logical boundary of the text. It contains a list of LTTextLine objects.

Source code in playa/miner.py
class LTTextBox(LTTextContainer[LTTextLine]):
    """Represents a group of text chunks in a rectangular area.

    Note that this box is created by geometric analysis and does not
    necessarily represent a logical boundary of the text. It contains a list
    of LTTextLine objects.
    """

    def __init__(self) -> None:
        super().__init__()
        self.index: int = -1

    def __repr__(self) -> str:
        return (
            f"<{self.__class__.__name__}({self.index}) "
            f"{bbox2str(self.bbox)} {self.get_text()!r}>"
        )

    def get_writing_mode(self) -> str:
        raise NotImplementedError

LTTextLine

Bases: LTTextContainer[TextLineElement]

Contains a list of LTChar objects that represent a single text line.

The characters are aligned either horizontally or vertically, depending on the text's writing mode.

Source code in playa/miner.py
class LTTextLine(LTTextContainer[TextLineElement]):
    """Contains a list of LTChar objects that represent a single text line.

    The characters are aligned either horizontally or vertically, depending on
    the text's writing mode.
    """

    def __init__(self, word_margin: float = 0.0) -> None:
        super().__init__()
        self.word_margin = word_margin

    def __repr__(self) -> str:
        return f"<{self.__class__.__name__} {bbox2str(self.bbox)} {self.get_text()!r}>"

    def analyze(self, laparams: LAParams) -> None:
        for obj in self._objs:
            obj.analyze(laparams)
        # FIXME: Should probably inherit mcstack somehow
        LTContainer.add(self, LTAnno("\n"))

    def find_neighbors(
        self,
        plane: Plane[LTComponentT],
        ratio: float,
    ) -> List["LTTextLine"]:
        raise NotImplementedError

    def is_empty(self) -> bool:
        return super().is_empty() or self.get_text().isspace()

LTTextLineHorizontal

Bases: LTTextLine

Source code in playa/miner.py
class LTTextLineHorizontal(LTTextLine):
    def __init__(self, word_margin: float = 0.0) -> None:
        super().__init__(word_margin)
        self._x1 = +INF + 0.0

    # Incompatible override: we take an LTComponent (with bounding box), but
    # LTContainer only considers LTItem (no bounding box).
    def add(self, obj: LTComponent) -> None:  # type: ignore[override]
        if isinstance(obj, LTChar) and self.word_margin:
            margin = self.word_margin * max(obj.width, obj.height)
            if self._x1 < obj.x0 - margin:
                # FIXME: Should probably inherit mcstack somehow
                LTContainer.add(self, LTAnno(" "))
        self._x1 = obj.x1
        super().add(obj)

    def find_neighbors(
        self,
        plane: Plane[LTComponentT],
        ratio: float,
    ) -> List[LTTextLine]:
        """Finds neighboring LTTextLineHorizontals in the plane.

        Returns a list of other LTTextLineHorizontals in the plane which are
        close to self. "Close" can be controlled by ratio. The returned objects
        will be the same height as self, and also either left-, right-, or
        centrally-aligned.
        """
        d = ratio * self.height
        objs = plane.find((self.x0, self.y0 - d, self.x1, self.y1 + d))
        return [
            obj
            for obj in objs
            if (
                isinstance(obj, LTTextLineHorizontal)
                and self._is_same_height_as(obj, tolerance=d)
                and (
                    self._is_left_aligned_with(obj, tolerance=d)
                    or self._is_right_aligned_with(obj, tolerance=d)
                    or self._is_centrally_aligned_with(obj, tolerance=d)
                )
            )
        ]

    def _is_left_aligned_with(self, other: LTComponent, tolerance: float = 0.0) -> bool:
        """Whether the left-hand edge of `other` is within `tolerance`."""
        return abs(other.x0 - self.x0) <= tolerance

    def _is_right_aligned_with(
        self, other: LTComponent, tolerance: float = 0.0
    ) -> bool:
        """Whether the right-hand edge of `other` is within `tolerance`."""
        return abs(other.x1 - self.x1) <= tolerance

    def _is_centrally_aligned_with(
        self,
        other: LTComponent,
        tolerance: float = 0,
    ) -> bool:
        """Whether the horizontal center of `other` is within `tolerance`."""
        return abs((other.x0 + other.x1) / 2 - (self.x0 + self.x1) / 2) <= tolerance

    def _is_same_height_as(self, other: LTComponent, tolerance: float = 0) -> bool:
        return abs(other.height - self.height) <= tolerance
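
The space insertion rule in `add` above can be illustrated with plain tuples. This is a sketch under stated assumptions: `join_glyphs` is a hypothetical helper, each glyph is reduced to `(text, x0, x1, height)`, and the coordinates and margin are invented for the example.

```python
from typing import Iterable, List, Tuple

# (text, x0, x1, height): a hypothetical, stripped-down glyph record.
Glyph = Tuple[str, float, float, float]


def join_glyphs(glyphs: Iterable[Glyph], word_margin: float) -> str:
    """Concatenate glyph text, adding a space wherever the gap is wide enough."""
    out: List[str] = []
    prev_x1 = float("inf")  # so the first glyph never gets a leading space
    for text, x0, x1, height in glyphs:
        margin = word_margin * max(x1 - x0, height)
        if word_margin and prev_x1 < x0 - margin:
            out.append(" ")
        prev_x1 = x1
        out.append(text)
    return "".join(out)


glyphs = [
    ("H", 0.0, 5.0, 10.0),
    ("i", 5.2, 7.2, 10.0),
    ("y", 12.0, 17.0, 10.0),
    ("o", 17.1, 22.1, 10.0),
]
with_spaces = join_glyphs(glyphs, word_margin=0.1)
without_spaces = join_glyphs(glyphs, word_margin=0.0)
```

With a margin of 0.1 only the gap before "y" exceeds one tenth of the glyph size, so a single space is inserted there; with a margin of 0 the rule is switched off entirely, as in the listing.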

find_neighbors(plane, ratio)

Finds neighboring LTTextLineHorizontals in the plane.

Returns a list of other LTTextLineHorizontals in the plane which are close to self. "Close" can be controlled by ratio. The returned objects will be the same height as self, and also either left-, right-, or centrally-aligned.

Source code in playa/miner.py
def find_neighbors(
    self,
    plane: Plane[LTComponentT],
    ratio: float,
) -> List[LTTextLine]:
    """Finds neighboring LTTextLineHorizontals in the plane.

    Returns a list of other LTTextLineHorizontals in the plane which are
    close to self. "Close" can be controlled by ratio. The returned objects
    will be the same height as self, and also either left-, right-, or
    centrally-aligned.
    """
    d = ratio * self.height
    objs = plane.find((self.x0, self.y0 - d, self.x1, self.y1 + d))
    return [
        obj
        for obj in objs
        if (
            isinstance(obj, LTTextLineHorizontal)
            and self._is_same_height_as(obj, tolerance=d)
            and (
                self._is_left_aligned_with(obj, tolerance=d)
                or self._is_right_aligned_with(obj, tolerance=d)
                or self._is_centrally_aligned_with(obj, tolerance=d)
            )
        )
    ]
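
To see how `ratio` shapes the result, here is a standalone sketch of the same neighbor test. It is not PLAYA's code: `Line` and `is_neighbor` are hypothetical, and a direct rectangle overlap check stands in for the plane lookup, on the assumption that the lookup returns objects overlapping the search box.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    """Hypothetical horizontal text line bounding box."""

    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def is_neighbor(me: Line, other: Line, ratio: float) -> bool:
    d = ratio * me.height
    # Search box: the line itself, grown by d above and below.
    in_reach = (
        other.x1 > me.x0
        and other.x0 < me.x1
        and other.y1 > me.y0 - d
        and other.y0 < me.y1 + d
    )
    same_height = abs(other.height - me.height) <= d
    aligned = (
        abs(other.x0 - me.x0) <= d
        or abs(other.x1 - me.x1) <= d
        or abs((other.x0 + other.x1) / 2 - (me.x0 + me.x1) / 2) <= d
    )
    return in_reach and same_height and aligned


me = Line(0, 100, 200, 112)  # height 12, so d = 6 when ratio is 0.5
next_line = is_neighbor(me, Line(0, 86, 180, 98), ratio=0.5)
distant_line = is_neighbor(me, Line(0, 50, 180, 62), ratio=0.5)
indented_line = is_neighbor(me, Line(40, 86, 120, 98), ratio=0.5)
taller_line = is_neighbor(me, Line(0, 80, 200, 100), ratio=0.5)
```

Only `next_line` is true: the distant line falls outside the search box, the indented line matches none of the three alignments, and the taller line differs in height by more than the tolerance.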

LTTextLineVertical

Bases: LTTextLine

Source code in playa/miner.py
class LTTextLineVertical(LTTextLine):
    def __init__(self, word_margin: float = 0.0) -> None:
        super().__init__(word_margin)
        self._y0: float = -INF + 0.0

    # Incompatible override: we take an LTComponent (with bounding box), but
    # LTContainer only considers LTItem (no bounding box).
    def add(self, obj: LTComponent) -> None:  # type: ignore[override]
        if isinstance(obj, LTChar) and self.word_margin:
            margin = self.word_margin * max(obj.width, obj.height)
            if obj.y1 + margin < self._y0:
                # FIXME: Should probably inherit mcstack somehow
                LTContainer.add(self, LTAnno(" "))
        self._y0 = obj.y0
        super().add(obj)

    def find_neighbors(
        self,
        plane: Plane[LTComponentT],
        ratio: float,
    ) -> List[LTTextLine]:
        """Finds neighboring LTTextLineVerticals in the plane.

        Returns a list of other LTTextLineVerticals in the plane which are
        close to self. "Close" can be controlled by ratio. The returned objects
        will be the same width as self, and also either upper-, lower-, or
        centrally-aligned.
        """
        d = ratio * self.width
        objs = plane.find((self.x0 - d, self.y0, self.x1 + d, self.y1))
        return [
            obj
            for obj in objs
            if (
                isinstance(obj, LTTextLineVertical)
                and self._is_same_width_as(obj, tolerance=d)
                and (
                    self._is_lower_aligned_with(obj, tolerance=d)
                    or self._is_upper_aligned_with(obj, tolerance=d)
                    or self._is_centrally_aligned_with(obj, tolerance=d)
                )
            )
        ]

    def _is_lower_aligned_with(self, other: LTComponent, tolerance: float = 0) -> bool:
        """Whether the lower edge of `other` is within `tolerance`."""
        return abs(other.y0 - self.y0) <= tolerance

    def _is_upper_aligned_with(self, other: LTComponent, tolerance: float = 0) -> bool:
        """Whether the upper edge of `other` is within `tolerance`."""
        return abs(other.y1 - self.y1) <= tolerance

    def _is_centrally_aligned_with(
        self,
        other: LTComponent,
        tolerance: float = 0,
    ) -> bool:
        """Whether the vertical center of `other` is within `tolerance`."""
        return abs((other.y0 + other.y1) / 2 - (self.y0 + self.y1) / 2) <= tolerance

    def _is_same_width_as(self, other: LTComponent, tolerance: float) -> bool:
        return abs(other.width - self.width) <= tolerance

find_neighbors(plane, ratio)

Finds neighboring LTTextLineVerticals in the plane.

Returns a list of other LTTextLineVerticals in the plane which are close to self. "Close" can be controlled by ratio. The returned objects will be the same width as self, and also either upper-, lower-, or centrally-aligned.

Source code in playa/miner.py
def find_neighbors(
    self,
    plane: Plane[LTComponentT],
    ratio: float,
) -> List[LTTextLine]:
    """Finds neighboring LTTextLineVerticals in the plane.

    Returns a list of other LTTextLineVerticals in the plane which are
    close to self. "Close" can be controlled by ratio. The returned objects
    will be the same width as self, and also either upper-, lower-, or
    centrally-aligned.
    """
    d = ratio * self.width
    objs = plane.find((self.x0 - d, self.y0, self.x1 + d, self.y1))
    return [
        obj
        for obj in objs
        if (
            isinstance(obj, LTTextLineVertical)
            and self._is_same_width_as(obj, tolerance=d)
            and (
                self._is_lower_aligned_with(obj, tolerance=d)
                or self._is_upper_aligned_with(obj, tolerance=d)
                or self._is_centrally_aligned_with(obj, tolerance=d)
            )
        )
    ]

NameTree

A PDF name tree.

See Section 7.9.6 of the PDF 1.7 Reference.

Source code in playa/data_structures.py
class NameTree:
    """A PDF name tree.

    See Section 7.9.6 of the PDF 1.7 Reference.
    """

    def __init__(self, obj: Any):
        self._obj = dict_value(obj)

    def __iter__(self) -> Iterator[Tuple[bytes, Any]]:
        return walk_name_tree(self._obj, None)

    def __contains__(self, name: bytes) -> bool:
        for idx, val in self:
            if idx == name:
                return True
        return False

    def __getitem__(self, name: bytes) -> Any:
        for idx, val in self:
            if idx == name:
                return val
        raise IndexError("Name %r not in tree" % name)
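
For readers unfamiliar with the structure, a name tree stores its entries in leaf nodes as a flat `Names` array of alternating keys and values, with intermediate nodes pointing to children through `Kids`. The sketch below walks such a tree built from plain dictionaries. It is an illustration only: `walk_names` is a hypothetical function, not PLAYA's `walk_name_tree`, and it does not resolve indirect objects or guard against cycles.

```python
from typing import Any, Dict, Iterator, Tuple


def walk_names(node: Dict[str, Any]) -> Iterator[Tuple[bytes, Any]]:
    """Yield (name, value) pairs from a name tree given as nested dicts."""
    names = node.get("Names", [])
    # Names is a flat array: key, value, key, value, ...
    for i in range(0, len(names) - 1, 2):
        yield names[i], names[i + 1]
    for kid in node.get("Kids", []):
        yield from walk_names(kid)


tree = {
    "Kids": [
        {"Limits": [b"a", b"b"], "Names": [b"a", 1, b"b", 2]},
        {"Limits": [b"x", b"x"], "Names": [b"x", 3]},
    ]
}
entries = list(walk_names(tree))
```

As in the listing above, a membership test or lookup built on top of such a walk is a linear scan over the entries.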

NumberTree

A PDF number tree.

See Section 7.9.7 of the PDF 1.7 Reference.

Source code in playa/data_structures.py
class NumberTree:
    """A PDF number tree.

    See Section 7.9.7 of the PDF 1.7 Reference.
    """

    def __init__(self, obj: Any):
        self._obj = dict_value(obj)

    def __iter__(self) -> Iterator[Tuple[int, Any]]:
        return walk_number_tree(self._obj)

    def __contains__(self, num: int) -> bool:
        for idx, _ in walk_number_tree(self._obj, num):
            if idx == num:
                return True
        return False

    def __getitem__(self, num: int) -> Any:
        for idx, val in walk_number_tree(self._obj, num):
            if idx == num:
                return val
        raise IndexError(f"Number {num} not in tree")
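
A number tree has the same shape as a name tree, with integer keys stored in `Nums` arrays, and each non-root node carries a `Limits` pair giving its smallest and largest key. The listing above passes the target number to `walk_number_tree`; how that function uses it is not shown here, so the following is only a sketch of one plausible use, skipping subtrees whose limits exclude the key. `matches` and `lookup` are hypothetical names operating on plain dictionaries.

```python
from typing import Any, Dict, Iterator


def matches(node: Dict[str, Any], key: int) -> Iterator[Any]:
    """Yield values stored under key, skipping subtrees that cannot hold it."""
    limits = node.get("Limits")
    if limits is not None and not (limits[0] <= key <= limits[1]):
        return
    nums = node.get("Nums", [])
    # Nums is a flat array: key, value, key, value, ...
    for i in range(0, len(nums) - 1, 2):
        if nums[i] == key:
            yield nums[i + 1]
    for kid in node.get("Kids", []):
        yield from matches(kid, key)


def lookup(tree: Dict[str, Any], key: int) -> Any:
    for value in matches(tree, key):
        return value
    raise KeyError(key)


tree = {
    "Kids": [
        {"Limits": [0, 4], "Nums": [0, "i", 4, "1"]},
        {"Limits": [10, 20], "Nums": [10, "A", 20, "B"]},
    ]
}
found = lookup(tree, 10)
```

Looking up 10 never inspects the first leaf, because its limits of 0 to 4 rule it out before its entries are read.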

PDFDocument

Representation of a PDF document.

Since PDF documents can be very large and complex, merely creating a Document does very little aside from verifying that the password is correct and getting a minimal amount of metadata. In general, PLAYA will try to open just about anything as a PDF, so you should not expect the constructor to fail here if you give it nonsense (something else may fail later on).

Some metadata, such as the structure tree and page tree, will be loaded lazily and cached. We do not handle modification of PDFs.

Parameters:

  • fp (Union[BinaryIO, bytes], required): File-like object in binary mode, or a buffer with binary data. Files will be read using mmap if possible. They do not need to be seekable, as if mmap fails the entire file will simply be read into memory (so a pipe or socket ought to work).
  • password (str, default ''): Password for decryption, if needed.
  • space (DeviceSpace, default 'screen'): the device space to use for interpreting content ("screen" or "page")

Raises:

  • TypeError: if fp is a file opened in text mode (don't do that!)
  • PDFEncryptionError: if the PDF has an unsupported encryption scheme
  • PDFPasswordIncorrect: if the password is incorrect
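
The input handling described for fp (reject text-mode files, try mmap, otherwise read everything into memory) might look roughly like the sketch below. This uses only the standard library and is not PLAYA's implementation; `load_buffer` is a hypothetical helper written for illustration.

```python
import io
import mmap
from typing import BinaryIO, Union


def load_buffer(fp: Union[BinaryIO, bytes]) -> Union[bytes, mmap.mmap]:
    """Hypothetical helper: return a bytes-like view of the input."""
    if isinstance(fp, io.TextIOBase):
        raise TypeError("file must be opened in binary mode")
    if isinstance(fp, bytes):
        return fp
    try:
        # Length 0 maps the whole file; read-only access is enough here.
        return mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
    except (AttributeError, OSError, ValueError):
        # No usable file descriptor (pipe, socket, in-memory stream) or an
        # empty file: fall back to reading everything into memory.
        return fp.read()


in_memory = load_buffer(io.BytesIO(b"%PDF-1.7\n"))
```

An in-memory stream has no file descriptor, so the call above takes the fallback path and returns ordinary bytes; a regular file opened with mode "rb" would be mapped instead.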

Source code in playa/document.py
class Document:
    """Representation of a PDF document.

    Since PDF documents can be very large and complex, merely creating
    a `Document` does very little aside from verifying that the
    password is correct and getting a minimal amount of metadata.  In
    general, PLAYA will try to open just about anything as a PDF, so
    you should not expect the constructor to fail here if you give it
    nonsense (something else may fail later on).

    Some metadata, such as the structure tree and page tree, will be
    loaded lazily and cached.  We do not handle modification of PDFs.

    Args:
      fp: File-like object in binary mode, or a buffer with binary data.
          Files will be read using `mmap` if possible.  They do not need
          to be seekable, as if `mmap` fails the entire file will simply
          be read into memory (so a pipe or socket ought to work).
      password: Password for decryption, if needed.
      space: the device space to use for interpreting content ("screen"
          or "page")

    Raises:
      TypeError: if `fp` is a file opened in text mode (don't do that!)
      PDFEncryptionError: if the PDF has an unsupported encryption scheme
      PDFPasswordIncorrect: if the password is incorrect
    """

    _fp: Union[BinaryIO, None] = None
    _pages: Union["PageList", None] = None
    _pool: Union[Executor, None] = None
    _outline: Union["Outline", None] = None
    _destinations: Union["Destinations", None] = None
    _structure: Union["Tree", None]
    _fontmap: Union[Dict[str, Font], None] = None

    def __enter__(self) -> "Document":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        self.close()

    def close(self) -> None:
        # If we were opened from a file then close it
        if self._fp:
            self._fp.close()
            self._fp = None
        # Shutdown process pool
        if self._pool:
            self._pool.shutdown()
            self._pool = None

    def __init__(
        self,
        fp: Union[BinaryIO, bytes],
        password: str = "",
        space: DeviceSpace = "screen",
        _boss_id: int = 0,
    ) -> None:
        if _boss_id:
            # Set this **right away** because it is needed to get
            # indirect object references right.
            _set_document(self, _boss_id)
            assert in_worker()
        self.xrefs: List[XRef] = []
        self.space = space
        self.info = []
        self.catalog: Dict[str, Any] = {}
        self.encryption: Optional[Tuple[Any, Any]] = None
        self.decipher: Optional[DecipherCallable] = None
        self._cached_objs: Dict[int, PDFObject] = {}
        self._parsed_objs: Dict[int, Tuple[List[PDFObject], int]] = {}
        self._cached_fonts: Dict[int, Font] = {}
        self._cached_inline_images: Dict[
            Tuple[int, int], Tuple[int, Optional[InlineImage]]
        ] = {}
        if isinstance(fp, io.TextIOBase):
            raise TypeError("fp is not a binary file")
        self.pdf_version, self.offset, self.buffer = _open_input(fp)
        self.is_printable = self.is_modifiable = self.is_extractable = True
        # Getting the XRef table and trailer is done non-lazily
        # because they contain encryption information among other
        # things.  As noted above we don't try to look for the first
        # page cross-reference table (for linearized PDFs) after the
        # header, it will instead be loaded with all the rest.
        self.parser = IndirectObjectParser(self.buffer, self)
        self.parser.seek(self.offset)
        self._xrefpos: Set[int] = set()
        try:
            self._read_xrefs()
        except Exception as e:
            log.debug(
                "Failed to parse xref table, falling back to object parser: %s",
                e,
            )
            newxref = XRefFallback(self.parser)
            self.xrefs.append(newxref)
        # Now find the trailer
        for xref in self.xrefs:
            trailer = xref.trailer
            if not trailer:
                continue
            # If there's an encryption info, remember it.
            if "Encrypt" in trailer:
                if "ID" in trailer:
                    id_value = list_value(trailer["ID"])
                else:
                    # Some documents may not have a /ID, use two empty
                    # byte strings instead. Solves
                    # https://github.com/pdfminer/pdfminer.six/issues/594
                    id_value = (b"", b"")
                self.encryption = (id_value, dict_value(trailer["Encrypt"]))
                self._initialize_password(password)
            if "Info" in trailer:
                try:
                    self.info.append(dict_value(trailer["Info"]))
                except TypeError:
                    log.warning("Info is a broken reference (incorrect xref table?)")
            if "Root" in trailer:
                # Every PDF file must have exactly one /Root dictionary.
                try:
                    self.catalog = dict_value(trailer["Root"])
                except TypeError:
                    log.warning("Root is a broken reference (incorrect xref table?)")
                    self.catalog = {}
                break
        else:
            log.warning("No /Root object! - Is this really a PDF?")
        if self.catalog.get("Type") is not LITERAL_CATALOG:
            log.warning("Catalog not found!")
        if "Version" in self.catalog:
            log.debug(
                "Using PDF version %r from catalog instead of %r from header",
                self.catalog["Version"],
                self.pdf_version,
            )
            self.pdf_version = literal_name(self.catalog["Version"])
        self.is_tagged = False
        markinfo = resolve1(self.catalog.get("MarkInfo"))
        if isinstance(markinfo, dict):
            self.is_tagged = bool(markinfo.get("Marked"))

    def _read_xrefs(self):
        try:
            xrefpos = self._find_xref()
        except Exception as e:
            raise PDFSyntaxError("No xref table found at end of file") from e
        try:
            self._read_xref_from(xrefpos, self.xrefs)
            return
        except (ValueError, IndexError, StopIteration, PDFSyntaxError) as e:
            log.warning("Checking for two PDFs in a trenchcoat: %s", e)
            xrefpos = self._detect_concatenation(xrefpos)
            if xrefpos == -1:
                raise PDFSyntaxError("Failed to read xref table at end of file") from e
        try:
            self._read_xref_from(xrefpos, self.xrefs)
        except (ValueError, IndexError, StopIteration, PDFSyntaxError) as e:
            raise PDFSyntaxError(
                "Failed to read xref table with adjusted offset"
            ) from e

    def _detect_concatenation(self, xrefpos: int) -> int:
        # Detect the case where two (or more) PDFs have been
        # concatenated, or where somebody tried an "incremental
        # update" without updating the xref table
        filestart = self.buffer.rfind(b"%%EOF")
        log.debug("Found ultimate %%EOF at %d", filestart)
        if filestart != -1:
            filestart = self.buffer.rfind(b"%%EOF", 0, filestart)
            log.debug("Found penultimate %%EOF at %d", filestart)
        if filestart != -1:
            filestart += 5
            while self.buffer[filestart] in (10, 13):
                filestart += 1
            parser = ObjectParser(self.buffer, self, filestart + xrefpos)
            try:
                (pos, token) = parser.nexttoken()
            except StopIteration:
                raise ValueError(f"Unexpected EOF at {filestart}")
            if token is KEYWORD_XREF:
                log.debug(
                    "Found two PDFs in a trenchcoat at %d "
                    "(second xref is at %d not %d)",
                    filestart,
                    pos,
                    xrefpos,
                )
                self.offset = filestart
                return pos
        return -1

    def _initialize_password(self, password: str = "") -> None:
        """Initialize the decryption handler with a given password, if any.

        Internal function, requires the Encrypt dictionary to have
        been read from the trailer into self.encryption.
        """
        assert self.encryption is not None
        (docid, param) = self.encryption
        if literal_name(param.get("Filter")) != "Standard":
            raise PDFEncryptionError("Unknown filter: param=%r" % param)
        v = int_value(param.get("V", 0))
        # 3 (PDF 1.4) An unpublished algorithm that permits encryption
        # key lengths ranging from 40 to 128 bits. This value shall
        # not appear in a conforming PDF file.
        if v == 3:
            raise PDFEncryptionError("Unpublished algorithm 3 not supported")
        factory = SECURITY_HANDLERS.get(v)
        # 0 An algorithm that is undocumented. This value shall not be used.
        if factory is None:
            raise PDFEncryptionError("Unknown algorithm: param=%r" % param)
        handler = factory(docid, param, password)
        self.decipher = handler.decrypt
        self.is_printable = handler.is_printable
        self.is_modifiable = handler.is_modifiable
        self.is_extractable = handler.is_extractable
        assert self.parser is not None
        # Ensure that no extra data leaks into encrypted streams
        self.parser.strict = True
        self.parser.decipher = self.decipher

    def __iter__(self) -> Iterator[IndirectObject]:
        """Iterate over top-level `IndirectObject` (does not expand object streams)"""
        return (
            obj
            for pos, obj in IndirectObjectParser(
                self.buffer, self, pos=self.offset, strict=self.parser.strict
            )
        )

    @property
    def objects(self) -> Iterator[IndirectObject]:
        """Iterate over all indirect objects (including, then expanding object
        streams)"""
        for _, obj in IndirectObjectParser(
            self.buffer, self, pos=self.offset, strict=self.parser.strict
        ):
            yield obj
            if (
                isinstance(obj.obj, ContentStream)
                and obj.obj.get("Type") is LITERAL_OBJSTM
            ):
                parser = ObjectStreamParser(obj.obj, self)
                for _, sobj in parser:
                    yield sobj

    @property
    def tokens(self) -> Iterator[Token]:
        """Iterate over tokens."""
        return (tok for pos, tok in Lexer(self.buffer))

    @property
    def structure(self) -> Union[Tree, None]:
        """Logical structure of this document, if any.

        In the case where no logical structure tree exists, this will
        be `None`.  Otherwise you may iterate over it, search it, etc.

        We do this instead of simply returning an empty structure tree
        because the vast majority of PDFs have no logical structure.
        Also, because the structure is a lazy object (the type
        signature here may change to `Iterable[Element]` at some
        point) there is no way to know if it's empty without iterating
        over it.

        """
        if hasattr(self, "_structure"):
            return self._structure
        try:
            self._structure = Tree(self)
        except (TypeError, KeyError):
            self._structure = None
        return self._structure

    def _getobj_objstm(
        self, stream: ContentStream, index: int, objid: int
    ) -> PDFObject:
        if stream.objid in self._parsed_objs:
            (objs, n) = self._parsed_objs[stream.objid]
        else:
            (objs, n) = self._get_objects(stream)
            assert stream.objid is not None
            self._parsed_objs[stream.objid] = (objs, n)
        i = n * 2 + index
        try:
            obj = objs[i]
        except IndexError:
            raise PDFSyntaxError("index too big: %r" % index)
        return obj

    def _get_objects(self, stream: ContentStream) -> Tuple[List[PDFObject], int]:
        if stream.get("Type") is not LITERAL_OBJSTM:
            log.warning("Content stream Type is not /ObjStm: %r", stream)
        try:
            n = int_value(stream["N"])
        except KeyError:
            log.warning("N is not defined in content stream: %r", stream)
            n = 0
        except TypeError:
            log.warning("N is invalid in content stream: %r", stream)
            n = 0
        parser = ObjectParser(stream.buffer, self)
        objs: List[PDFObject] = [obj for _, obj in parser]
        return (objs, n)

    def _getobj_parse(self, pos: int, objid: int) -> PDFObject:
        assert self.parser is not None
        self.parser.seek(pos)
        try:
            m = INDOBJR.match(self.buffer, pos)
            if m is None:
                raise PDFSyntaxError(
                    f"Not an indirect object at position {pos}: "
                    f"{self.buffer[pos:pos+8]!r}"
                )
            _, obj = next(self.parser)
            if obj.objid != objid:
                raise PDFSyntaxError(f"objid mismatch: {obj.objid!r}={objid!r}")
        except (ValueError, IndexError, PDFSyntaxError) as e:
            if self.parser.strict:
                raise PDFSyntaxError(
                    "Indirect object %d not found at position %d"
                    % (
                        objid,
                        pos,
                    )
                )
            else:
                log.warning(
                    "Indirect object %d not found at position %d: %r", objid, pos, e
                )
            obj = self._getobj_parse_approx(pos, objid)
        if obj.objid != objid:
            raise PDFSyntaxError(f"objid mismatch: {obj.objid!r}={objid!r}")
        return obj.obj

    def _getobj_parse_approx(self, pos: int, objid: int) -> IndirectObject:
        # In case of malformed pdf files where the offset in the
        # xref table doesn't point exactly at the object
        # definition (probably more frequent than you think), just
        # use a regular expression to find the object because we
        # can do that.
        realpos = -1
        lastgen = -1
        for m in re.finditer(rb"\b%d\s+(\d+)\s+obj" % objid, self.buffer):
            genno = int(m.group(1))
            if genno > lastgen:
                lastgen = genno
                realpos = m.start(0)
        if realpos == -1:
            raise PDFSyntaxError(f"Indirect object {objid} not found in document")
        self.parser.seek(realpos)
        (_, obj) = next(self.parser)
        return obj

    def __getitem__(self, objid: int) -> PDFObject:
        """Get an indirect object from the PDF.

        Note that the behaviour in the case of a non-existent object
        (raising `IndexError`), while Pythonic, is not PDFic, as PDF
        1.7 sec 7.3.10 states:

        > An indirect reference to an undefined object shall not be
        considered an error by a conforming reader; it shall be
        treated as a reference to the null object.

        Raises:
          ValueError: if Document is not initialized
          IndexError: if objid does not exist in PDF

        """
        if not self.xrefs:
            raise ValueError("Document is not initialized")
        if objid not in self._cached_objs:
            obj = None
            for xref in self.xrefs:
                try:
                    (strmid, index, genno) = xref.get_pos(objid)
                except KeyError:
                    continue
                try:
                    if strmid is not None:
                        stream = stream_value(self[strmid])
                        obj = self._getobj_objstm(stream, index, objid)
                    else:
                        obj = self._getobj_parse(index, objid)
                    break
                # FIXME: We might not actually want to catch these...
                except StopIteration:
                    log.debug("EOF when searching for object %d", objid)
                    continue
                except PDFSyntaxError as e:
                    log.debug("Syntax error when searching for object %d: %s", objid, e)
                    continue
            # Store it anyway as None if we can't find it to avoid costly searching
            self._cached_objs[objid] = obj
        # To get standards compliant behaviour simply remove this
        if self._cached_objs[objid] is None:
            raise IndexError(f"Object with ID {objid} not found")
        return self._cached_objs[objid]

    def get_font(
        self, objid: int = 0, spec: Union[Dict[str, PDFObject], None] = None
    ) -> Font:
        if objid and objid in self._cached_fonts:
            return self._cached_fonts[objid]
        if spec is None:
            return Font({}, {})
        # Create a Font object, hopefully
        font: Union[Font, None] = None
        if spec.get("Type") is not LITERAL_FONT:
            log.warning("Font Type is not /Font: %r", spec)
        subtype = spec.get("Subtype")
        if subtype in (LITERAL_TYPE1, LITERAL_MMTYPE1):
            font = Type1Font(spec)
        elif subtype is LITERAL_TRUETYPE:
            font = TrueTypeFont(spec)
        elif subtype == LITERAL_TYPE3:
            font = Type3Font(spec)
        elif subtype == LITERAL_TYPE0:
            if "DescendantFonts" not in spec:
                log.warning("Type0 font has no DescendantFonts: %r", spec)
            else:
                dfonts = list_value(spec["DescendantFonts"])
                if len(dfonts) != 1:
                    log.debug(
                        "Type 0 font should have 1 descendant, has more: %r", dfonts
                    )
                subspec = resolve1(dfonts[0])
                if not isinstance(subspec, dict):
                    log.warning("Invalid descendant font: %r", subspec)
                else:
                    subspec = subspec.copy()
                    # Merge the root and descendant font dictionaries
                    for k in ("Encoding", "ToUnicode"):
                        if k in spec:
                            subspec[k] = resolve1(spec[k])
                    font = CIDFont(subspec)
        else:
            log.warning("Unknown Subtype in font: %r", spec)
        if font is None:
            # We need a dummy font object to be able to do *something*
            # (even if it's the wrong thing) with text objects.
            font = Font({}, {})
        if objid:
            self._cached_fonts[objid] = font
        return font

    @property
    def fonts(self) -> Mapping[str, Font]:
        """Get the mapping of font names to fonts for this document.

        Note that this can be quite slow the first time it's accessed
        as it must scan every single page in the document.

        Note: Font names may collide.
            Font names are generally understood to be globally unique
            <del>in the neighbourhood</del> in the document, but there's no
            guarantee that this is the case.  In keeping with the
            "incremental update" philosophy dear to PDF, you get the
            last font with a given name.

        Danger: Do not rely on this being a `dict`.
            Currently this is implemented eagerly, but in the future it
            may return a lazy object which only loads fonts on demand.

        """
        if self._fontmap is not None:
            return self._fontmap
        self._fontmap = {}
        for page in self.pages:
            for font in page.fonts.values():
                self._fontmap[font.fontname] = font
        return self._fontmap

    @property
    def outline(self) -> Union[Outline, None]:
        """Document outline, if any."""
        if "Outlines" not in self.catalog:
            return None
        if self._outline is None:
            try:
                self._outline = Outline(self)
            except TypeError:
                log.warning(
                    "Invalid Outlines entry in catalog: %r", self.catalog["Outlines"]
                )
                return None
        return self._outline

    @property
    def page_labels(self) -> Iterator[str]:
        """Generate page label strings for the PDF document.

        If the document includes page labels, generates strings, one per page.
        If not, raise KeyError.

        The resulting iterator is unbounded (because the page label
        tree does not actually include all the pages), so it is
        recommended to use `pages` instead.

        Raises:
          KeyError: No page labels are present in the catalog

        """
        assert self.catalog is not None  # really it cannot be None

        page_labels = PageLabels(self.catalog["PageLabels"])
        return page_labels.labels

    def _get_pages_from_xrefs(
        self,
    ) -> Iterator[Tuple[int, Dict[str, Dict[str, PDFObject]]]]:
        """Find pages from the cross-reference tables if the page tree
        is missing (note that this only happens in invalid PDFs, but
        it happens.)

        Returns:
          an iterator over (objid, dict) pairs.
        """
        for xref in self.xrefs:
            for object_id in xref.objids:
                try:
                    obj = self[object_id]
                    if isinstance(obj, dict) and obj.get("Type") is LITERAL_PAGE:
                        yield object_id, obj
                except IndexError:
                    pass

    def _get_page_objects(
        self,
    ) -> Iterator[Tuple[int, Dict[str, Dict[str, PDFObject]]]]:
        """Iterate over the flattened page tree in reading order, propagating
        inheritable attributes.  Returns an iterator over (objid, dict) pairs.

        Raises:
          KeyError: if there is no page tree.
        """
        if "Pages" not in self.catalog:
            raise KeyError("No 'Pages' entry in catalog")
        stack = [(self.catalog["Pages"], self.catalog)]
        visited = set()
        while stack:
            (obj, parent) = stack.pop()
            if isinstance(obj, ObjRef):
                # The PDF specification *requires* both the Pages
                # element of the catalog and the entries in Kids in
                # the page tree to be indirect references.
                object_id = int(obj.objid)
            elif isinstance(obj, int):
                # Should not happen in a valid PDF, but probably does?
                log.warning("Page tree contains bare integer: %r in %r", obj, parent)
                object_id = obj
            elif obj is None:
                log.warning("Skipping null value in page tree")
                continue
            else:
                log.warning("Page tree contains unknown object: %r", obj)
                # Nothing to look up: object_id would be unset or stale
                continue
            try:
                page_object = dict_value(self[object_id])
            except IndexError as e:
                log.warning("Skipping missing page object: %s", e)
                continue

            # Avoid recursion errors by keeping track of visited nodes
            # (again, this should never actually happen in a valid PDF)
            if object_id in visited:
                log.warning("Circular reference %r in page tree", obj)
                continue
            visited.add(object_id)

            # Propagate inheritable attributes
            object_properties = page_object.copy()
            for k, v in parent.items():
                if k in INHERITABLE_PAGE_ATTRS and k not in object_properties:
                    object_properties[k] = v

            # Recurse, depth-first
            object_type = object_properties.get("Type")
            if object_type is None:
                log.warning("Page has no Type, trying type: %r", object_properties)
                object_type = object_properties.get("type")
            if object_type is LITERAL_PAGES and "Kids" in object_properties:
                for child in reversed(list_value(object_properties["Kids"])):
                    stack.append((child, object_properties))
            elif object_type is LITERAL_PAGE:
                yield object_id, object_properties

    @property
    def pages(self) -> "PageList":
        """Pages of the document as an iterable/addressable `PageList` object."""
        if self._pages is None:
            self._pages = PageList(self)
        return self._pages

    @property
    def names(self) -> Dict[str, Any]:
        """PDF name dictionary (PDF 1.7 sec 7.7.4).

        Raises:
          KeyError: if nonexistent.
        """
        return dict_value(self.catalog["Names"])

    @property
    def destinations(self) -> "Destinations":
        """Named destinations as an iterable/addressable `Destinations` object."""
        if self._destinations is None:
            self._destinations = Destinations(self)
        return self._destinations

    def _find_xref(self) -> int:
        """Internal function used to locate the first XRef."""
        # Look for startxref and try to get a position from the
        # following token (there is supposed to be a newline, but...)
        pos = self.buffer.rfind(b"startxref")
        if pos != -1:
            m = STARTXREFR.match(self.buffer, pos)
            if m is not None:
                start = int(m[1])
                if start > pos:
                    raise ValueError(
                        "Invalid startxref position (> %d): %d" % (pos, start)
                    )
                return start + self.offset

        # Otherwise, just look for an xref, raising ValueError
        pos = self.buffer.rfind(b"xref")
        if pos == -1:
            raise ValueError("xref not found in document")
        return pos

    # read xref table
    def _read_xref_from(
        self,
        start: int,
        xrefs: List[XRef],
    ) -> None:
        """Reads XRefs from the given location."""
        if start in self._xrefpos:
            log.warning("Detected circular xref chain at %d", start)
            return
        # Look for an XRefStream first, then an XRefTable
        if INDOBJR.match(self.buffer, start):
            log.debug("Reading xref stream at %d", start)
            # XRefStream: PDF-1.5
            self.parser.seek(start)
            self.parser.reset()
            xref: XRef = XRefStream(self.parser, self.offset)
        elif m := XREFR.match(self.buffer, start):
            log.debug("Reading xref table at %d", m.start(1))
            parser = ObjectParser(self.buffer, self, pos=m.start(1))
            xref = XRefTable(
                parser,
                self.offset,
            )
        else:
            # Well, maybe it's an XRef table without "xref" (but
            # probably not)
            parser = ObjectParser(self.buffer, self, pos=start)
            xref = XRefTable(parser, self.offset)
        self._xrefpos.add(start)
        xrefs.append(xref)
        trailer = xref.trailer
        # For hybrid-reference files, an additional set of xrefs as a
        # stream.
        if "XRefStm" in trailer:
            pos = int_value(trailer["XRefStm"])
            self._read_xref_from(pos + self.offset, xrefs)
        # Recurse into any previous xref tables or streams
        if "Prev" in trailer:
            # find previous xref
            pos = int_value(trailer["Prev"])
            self._read_xref_from(pos + self.offset, xrefs)

destinations property

Named destinations as an iterable/addressable Destinations object.

fonts property

Get the mapping of font names to fonts for this document.

Note that this can be quite slow the first time it's accessed as it must scan every single page in the document.

Note: Font names may collide. Font names are generally understood to be unique within the document, but there's no guarantee that this is the case. In keeping with the "incremental update" philosophy dear to PDF, you get the last font with a given name.

Danger: Do not rely on this being a dict. Currently this is implemented eagerly, but in the future it may return a lazy object which only loads fonts on demand.
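The "last font with a given name wins" behaviour amounts to plain dictionary overwriting while scanning pages in order. A minimal sketch, using a hypothetical FakeFont stand-in rather than PLAYA's Font class, and typed as a Mapping since the return value should not be assumed to be a dict:

```python
from typing import Dict, Iterable, List, Mapping, NamedTuple


class FakeFont(NamedTuple):
    """Hypothetical stand-in for a font: just a name and the page it came from."""

    fontname: str
    page: int


def build_fontmap(pages: Iterable[List[FakeFont]]) -> Mapping[str, FakeFont]:
    """Scan every page in order; a later font replaces an earlier one with the same name."""
    fontmap: Dict[str, FakeFont] = {}
    for fonts in pages:
        for font in fonts:
            fontmap[font.fontname] = font
    return fontmap


pages = [
    [FakeFont("Helvetica", 1)],
    [FakeFont("Helvetica", 2), FakeFont("Times", 2)],
]
fontmap = build_fontmap(pages)
```

Here fontmap["Helvetica"] is the font from page 2, and the one from page 1 is no longer reachable by name.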

names property

PDF name dictionary (PDF 1.7 sec 7.7.4).

Raises:

KeyError: if the catalog has no Names entry.

objects property

Iterate over all indirect objects, yielding each object stream itself and then the objects it contains.

outline property

Document outline, if any.

page_labels property

Generate page label strings for the PDF document.

If the document includes page labels, generates strings, one per page. If not, raises KeyError.

The resulting iterator is unbounded (because the page label tree does not actually include all the pages), so it is recommended to use pages instead.

Raises:

KeyError: if no page labels are present in the catalog
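Since the label iterator is unbounded, anything that consumes it needs its own stopping condition. A small sketch of the idea, where fake_labels is a hypothetical stand-in for the real iterator (roman-numeral front matter followed by arabic numbers) and the page count is assumed to be known from elsewhere:

```python
import itertools
from typing import Iterator, List


def fake_labels() -> Iterator[str]:
    """Hypothetical unbounded label source: i, ii, iii, then 1, 2, 3, ..."""
    yield from ("i", "ii", "iii")
    for n in itertools.count(1):
        yield str(n)


def labels_for(page_count: int) -> List[str]:
    """Take exactly one label per page; islice() is what stops the iteration."""
    return list(itertools.islice(fake_labels(), page_count))
```

Calling list() directly on such an iterator would never return, which is the practical reason to prefer pages.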

pages property

Pages of the document as an iterable/addressable PageList object.

structure property

Logical structure of this document, if any.

In the case where no logical structure tree exists, this will be None. Otherwise you may iterate over it, search it, etc.

We do this instead of simply returning an empty structure tree because the vast majority of PDFs have no logical structure. Also, because the structure is a lazy object (the type signature here may change to Iterable[Element] at some point) there is no way to know if it's empty without iterating over it.
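For callers, this means two separate checks: None means there is no structure tree at all, while emptiness of a lazy object can only be discovered by pulling its first element. A generic sketch of that second check, where peek is a hypothetical helper and not part of PLAYA:

```python
import itertools
from typing import Iterable, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")


def peek(items: Optional[Iterable[T]]) -> Tuple[bool, Iterator[T]]:
    """Return (has_items, iterator) without losing the first element."""
    if items is None:
        # No tree at all, e.g. an untagged document
        return False, iter(())
    it = iter(items)
    try:
        first = next(it)
    except StopIteration:
        # A tree exists but has no elements
        return False, iter(())
    # Put the consumed element back in front of the rest
    return True, itertools.chain([first], it)
```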

tokens property

Iterate over tokens.

__getitem__(objid)

Get an indirect object from the PDF.

Note that the behaviour in the case of a non-existent object (raising IndexError), while Pythonic, is not PDFic, as PDF 1.7 sec 7.3.10 states:

An indirect reference to an undefined object shall not be considered an error by a conforming reader; it shall be treated as a reference to the null object.

Raises:

ValueError: if Document is not initialized

IndexError: if objid does not exist in PDF
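If you want the behaviour the specification asks for, you can map the IndexError back to a null object at the call site. A minimal sketch, in which FakeDocument is a hypothetical stand-in that mimics the documented IndexError and get_or_null is an assumed helper name:

```python
from typing import Any, Dict, Optional


class FakeDocument:
    """Hypothetical stand-in: raises IndexError for unknown object IDs."""

    def __init__(self, objects: Dict[int, Any]) -> None:
        self._objects = objects

    def __getitem__(self, objid: int) -> Any:
        if objid not in self._objects:
            raise IndexError(f"Object with ID {objid} not found")
        return self._objects[objid]


def get_or_null(doc: Any, objid: int) -> Optional[Any]:
    """Look up objid, treating an undefined object as None (the PDF null object)."""
    try:
        return doc[objid]
    except IndexError:
        return None


doc = FakeDocument({1: {"Type": "Catalog"}})
```

With this wrapper, get_or_null(doc, 99) quietly gives None instead of raising.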

Source code in playa/document.py
def __getitem__(self, objid: int) -> PDFObject:
    """Get an indirect object from the PDF.

    Note that the behaviour in the case of a non-existent object
    (raising `IndexError`), while Pythonic, is not PDFic, as PDF
    1.7 sec 7.3.10 states:

    > An indirect reference to an undefined object shall not be
    considered an error by a conforming reader; it shall be
    treated as a reference to the null object.

    Raises:
      ValueError: if Document is not initialized
      IndexError: if objid does not exist in PDF

    """
    if not self.xrefs:
        raise ValueError("Document is not initialized")
    if objid not in self._cached_objs:
        obj = None
        for xref in self.xrefs:
            try:
                (strmid, index, genno) = xref.get_pos(objid)
            except KeyError:
                continue
            try:
                if strmid is not None:
                    stream = stream_value(self[strmid])
                    obj = self._getobj_objstm(stream, index, objid)
                else:
                    obj = self._getobj_parse(index, objid)
                break
            # FIXME: We might not actually want to catch these...
            except StopIteration:
                log.debug("EOF when searching for object %d", objid)
                continue
            except PDFSyntaxError as e:
                log.debug("Syntax error when searching for object %d: %s", objid, e)
                continue
        # Store it anyway as None if we can't find it to avoid costly searching
        self._cached_objs[objid] = obj
    # To get standards compliant behaviour simply remove this
    if self._cached_objs[objid] is None:
        raise IndexError(f"Object with ID {objid} not found")
    return self._cached_objs[objid]

__iter__()

Iterate over top-level IndirectObject (does not expand object streams)

Source code in playa/document.py
def __iter__(self) -> Iterator[IndirectObject]:
    """Iterate over top-level `IndirectObject` (does not expand object streams)"""
    return (
        obj
        for pos, obj in IndirectObjectParser(
            self.buffer, self, pos=self.offset, strict=self.parser.strict
        )
    )

PDFObjRef

Source code in playa/pdftypes.py
class ObjRef:
    def __init__(
        self,
        doc: Union[DocumentRef, None],
        objid: int,
    ) -> None:
        """Reference to a PDF object.

        :param doc: The PDF document.
        :param objid: The object number.
        """
        if objid == 0:
            raise ValueError("PDF object id cannot be 0.")

        self.doc = doc
        self.objid = objid

    def __eq__(self, other: Any) -> bool:
        if not isinstance(other, ObjRef):
            raise NotImplementedError("Unimplemented comparison with non-ObjRef")
        if self.doc is None and other.doc is None:
            return self.objid == other.objid
        elif self.doc is None or other.doc is None:
            return False
        else:
            selfdoc = _deref_document(self.doc)
            otherdoc = _deref_document(other.doc)
            return selfdoc is otherdoc and self.objid == other.objid

    def __hash__(self) -> int:
        return self.objid

    def __repr__(self) -> str:
        return "<ObjRef:%d>" % (self.objid)

    def resolve(self, default: Any = None) -> Any:
        if self.doc is None:
            return default
        doc = _deref_document(self.doc)
        try:
            return doc[self.objid]
        except IndexError:
            return default

__init__(doc, objid)

Reference to a PDF object.

Parameters:

Name Type Description Default
doc Union[DocumentRef, None]

The PDF document.

required
objid int

The object number.

required
Source code in playa/pdftypes.py
def __init__(
    self,
    doc: Union[DocumentRef, None],
    objid: int,
) -> None:
    """Reference to a PDF object.

    :param doc: The PDF document.
    :param objid: The object number.
    """
    if objid == 0:
        raise ValueError("PDF object id cannot be 0.")

    self.doc = doc
    self.objid = objid

PDFPage

An object that holds the information about a page.

Parameters:

Name Type Description Default
doc Document

a Document object.

required
pageid int

the integer PDF object ID associated with the page in the page tree.

required
attrs Dict

a dictionary of page attributes.

required
label Optional[str]

page label string.

required
page_idx int

0-based index of the page in the document.

0
space DeviceSpace

the device space to use for interpreting content

'screen'

Attributes:

Name Type Description
pageid

the integer object ID associated with the page in the page tree

attrs

a dictionary of page attributes.

resources Dict[str, PDFObject]

a dictionary of resources used by the page.

mediabox

the physical size of the page.

cropbox

the crop rectangle of the page.

rotate

the page rotation (in degrees).

label

the page's label (typically, the logical page number).

page_idx

0-based index of the page in the document.

ctm

coordinate transformation matrix from default user space to page's device space
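To make the `ctm` attribute concrete, here is a sketch of how a six-element PDF matrix maps a point, using the "screen" matrix that `set_initial_ctm` builds for an unrotated page. `apply_matrix` is a hypothetical helper written for this example, not a PLAYA function, and the US Letter MediaBox is an assumed input:

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]
Point = Tuple[float, float]


def apply_matrix(m: Matrix, p: Point) -> Point:
    """Apply a PDF matrix (a, b, c, d, e, f) to a point (x, y)."""
    a, b, c, d, e, f = m
    x, y = p
    return (a * x + c * y + e, b * x + d * y + f)


# Assumed MediaBox: US Letter, origin at bottom left in default user space.
x0, y0, x1, y1 = (0, 0, 612, 792)
# "screen" device space with no rotation: flip y and move the origin to top left.
screen_ctm: Matrix = (1, 0, 0, -1, -x0, y1)

top_left = apply_matrix(screen_ctm, (0, 792))      # top-left corner in user space
bottom_right = apply_matrix(screen_ctm, (612, 0))  # bottom-right corner in user space
print(top_left, bottom_right)
```

With these inputs the top-left corner of the page lands on the device-space origin, which is what the "screen" convention means.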

Source code in playa/page.py
class Page:
    """An object that holds the information about a page.

    Args:
      doc: a Document object.
      pageid: the integer PDF object ID associated with the page in the page tree.
      attrs: a dictionary of page attributes.
      label: page label string.
      page_idx: 0-based index of the page in the document.
      space: the device space to use for interpreting content

    Attributes:
      pageid: the integer object ID associated with the page in the page tree
      attrs: a dictionary of page attributes.
      resources: a dictionary of resources used by the page.
      mediabox: the physical size of the page.
      cropbox: the crop rectangle of the page.
      rotate: the page rotation (in degrees).
      label: the page's label (typically, the logical page number).
      page_idx: 0-based index of the page in the document.
      ctm: coordinate transformation matrix from default user space to
           page's device space
    """

    def __init__(
        self,
        doc: "Document",
        pageid: int,
        attrs: Dict,
        label: Optional[str],
        page_idx: int = 0,
        space: DeviceSpace = "screen",
    ) -> None:
        self.docref = _ref_document(doc)
        self.pageid = pageid
        self.attrs = attrs
        self.label = label
        self.page_idx = page_idx
        self.space = space
        self.pageref = _ref_page(self)
        self.lastmod = resolve1(self.attrs.get("LastModified"))
        try:
            self.resources: Dict[str, PDFObject] = dict_value(
                self.attrs.get("Resources")
            )
        except TypeError:
            log.warning("Resources missing or invalid from Page id %d", pageid)
            self.resources = {}
        try:
            self.mediabox = normalize_rect(rect_value(self.attrs["MediaBox"]))
        except KeyError:
            log.warning(
                "MediaBox missing from Page id %d (and not inherited),"
                " defaulting to US Letter (612x792)",
                pageid,
            )
            self.mediabox = (0, 0, 612, 792)
        except (ValueError, PDFSyntaxError):
            log.warning(
                "MediaBox %r invalid in Page id %d,"
                " defaulting to US Letter (612x792)",
                self.attrs["MediaBox"],
                pageid,
            )
            self.mediabox = (0, 0, 612, 792)
        self.cropbox = self.mediabox
        if "CropBox" in self.attrs:
            try:
                self.cropbox = normalize_rect(rect_value(self.attrs["CropBox"]))
            except (ValueError, PDFSyntaxError):
                log.warning(
                    "Invalid CropBox %r in /Page, defaulting to MediaBox",
                    self.attrs["CropBox"],
                )

        # This is supposed to be an int, but be robust to bogus PDFs where it isn't
        rotate = int(num_value(self.attrs.get("Rotate", 0)))
        self.set_initial_ctm(space, rotate)

        contents = resolve1(self.attrs.get("Contents"))
        if contents is None:
            self._contents = []
        else:
            if isinstance(contents, list):
                self._contents = contents
            else:
                self._contents = [contents]

    def set_initial_ctm(self, space: DeviceSpace, rotate: int) -> Matrix:
        """
        Set or update initial coordinate transform matrix.

        PDF 1.7 section 8.4.1: Initial value: a matrix that
        transforms default user coordinates to device coordinates.

        We keep this as `self.ctm` in order to transform layout
        attributes in tagged PDFs which are specified in default
        user space (PDF 1.7 section 14.8.5.4.3, table 344)

        If you wish to modify the rotation or the device space of the
        page, then you can do it with this method (the initial values
        are in the `rotate` and `space` properties).
        """
        # Normalize the rotation value
        rotate = (rotate + 360) % 360
        x0, y0, x1, y1 = self.mediabox
        width = x1 - x0
        height = y1 - y0
        self.ctm = MATRIX_IDENTITY
        if rotate == 90:
            # x' = y
            # y' = width - x
            self.ctm = (0, -1, 1, 0, 0, width)
        elif rotate == 180:
            # x' = width - x
            # y' = height - y
            self.ctm = (-1, 0, 0, -1, width, height)
        elif rotate == 270:
            # x' = height - y
            # y' = x
            self.ctm = (0, 1, -1, 0, height, 0)
        elif rotate != 0:
            log.warning(
                "Invalid rotation value %r (only multiples of 90 accepted)", rotate
            )
        # Apply this to the mediabox to determine device space
        (x0, y0, x1, y1) = transform_bbox(self.ctm, self.mediabox)
        width = x1 - x0
        height = y1 - y0
        # "screen" device space: origin is top left of MediaBox
        if space == "screen":
            self.ctm = mult_matrix(self.ctm, (1, 0, 0, -1, -x0, y1))
        # "page" device space: origin is bottom left of MediaBox
        elif space == "page":
            self.ctm = mult_matrix(self.ctm, (1, 0, 0, 1, -x0, -y0))
        # "default" device space: no transformation or rotation
        else:
            if space != "default":
                log.warning("Unknown device space: %r", space)
            self.ctm = MATRIX_IDENTITY
            width = height = 0
        self.space = space
        self.rotate = rotate
        return self.ctm

    @property
    def annotations(self) -> Iterator["Annotation"]:
        """Lazily iterate over page annotations."""
        alist = resolve1(self.attrs.get("Annots"))
        if alist is None:
            return
        if not isinstance(alist, list):
            log.warning("Invalid Annots list: %r", alist)
            return
        for obj in alist:
            try:
                yield Annotation.from_dict(obj, self)
            except (TypeError, ValueError, PDFSyntaxError) as e:
                log.warning("Invalid object %r in Annots: %s", obj, e)
                continue

    @property
    def doc(self) -> "Document":
        """Get associated document if it exists."""
        return _deref_document(self.docref)

    @property
    def streams(self) -> Iterator[ContentStream]:
        """Return resolved content streams."""
        for obj in self._contents:
            try:
                yield stream_value(obj)
            except TypeError:
                log.warning("Found non-stream in contents: %r", obj)

    @property
    def width(self) -> float:
        """Width of the page in default user space units."""
        x0, _, x1, _ = self.mediabox
        return x1 - x0

    @property
    def height(self) -> float:
        """Height of the page in default user space units."""
        _, y0, _, y1 = self.mediabox
        return y1 - y0

    @property
    def contents(self) -> Iterator[PDFObject]:
        """Iterator over PDF objects in the content streams."""
        for _, obj in ContentParser(self._contents, self.doc):
            yield obj

    def __iter__(self) -> Iterator["ContentObject"]:
        """Iterator over lazy layout objects."""
        return iter(LazyInterpreter(self, self._contents))

    @property
    def paths(self) -> Iterator["PathObject"]:
        """Iterator over lazy path objects."""
        return self.flatten(PathObject)

    @property
    def images(self) -> Iterator["ImageObject"]:
        """Iterator over lazy image objects."""
        return self.flatten(ImageObject)

    @property
    def texts(self) -> Iterator["TextObject"]:
        """Iterator over lazy text objects."""
        return self.flatten(TextObject)

    @property
    def glyphs(self) -> Iterator["GlyphObject"]:
        """Iterator over lazy glyph objects."""
        for text in self.flatten(TextObject):
            yield from text

    @property
    def xobjects(self) -> Iterator["XObjectObject"]:
        """Return resolved and rendered Form XObjects.

        This does *not* return any image or PostScript XObjects.  You
        can get images via the `images` property.  Apparently you
        aren't supposed to use PostScript XObjects for anything, ever.

        Note that these are the XObjects as rendered on the page, so
        you may see the same named XObject multiple times.  If you
        need to access their actual definitions you'll have to look at
        `page.resources`.

        This will also return Form XObjects within Form XObjects,
        except in the case of circular reference chains.
        """

        from typing import Set

        def xobjects_one(
            itor: Iterable["ContentObject"], parents: Set[int]
        ) -> Iterator["XObjectObject"]:
            for obj in itor:
                if isinstance(obj, XObjectObject):
                    stream_id = 0 if obj.stream.objid is None else obj.stream.objid
                    if stream_id not in parents:
                        yield obj
                        yield from xobjects_one(obj, parents | {stream_id})

        for obj in xobjects_one(self, set()):
            if isinstance(obj, XObjectObject):
                yield obj

    @property
    def tokens(self) -> Iterator[Token]:
        """Iterator over tokens in the content streams."""
        parser = ContentParser(self._contents, self.doc)
        while True:
            try:
                pos, tok = parser.nexttoken()
            except StopIteration:
                return
            yield tok

    @property
    def parent_key(self) -> Union[int, None]:
        """Parent tree key for this page, if any."""
        if "StructParents" in self.attrs:
            return int_value(self.attrs["StructParents"])
        return None

    @property
    def structure(self) -> "PageStructure":
        """Mapping of marked content IDs to logical structure elements.

        This is a sequence of logical structure elements, or `None`
        for unused marked content IDs.  Note that because structure
        elements may contain multiple marked content sections, the
        same element may occur multiple times in this list.

        It also has `find` and `find_all` methods which allow you to
        access enclosing structural elements (you can also use the
        `parent` method of elements for that)

        Note: This is not the same as `playa.Document.structure`.
            PDF documents have logical structure, but PDF pages **do
            not**, and it is dishonest to pretend otherwise (as some
            code I once wrote unfortunately does).  What they do have
            is marked content sections which correspond to content
            items in the logical structure tree.

        """
        from playa.structure import PageStructure

        if hasattr(self, "_structmap"):
            return self._structmap
        self._structmap: PageStructure = PageStructure(self.pageref, [])
        if self.doc.structure is None:
            return self._structmap
        parent_key = self.parent_key
        if parent_key is None:
            return self._structmap
        try:
            self._structmap = PageStructure(
                self.pageref, self.doc.structure.parent_tree[parent_key]
            )
        except (IndexError, TypeError) as e:
            log.warning("Invalid StructParents: %r (%s)", parent_key, e)
        return self._structmap

    @property
    def marked_content(self) -> Sequence[Union[None, Iterable["ContentObject"]]]:
        """Mapping of marked content IDs to iterators over content objects.

        These are the content objects associated with the structural
        elements in `Page.structure`.  So, for instance, you can do:

            for element, contents in zip(page.structure,
                                         page.marked_content):
                if element is not None:
                    if contents is not None:
                        for obj in contents:
                            ...  # do something with it

        Or you can also access the contents of a single element:

            if page.marked_content[mcid] is not None:
                for obj in page.marked_content[mcid]:
                    ... # do something with it

        Why do you have to check if it's `None`?  Because the values
        are not necessarily sequences (they may just be positions in
        the content stream), it isn't possible to know if they are
        empty without iterating over them, which you may or may not
        want to do, because you are Lazy.
        """
        if hasattr(self, "_marked_contents"):
            return self._marked_contents
        self._marked_contents: Sequence[Union[None, Iterable["ContentObject"]]] = (
            _make_contentmap(self)
        )
        return self._marked_contents

    @property
    def fonts(self) -> Mapping[str, Font]:
        """Mapping of resource names to fonts for this page.

        Note: This is not the same as `playa.Document.fonts`.
            The resource names (e.g. `F1`, `F42`, `FooBar`) here are
            specific to a page (or Form XObject) resource dictionary
            and have no relation to the font name as commonly
            understood (e.g. `Helvetica`,
            `WQERQE+Arial-SuperBold-HJRE-UTF-8`).  Since font names are
            generally considered to be globally unique, it may be
            possible to access fonts by them in the future.

        Note: This does not include fonts specific to Form XObjects.
            Since it is possible for the resource names to collide,
            this will only return the fonts for a page and not for any
            Form XObjects invoked on it.  You may use
            `XObjectObject.fonts` to access these.

        Danger: Do not rely on this being a `dict`.
            Currently this is implemented eagerly, but in the future it
            may return a lazy object which only loads fonts on demand.

        """
        if hasattr(self, "_fontmap"):
            return self._fontmap
        self._fontmap: Dict[str, Font] = _make_fontmap(
            self.resources.get("Font"), self.doc
        )
        return self._fontmap

    def __repr__(self) -> str:
        return f"<Page: Resources={self.resources!r}, MediaBox={self.mediabox!r}>"

    @overload
    def flatten(self) -> Iterator["ContentObject"]: ...

    @overload
    def flatten(self, filter_class: Type[CO]) -> Iterator[CO]: ...

    def flatten(
        self, filter_class: Union[None, Type[CO]] = None
    ) -> Iterator[Union[CO, "ContentObject"]]:
        """Iterate over content objects, recursing into form XObjects."""

        from typing import Set

        def flatten_one(
            itor: Iterable["ContentObject"], parents: Set[int]
        ) -> Iterator["ContentObject"]:
            for obj in itor:
                if isinstance(obj, XObjectObject):
                    stream_id = 0 if obj.stream.objid is None else obj.stream.objid
                    if stream_id not in parents:
                        yield from flatten_one(obj, parents | {stream_id})
                else:
                    yield obj

        if filter_class is None:
            yield from flatten_one(self, set())
        else:
            for obj in flatten_one(self, set()):
                if isinstance(obj, filter_class):
                    yield obj

    @property
    def mcid_texts(self) -> Mapping[int, List[str]]:
        """Mapping of marked content IDs to Unicode text strings.

        For use in text extraction from tagged PDFs.  This is a
        special case of `marked_content` which only cares about
        extracting text (and thus is quite a bit more efficient).

        Danger: Do not rely on this being a `dict`.
            Currently this is implemented eagerly, but in the future it
            may return a lazy object.

        """
        if hasattr(self, "_textmap"):
            return self._textmap
        self._textmap: Mapping[int, List[str]] = _extract_mcid_texts(self)
        return self._textmap

    def extract_text(self) -> str:
        """Do some best-effort text extraction.

        This necessarily involves a few heuristics, so don't get your
        hopes up.  It will attempt to use marked content information
        for a tagged PDF, otherwise it will fall back on the character
        displacement and line matrix to determine word and line breaks.
        """
        if self.doc.is_tagged:
            return self.extract_text_tagged()
        else:
            return self.extract_text_untagged()

    def extract_text_untagged(self) -> str:
        """Get text from a page of an untagged PDF."""

        def _extract_text_from_obj(
            obj: "TextObject", vertical: bool, prev_end: float
        ) -> Tuple[str, float]:
            """Try to get text from a text object."""
            chars: List[str] = []
            for glyph in obj:
                x, y = glyph.origin
                off = y if vertical else x
                # 0.5 here is a heuristic!!!
                if prev_end and off - prev_end > 0.5:
                    if chars and chars[-1] != " ":
                        chars.append(" ")
                if glyph.text is not None:
                    chars.append(glyph.text)
                dx, dy = glyph.displacement
                prev_end = off + (dy if vertical else dx)
            return "".join(chars), prev_end

        prev_end = 0.0
        prev_origin: Union[Point, None] = None
        lines = []
        strings: List[str] = []
        for text in self.texts:
            if text.gstate.font is None:
                continue
            vertical = text.gstate.font.vertical
            # Track changes to the translation component of text
            # rendering matrix to (yes, heuristically) detect newlines
            # and spaces between text objects
            dx, dy = text.origin
            off = dy if vertical else dx
            if strings and self._next_line(text, prev_origin):
                lines.append("".join(strings))
                strings.clear()
            # 0.5 here is a heuristic!!!
            if strings and off - prev_end > 0.5 and not strings[-1].endswith(" "):
                strings.append(" ")
            textstr, prev_end = _extract_text_from_obj(text, vertical, off)
            strings.append(textstr)
            prev_origin = dx, dy
        if strings:
            lines.append("".join(strings))
        return "\n".join(lines)

    def _next_line(
        self, text: Union[TextObject, None], prev_offset: Union[Point, None]
    ) -> bool:
        if text is None:
            return False
        if text.gstate.font is None:
            return False
        if prev_offset is None:
            return False
        offset = text.origin

        # Vertical text (usually) means right-to-left lines
        if text.gstate.font.vertical:
            line_offset = offset[0] - prev_offset[0]
        else:
            # The CTM isn't useful here because we actually do care
            # about the final device space, and we just want to know
            # which way is up and which way is down.
            dy = offset[1] - prev_offset[1]
            if self.space == "screen":
                line_offset = -dy
            else:
                line_offset = dy
        return line_offset < 0

    def extract_text_tagged(self) -> str:
        """Get text from a page of a tagged PDF."""
        lines: List[str] = []
        strings: List[str] = []
        prev_mcid: Union[int, None] = None
        prev_origin: Union[Point, None] = None
        # TODO: Iteration over marked content sections and getting
        # their text, origin, and displacement, will be refactored
        for mcs, texts in itertools.groupby(self.texts, operator.attrgetter("mcs")):
            text: Union[TextObject, None] = None
            # TODO: Artifact can also be a structure element, but
            # also, any content outside the structure tree is
            # considered an artifact
            if mcs is None or mcs.tag == "Artifact":
                for text in texts:
                    prev_origin = text.origin
                continue
            actual_text = mcs.props.get("ActualText")
            if actual_text is None:
                reversed = mcs.tag == "ReversedChars"
                c = []
                for text in texts:  # noqa: B031
                    c.append(text.chars[::-1] if reversed else text.chars)
                chars = "".join(c)
            else:
                assert isinstance(actual_text, bytes)
                # It's a text string so decode_text it
                chars = decode_text(actual_text)
                # Consume all text objects to ensure correct graphicstate
                for _ in texts:  # noqa: B031
                    pass

            # Remove soft hyphens
            chars = chars.replace("\xad", "")
            # There *might* be a line break, determine based on origin
            if mcs.mcid != prev_mcid:
                if self._next_line(text, prev_origin):
                    lines.extend(textwrap.wrap("".join(strings)))
                    strings.clear()
                prev_mcid = mcs.mcid
            strings.append(chars)
            if text is not None:
                prev_origin = text.origin
        if strings:
            lines.extend(textwrap.wrap("".join(strings)))
        return "\n".join(lines)

annotations property

Lazily iterate over page annotations.

contents property

Iterator over PDF objects in the content streams.

doc property

Get associated document if it exists.

fonts property

Mapping of resource names to fonts for this page.

Note: This is not the same as playa.Document.fonts. The resource names (e.g. F1, F42, FooBar) here are specific to a page (or Form XObject) resource dictionary and have no relation to the font name as commonly understood (e.g. Helvetica, WQERQE+Arial-SuperBold-HJRE-UTF-8). Since font names are generally considered to be globally unique, it may be possible to access fonts by them in the future.

Note: This does not include fonts specific to Form XObjects. Since it is possible for the resource names to collide, this will only return the fonts for a page and not for any Form XObjects invoked on it. You may use XObjectObject.fonts to access these.

Danger: Do not rely on this being a dict. Currently this is implemented eagerly, but in the future it may return a lazy object which only loads fonts on demand.

glyphs property

Iterator over lazy glyph objects.

height property

Height of the page in default user space units.

images property

Iterator over lazy image objects.

marked_content property

Mapping of marked content IDs to iterators over content objects.

These are the content objects associated with the structural elements in Page.structure. So, for instance, you can do:

for element, contents in zip(page.structure,
                             page.marked_content):
    if element is not None:
        if contents is not None:
            for obj in contents:
                ...  # do something with it

Or you can also access the contents of a single element:

if page.marked_content[mcid] is not None:
    for obj in page.marked_content[mcid]:
        ... # do something with it

Why do you have to check if it's None? Because the values are not necessarily sequences (they may just be positions in the content stream), it isn't possible to know if they are empty without iterating over them, which you may or may not want to do, because you are Lazy.

mcid_texts property

Mapping of marked content IDs to Unicode text strings.

For use in text extraction from tagged PDFs. This is a special case of marked_content which only cares about extracting text (and thus is quite a bit more efficient).

Danger: Do not rely on this being a dict. Currently this is implemented eagerly, but in the future it may return a lazy object.

parent_key property

Parent tree key for this page, if any.

paths property

Iterator over lazy path objects.

streams property

Return resolved content streams.

structure property

Mapping of marked content IDs to logical structure elements.

This is a sequence of logical structure elements, or None for unused marked content IDs. Note that because structure elements may contain multiple marked content sections, the same element may occur multiple times in this list.

It also has find and find_all methods which allow you to access enclosing structural elements (you can also use the parent method of elements for that).

Note: This is not the same as playa.Document.structure. PDF documents have logical structure, but PDF pages do not, and it is dishonest to pretend otherwise (as some code I once wrote unfortunately does). What they do have is marked content sections which correspond to content items in the logical structure tree.

texts property

Iterator over lazy text objects.

tokens property

Iterator over tokens in the content streams.

width property

Width of the page in default user space units.

xobjects property

Return resolved and rendered Form XObjects.

This does not return any image or PostScript XObjects. You can get images via the images property. Apparently you aren't supposed to use PostScript XObjects for anything, ever.

Note that these are the XObjects as rendered on the page, so you may see the same named XObject multiple times. If you need to access their actual definitions you'll have to look at page.resources.

This will also return Form XObjects within Form XObjects, except in the case of circular reference chains.

__iter__()

Iterator over lazy layout objects.

Source code in playa/page.py
def __iter__(self) -> Iterator["ContentObject"]:
    """Iterator over lazy layout objects."""
    return iter(LazyInterpreter(self, self._contents))

extract_text()

Do some best-effort text extraction.

This necessarily involves a few heuristics, so don't get your hopes up. It will attempt to use marked content information for a tagged PDF, otherwise it will fall back on the character displacement and line matrix to determine word and line breaks.

Source code in playa/page.py
def extract_text(self) -> str:
    """Do some best-effort text extraction.

    This necessarily involves a few heuristics, so don't get your
    hopes up.  It will attempt to use marked content information
    for a tagged PDF, otherwise it will fall back on the character
    displacement and line matrix to determine word and line breaks.
    """
    if self.doc.is_tagged:
        return self.extract_text_tagged()
    else:
        return self.extract_text_untagged()

extract_text_tagged()

Get text from a page of a tagged PDF.

Source code in playa/page.py
def extract_text_tagged(self) -> str:
    """Get text from a page of a tagged PDF."""
    lines: List[str] = []
    strings: List[str] = []
    prev_mcid: Union[int, None] = None
    prev_origin: Union[Point, None] = None
    # TODO: Iteration over marked content sections and getting
    # their text, origin, and displacement, will be refactored
    for mcs, texts in itertools.groupby(self.texts, operator.attrgetter("mcs")):
        text: Union[TextObject, None] = None
        # TODO: Artifact can also be a structure element, but
        # also, any content outside the structure tree is
        # considered an artifact
        if mcs is None or mcs.tag == "Artifact":
            for text in texts:
                prev_origin = text.origin
            continue
        actual_text = mcs.props.get("ActualText")
        if actual_text is None:
            reversed = mcs.tag == "ReversedChars"
            c = []
            for text in texts:  # noqa: B031
                c.append(text.chars[::-1] if reversed else text.chars)
            chars = "".join(c)
        else:
            assert isinstance(actual_text, bytes)
            # It's a text string so decode_text it
            chars = decode_text(actual_text)
            # Consume all text objects to ensure correct graphicstate
            for _ in texts:  # noqa: B031
                pass

        # Remove soft hyphens
        chars = chars.replace("\xad", "")
        # There *might* be a line break, determine based on origin
        if mcs.mcid != prev_mcid:
            if self._next_line(text, prev_origin):
                lines.extend(textwrap.wrap("".join(strings)))
                strings.clear()
            prev_mcid = mcs.mcid
        strings.append(chars)
        if text is not None:
            prev_origin = text.origin
    if strings:
        lines.extend(textwrap.wrap("".join(strings)))
    return "\n".join(lines)
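
The loop above relies on itertools.groupby, which only merges adjacent items that share a key. Here is a minimal sketch of that grouping step. The Text class is a hypothetical stand-in (not PLAYA's TextObject), and the mcs values are plain strings purely for illustration:

```python
import itertools
import operator
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Text:
    """Hypothetical stand-in for a text object (not PLAYA's TextObject)."""

    mcs: Optional[str]  # enclosing marked content section, None if outside any
    chars: str


texts = [
    Text("P1", "Hel"),
    Text("P1", "lo"),
    Text(None, "page 3"),  # content outside any section, skipped below
    Text("P2", "World"),
]

runs: List[Tuple[Optional[str], str]] = []
for mcs, group in itertools.groupby(texts, operator.attrgetter("mcs")):
    # Each group is a run of consecutive text objects sharing one section
    runs.append((mcs, "".join(t.chars for t in group)))

tagged = [chars for mcs, chars in runs if mcs is not None]
print(tagged)  # ['Hello', 'World']
```

Because groupby never looks past adjacent items, a section that is interrupted and later resumed would show up as two separate runs.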

extract_text_untagged()

Get text from a page of an untagged PDF.

Source code in playa/page.py
def extract_text_untagged(self) -> str:
    """Get text from a page of an untagged PDF."""

    def _extract_text_from_obj(
        obj: "TextObject", vertical: bool, prev_end: float
    ) -> Tuple[str, float]:
        """Try to get text from a text object."""
        chars: List[str] = []
        for glyph in obj:
            x, y = glyph.origin
            off = y if vertical else x
            # 0.5 here is a heuristic!!!
            if prev_end and off - prev_end > 0.5:
                if chars and chars[-1] != " ":
                    chars.append(" ")
            if glyph.text is not None:
                chars.append(glyph.text)
            dx, dy = glyph.displacement
            prev_end = off + (dy if vertical else dx)
        return "".join(chars), prev_end

    prev_end = 0.0
    prev_origin: Union[Point, None] = None
    lines = []
    strings: List[str] = []
    for text in self.texts:
        if text.gstate.font is None:
            continue
        vertical = text.gstate.font.vertical
        # Track changes to the translation component of text
        # rendering matrix to (yes, heuristically) detect newlines
        # and spaces between text objects
        dx, dy = text.origin
        off = dy if vertical else dx
        if strings and self._next_line(text, prev_origin):
            lines.append("".join(strings))
            strings.clear()
        # 0.5 here is a heuristic!!!
        if strings and off - prev_end > 0.5 and not strings[-1].endswith(" "):
            strings.append(" ")
        textstr, prev_end = _extract_text_from_obj(text, vertical, off)
        strings.append(textstr)
        prev_origin = dx, dy
    if strings:
        lines.append("".join(strings))
    return "\n".join(lines)
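
The word-break heuristic can be tried in isolation. This is a simplified sketch for horizontal text only. The (text, x, dx) tuples and the join_glyphs name are assumptions made for illustration, and unlike the listing above it uses None rather than 0.0 to mean "no previous glyph":

```python
from typing import List, Optional, Tuple

# Hypothetical glyph record: (text, x origin, horizontal displacement)
Glyph = Tuple[str, float, float]


def join_glyphs(glyphs: List[Glyph], threshold: float = 0.5) -> str:
    """Join glyph texts, inserting a space where a gap exceeds threshold."""
    chars: List[str] = []
    prev_end: Optional[float] = None
    for text, x, dx in glyphs:
        # A gap between the end of the previous glyph and this origin
        # is taken as a word break (the threshold is a heuristic)
        if prev_end is not None and x - prev_end > threshold:
            if chars and chars[-1] != " ":
                chars.append(" ")
        chars.append(text)
        prev_end = x + dx
    return "".join(chars)


glyphs = [("H", 0.0, 6.0), ("i", 6.0, 3.0), ("y", 12.0, 5.0), ("o", 17.0, 5.0)]
print(join_glyphs(glyphs))  # Hi yo
```

The gap of 3 units between "i" (ending at 9.0) and "y" (starting at 12.0) is what produces the space.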

flatten(filter_class=None)

flatten() -> Iterator[ContentObject]
flatten(filter_class: Type[CO]) -> Iterator[CO]

Iterate over content objects, recursing into form XObjects.

Source code in playa/page.py
def flatten(
    self, filter_class: Union[None, Type[CO]] = None
) -> Iterator[Union[CO, "ContentObject"]]:
    """Iterate over content objects, recursing into form XObjects."""

    from typing import Set

    def flatten_one(
        itor: Iterable["ContentObject"], parents: Set[int]
    ) -> Iterator["ContentObject"]:
        for obj in itor:
            if isinstance(obj, XObjectObject):
                stream_id = 0 if obj.stream.objid is None else obj.stream.objid
                if stream_id not in parents:
                    yield from flatten_one(obj, parents | {stream_id})
            else:
                yield obj

    if filter_class is None:
        yield from flatten_one(self, set())
    else:
        for obj in flatten_one(self, set()):
            if isinstance(obj, filter_class):
                yield obj
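
The guard against circular references is easier to see on a toy structure. In this sketch Leaf, Form, and flatten_sketch are hypothetical stand-ins for content objects and Form XObjects. Only the technique of carrying a set of parent IDs down the recursion is taken from the listing above:

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, Iterator, List, Union


@dataclass
class Leaf:
    """Hypothetical non-XObject content object."""

    name: str


@dataclass
class Form:
    """Hypothetical Form XObject holding nested content."""

    objid: int
    children: List[Union["Form", Leaf]] = field(default_factory=list)


def flatten_sketch(
    items: Iterable[Union[Form, Leaf]], parents: FrozenSet[int] = frozenset()
) -> Iterator[Leaf]:
    for obj in items:
        if isinstance(obj, Form):
            # Only descend if this form is not one of its own ancestors
            if obj.objid not in parents:
                yield from flatten_sketch(obj.children, parents | {obj.objid})
        else:
            yield obj


inner = Form(2, [Leaf("b")])
outer = Form(1, [Leaf("a"), inner])
inner.children.append(outer)  # deliberately circular: outer -> inner -> outer

names = [leaf.name for leaf in flatten_sketch([outer, Leaf("c")])]
print(names)  # ['a', 'b', 'c']
```

Since the set holds ancestors rather than everything seen so far, a form that is drawn twice in different places is still visited twice.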

set_initial_ctm(space, rotate)

Set or update initial coordinate transform matrix.

PDF 1.7 section 8.4.1: Initial value: a matrix that transforms default user coordinates to device coordinates.

We keep this as self.ctm in order to transform layout attributes in tagged PDFs which are specified in default user space (PDF 1.7 section 14.8.5.4.3, table 344).

If you wish to modify the rotation or the device space of the page, then you can do it with this method (the initial values are in the rotate and space properties).

Source code in playa/page.py
def set_initial_ctm(self, space: DeviceSpace, rotate: int) -> Matrix:
    """
    Set or update initial coordinate transform matrix.

    PDF 1.7 section 8.4.1: Initial value: a matrix that
    transforms default user coordinates to device coordinates.

    We keep this as `self.ctm` in order to transform layout
    attributes in tagged PDFs which are specified in default
    user space (PDF 1.7 section 14.8.5.4.3, table 344)

    If you wish to modify the rotation or the device space of the
    page, then you can do it with this method (the initial values
    are in the `rotate` and `space` properties).
    """
    # Normalize the rotation value
    rotate = (rotate + 360) % 360
    x0, y0, x1, y1 = self.mediabox
    width = x1 - x0
    height = y1 - y0
    self.ctm = MATRIX_IDENTITY
    if rotate == 90:
        # x' = y
        # y' = width - x
        self.ctm = (0, -1, 1, 0, 0, width)
    elif rotate == 180:
        # x' = width - x
        # y' = height - y
        self.ctm = (-1, 0, 0, -1, width, height)
    elif rotate == 270:
        # x' = height - y
        # y' = x
        self.ctm = (0, 1, -1, 0, height, 0)
    elif rotate != 0:
        log.warning(
            "Invalid rotation value %r (only multiples of 90 accepted)", rotate
        )
    # Apply this to the mediabox to determine device space
    (x0, y0, x1, y1) = transform_bbox(self.ctm, self.mediabox)
    width = x1 - x0
    height = y1 - y0
    # "screen" device space: origin is top left of MediaBox
    if space == "screen":
        self.ctm = mult_matrix(self.ctm, (1, 0, 0, -1, -x0, y1))
    # "page" device space: origin is bottom left of MediaBox
    elif space == "page":
        self.ctm = mult_matrix(self.ctm, (1, 0, 0, 1, -x0, -y0))
    # "default" device space: no transformation or rotation
    else:
        if space != "default":
            log.warning("Unknown device space: %r", space)
        self.ctm = MATRIX_IDENTITY
        width = height = 0
    self.space = space
    self.rotate = rotate
    return self.ctm
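
As a worked check of the rotation matrices above, this sketch applies them to the corners of a 612 x 792 page. The apply_matrix_pt helper here is a local re-implementation that assumes the usual PDF convention (x' = a*x + c*y + e, y' = b*x + d*y + f), and the page size is an arbitrary example:

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]
Point = Tuple[float, float]


def apply_matrix_pt(m: Matrix, pt: Point) -> Point:
    """Apply a PDF matrix (a, b, c, d, e, f) to a point (local helper)."""
    a, b, c, d, e, f = m
    x, y = pt
    return (a * x + c * y + e, b * x + d * y + f)


width, height = 612, 792  # example MediaBox size in default user space units
rot90: Matrix = (0, -1, 1, 0, 0, width)
rot180: Matrix = (-1, 0, 0, -1, width, height)

# Opposite corners of the MediaBox after a 90 degree rotation
corners90 = [apply_matrix_pt(rot90, p) for p in [(0, 0), (width, height)]]
print(corners90)  # [(0, 612), (792, 0)]

# A 180 degree rotation swaps the two corners
corners180 = [apply_matrix_pt(rot180, p) for p in [(0, 0), (width, height)]]
print(corners180)  # [(612, 792), (0, 0)]
```

After the 90 degree rotation the corners span 792 units in x and 612 in y, which is why the method recomputes width and height from the transformed MediaBox before applying the device space.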

PDFTypeError

Bases: PDFException

TypeError, but for PDFs (not a subclass of TypeError, unlike in pdfminer.six)

Source code in playa/miner.py
class PDFTypeError(PDFException):
    """
    TypeError, but for PDFs (not a subclass of TypeError, unlike in pdfminer.six)
    """

    pass

PDFValueError

Bases: PDFException

ValueError, but for PDFs (not a subclass of ValueError, unlike in pdfminer.six)

Source code in playa/miner.py
class PDFValueError(PDFException):
    """
    ValueError, but for PDFs (not a subclass of ValueError, unlike in pdfminer.six)
    """

    pass

PSLiteral

A class that represents a PostScript literal.

PostScript literals are used as identifiers, such as variable names, property names and dictionary keys. Literals are case sensitive and denoted by a preceding slash sign (e.g. "/Name").

Note: Do not create an instance of PSLiteral directly. Always use PSLiteralTable.intern().

Source code in playa/pdftypes.py
class PSLiteral:
    """A class that represents a PostScript literal.

    PostScript literals are used as identifiers, such as
    variable names, property names and dictionary keys.
    Literals are case sensitive and denoted by a preceding
    slash sign (e.g. "/Name")

    Note: Do not create an instance of PSLiteral directly.
    Always use PSLiteralTable.intern().
    """

    def __init__(self, name: str) -> None:
        self.name = name

    def __repr__(self) -> str:
        return "/%r" % self.name

Plane

Bases: Generic[LTComponentT]

A set-like data structure for objects placed on a plane.

Can efficiently find objects in a certain rectangular area. It buckets objects into a grid of square cells (gridsize units on a side), so a query only has to examine the cells that its bounding box touches.

Source code in playa/miner.py
class Plane(Generic[LTComponentT]):
    """A set-like data structure for objects placed on a plane.

    Can efficiently find objects in a certain rectangular area.
    It buckets objects into a grid of square cells (`gridsize`
    units on a side), so a query only examines the cells it touches.
    """

    def __init__(self, bbox: Rect, gridsize: int = 50) -> None:
        self._seq: List[LTComponentT] = []  # preserve the object order.
        self._objs: Dict[int, LTComponentT] = {}  # store unique objects
        self._grid: Dict[Point, List[LTComponentT]] = {}
        self.gridsize = gridsize
        (self.x0, self.y0, self.x1, self.y1) = bbox

    def __repr__(self) -> str:
        return "<Plane objs=%r>" % list(self)

    def __iter__(self) -> Iterator[LTComponentT]:
        for obj in self._seq:
            if id(obj) in self._objs:
                yield obj

    def __len__(self) -> int:
        return len(self._objs)

    def __contains__(self, obj: LTComponentT) -> bool:
        return id(obj) in self._objs

    def _getrange(self, bbox: Rect) -> Iterator[Point]:
        (x0, y0, x1, y1) = bbox
        if x1 <= self.x0 or self.x1 <= x0 or y1 <= self.y0 or self.y1 <= y0:
            return
        x0 = max(self.x0, x0)
        y0 = max(self.y0, y0)
        x1 = min(self.x1, x1)
        y1 = min(self.y1, y1)
        for grid_y in drange(y0, y1, self.gridsize):
            for grid_x in drange(x0, x1, self.gridsize):
                yield (grid_x, grid_y)

    def extend(self, objs: Iterable[LTComponentT]) -> None:
        for obj in objs:
            self.add(obj)

    def add(self, obj: LTComponentT) -> None:
        """Place an object."""
        for k in self._getrange((obj.x0, obj.y0, obj.x1, obj.y1)):
            if k not in self._grid:
                r: List[LTComponentT] = []
                self._grid[k] = r
            else:
                r = self._grid[k]
            r.append(obj)
        self._seq.append(obj)
        self._objs[id(obj)] = obj

    def remove(self, obj: LTComponentT) -> None:
        """Displace an object."""
        for k in self._getrange((obj.x0, obj.y0, obj.x1, obj.y1)):
            try:
                self._grid[k].remove(obj)
            except (KeyError, ValueError):
                pass
        del self._objs[id(obj)]

    def find(self, bbox: Rect) -> Iterator[LTComponentT]:
        """Finds objects that are in a certain area."""
        (x0, y0, x1, y1) = bbox
        done: Set[int] = set()
        for k in self._getrange(bbox):
            if k not in self._grid:
                continue
            for obj in self._grid[k]:
                if id(obj) in done:
                    continue
                done.add(id(obj))
                if obj.x1 <= x0 or x1 <= obj.x0 or obj.y1 <= y0 or y1 <= obj.y0:
                    continue
                yield obj
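
The grid lookup in the listing above can be reduced to a small standalone sketch. Box and GridIndex are hypothetical names, and this version leaves out the clipping to the plane's own bounding box as well as removal and ordering:

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Set, Tuple


@dataclass(eq=False)  # compare by identity, as the listing above does with id()
class Box:
    """Hypothetical stand-in for a layout component: just a bounding box."""

    x0: float
    y0: float
    x1: float
    y1: float


class GridIndex:
    def __init__(self, gridsize: int = 50) -> None:
        self.gridsize = gridsize
        self._grid: Dict[Tuple[int, int], List[Box]] = {}

    def _cells(
        self, x0: float, y0: float, x1: float, y1: float
    ) -> Iterator[Tuple[int, int]]:
        d = self.gridsize
        for gy in range(int(y0) // d, int(y1 + d) // d):
            for gx in range(int(x0) // d, int(x1 + d) // d):
                yield (gx, gy)

    def add(self, box: Box) -> None:
        # Register the box in every grid cell its bounding box touches
        for cell in self._cells(box.x0, box.y0, box.x1, box.y1):
            self._grid.setdefault(cell, []).append(box)

    def find(self, x0: float, y0: float, x1: float, y1: float) -> Iterator[Box]:
        seen: Set[int] = set()
        for cell in self._cells(x0, y0, x1, y1):
            for box in self._grid.get(cell, []):
                if id(box) in seen:
                    continue
                seen.add(id(box))
                # Sharing a cell is only a coarse filter; confirm real overlap
                if box.x1 <= x0 or x1 <= box.x0 or box.y1 <= y0 or y1 <= box.y0:
                    continue
                yield box


index = GridIndex()
near = Box(10, 10, 40, 40)
far = Box(300, 300, 340, 340)
index.add(near)
index.add(far)
hits = list(index.find(0, 0, 60, 60))
print(len(hits))  # 1
```

The query never looks at the cell holding far, so the cost depends on how many objects sit near the query rather than on the total number of objects.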

add(obj)

Place an object.

Source code in playa/miner.py
def add(self, obj: LTComponentT) -> None:
    """Place an object."""
    for k in self._getrange((obj.x0, obj.y0, obj.x1, obj.y1)):
        if k not in self._grid:
            r: List[LTComponentT] = []
            self._grid[k] = r
        else:
            r = self._grid[k]
        r.append(obj)
    self._seq.append(obj)
    self._objs[id(obj)] = obj

find(bbox)

Finds objects that are in a certain area.

Source code in playa/miner.py
def find(self, bbox: Rect) -> Iterator[LTComponentT]:
    """Finds objects that are in a certain area."""
    (x0, y0, x1, y1) = bbox
    done: Set[int] = set()
    for k in self._getrange(bbox):
        if k not in self._grid:
            continue
        for obj in self._grid[k]:
            if id(obj) in done:
                continue
            done.add(id(obj))
            if obj.x1 <= x0 or x1 <= obj.x0 or obj.y1 <= y0 or y1 <= obj.y0:
                continue
            yield obj

remove(obj)

Displace an object.

Source code in playa/miner.py
def remove(self, obj: LTComponentT) -> None:
    """Displace an object."""
    for k in self._getrange((obj.x0, obj.y0, obj.x1, obj.y1)):
        try:
            self._grid[k].remove(obj)
        except (KeyError, ValueError):
            pass
    del self._objs[id(obj)]

decode_text(s)

Decodes a text string (see PDF 1.7 section 7.9.2.2 - it could be PDFDocEncoding or UTF-16BE) to a str.

Source code in playa/utils.py
def decode_text(s: Union[str, bytes]) -> str:
    """Decodes a text string (see PDF 1.7 section 7.9.2.2 - it could
    be PDFDocEncoding or UTF-16BE) to a `str`.
    """
    # Sure, it could be UTF-16LE... \/\/hatever...
    if isinstance(s, bytes) and (
        s.startswith(b"\xfe\xff") or s.startswith(b"\xff\xfe")
    ):
        try:
            return s.decode("UTF-16")
        except UnicodeDecodeError:
            # Sure, it could have a BOM and not actually be UTF-16, \/\/TF...
            s = s[2:]
    try:
        # FIXME: This seems bad. If it's already a `str` then what are
        # those PDFDocEncoding characters doing in it?!?
        if isinstance(s, str):
            return "".join(PDFDocEncoding[ord(c)] for c in s)
        else:
            return "".join(PDFDocEncoding[c] for c in s)
    except IndexError:
        return str(s)
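
The byte-order-mark handling can be tried without PLAYA. In this sketch Latin-1 stands in for the PDFDocEncoding table, which is an assumption made only to keep the example self-contained (the two encodings differ for some code points), and decode_text_sketch is a hypothetical name:

```python
def decode_text_sketch(s: bytes) -> str:
    """Decode a PDF text string, trying UTF-16 when a BOM is present."""
    if s.startswith(b"\xfe\xff") or s.startswith(b"\xff\xfe"):
        try:
            # Python's "UTF-16" codec reads the BOM to choose the byte order
            return s.decode("UTF-16")
        except UnicodeDecodeError:
            # BOM present but the rest is not valid UTF-16: drop the BOM
            s = s[2:]
    # Assumption: Latin-1 as a stand-in for the PDFDocEncoding lookup
    return s.decode("latin-1")


print(decode_text_sketch(b"\xfe\xff\x00H\x00i"))  # Hi
print(decode_text_sketch(b"Hi"))  # Hi
```

A string such as b"\xfe\xffabc" has a BOM followed by an odd number of bytes, so the UTF-16 attempt fails and the sketch falls back to decoding "abc" byte by byte.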

drange(v0, v1, d)

Returns a discrete range.

Source code in playa/miner.py
def drange(v0: float, v1: float, d: int) -> range:
    """Returns a discrete range."""
    return range(int(v0) // d, int(v1 + d) // d)
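
To make the arithmetic concrete, here is the same function copied into a standalone snippet with a couple of inputs. The gridsize of 50 matches the default used by Plane, and the coordinates are arbitrary:

```python
def drange(v0: float, v1: float, d: int) -> range:
    """Return the indices of the size-d grid cells touched by [v0, v1]."""
    return range(int(v0) // d, int(v1 + d) // d)


# An interval from 10 to 120 touches cells 0 (0-50), 1 (50-100), 2 (100-150)
print(list(drange(10.0, 120.0, 50)))  # [0, 1, 2]

# An interval ending exactly on a cell boundary also includes the next cell
print(list(drange(50.0, 100.0, 50)))  # [1, 2]
```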

extract(path, laparams=None, max_workers=1, mp_context=None)

Extract LTPages from a document.

Source code in playa/miner.py
def extract(
    path: Path,
    laparams: Union[LAParams, None] = None,
    max_workers: Union[int, None] = 1,
    mp_context: Union[BaseContext, None] = None,
) -> Iterator[LTPage]:
    """Extract LTPages from a document."""
    if max_workers is None:
        max_workers = multiprocessing.cpu_count()
    with playa.open(
        path,
        space="page",
        max_workers=max_workers,
        mp_context=mp_context,
    ) as pdf:
        if max_workers == 1:
            for page in pdf.pages:
                yield extract_page(page, laparams)
        else:
            yield from pdf.pages.map(partial(extract_page, laparams=laparams))

extract_page(page, laparams=None)

Extract an LTPage from a Page, and possibly do some layout analysis.

Parameters:

  • page (Page, required): a Page as returned by PLAYA (please create this with space="page" if you want pdfminer.six compatibility).
  • laparams (Union[LAParams, None], default None): if None, no layout analysis is done. Otherwise do some kind of heuristic magic that all "Artificial Intelligence" depends on but nobody actually understands.

Returns:

  • LTPage: An analysis of the page as pdfminer.six would give you.

Source code in playa/miner.py
def extract_page(page: Page, laparams: Union[LAParams, None] = None) -> LTPage:
    """Extract an LTPage from a Page, and possibly do some layout analysis.

    Args:
        page: a Page as returned by PLAYA (please create this with
              space="page" if you want pdfminer.six compatibility).
        laparams: if None, no layout analysis is done. Otherwise do
                  some kind of heuristic magic that all "Artificial
                  Intelligence" depends on but nobody actually
                  understands.

    Returns:
        An analysis of the page as `pdfminer.six` would give you.
    """
    # This is the mediabox in device space rather than default user
    # space, which is the source of some confusion
    (x0, y0, x1, y1) = page.mediabox
    # Note that a page can never be rotated by a non-multiple of 90
    # degrees (pi / 2 for nerds) so that's why we only care about two
    # of its corners
    (x0, y0) = apply_matrix_pt(page.ctm, (x0, y0))
    (x1, y1) = apply_matrix_pt(page.ctm, (x1, y1))
    # FIXME: The translation of the mediabox here is useless due to
    # the above transformation (but this should be verified against
    # pdfminer.six)
    mediabox = (0, 0, abs(x0 - x1), abs(y0 - y1))
    ltpage = LTPage(page.page_idx + 1, mediabox)

    # Emulating PDFLayoutAnalyzer is fairly simple and maps almost
    # directly onto PLAYA's lazy API.  XObjects and inline images
    # produce an LTFigure, characters produce an LTChar, everything
    # else produces an LTLine, LTRect, or LTCurve.
    for obj in page:
        # Put this in some functions to avoid isinstance abuse
        for item in process_object(obj):
            ltpage.add(item)

    if laparams is not None:
        ltpage.analyze(laparams)

    return ltpage

fsplit(pred, objs)

Split a list into two classes according to the predicate.

Source code in playa/miner.py
def fsplit(pred: Callable[[_T], bool], objs: Iterable[_T]) -> Tuple[List[_T], List[_T]]:
    """Split a list into two classes according to the predicate."""
    t = []
    f = []
    for obj in objs:
        if pred(obj):
            t.append(obj)
        else:
            f.append(obj)
    return t, f
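
A short usage example, with the function copied here so the snippet runs on its own. Note that the matching items come first in the returned pair:

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

_T = TypeVar("_T")


def fsplit(pred: Callable[[_T], bool], objs: Iterable[_T]) -> Tuple[List[_T], List[_T]]:
    """Split a list into two classes according to the predicate."""
    t = []
    f = []
    for obj in objs:
        if pred(obj):
            t.append(obj)
        else:
            f.append(obj)
    return t, f


evens, odds = fsplit(lambda n: n % 2 == 0, range(6))
print(evens, odds)  # [0, 2, 4] [1, 3, 5]
```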

make_path_segment(op, points)

Create a type-safe PathSegment, unlike pdfminer.six.

Source code in playa/miner.py
def make_path_segment(op: PathOperator, points: List[Point]) -> PathSegment:
    """Create a type-safe PathSegment, unlike pdfminer.six."""
    if len(points) == 0:
        if op != "h":
            raise ValueError(f"Incorrect arguments for {op!r}: {points!r}")
        return (str(op),)
    if len(points) == 1:
        if op not in "ml":
            raise ValueError(f"Incorrect arguments for {op!r}: {points!r}")
        return (str(op), points[0])
    if len(points) == 2:
        if op not in "vy":
            raise ValueError(f"Incorrect arguments for {op!r}: {points!r}")
        return (str(op), points[0], points[1])
    if len(points) == 3:
        if op != "c":
            raise ValueError(f"Incorrect arguments for {op!r}: {points!r}")
        return (str(op), points[0], points[1], points[2])
    raise ValueError(f"Path segment has unknown number of points: {op!r} {points!r}")
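
The rule being enforced is that each path operator takes a fixed number of points. One way to restate it is as a lookup table. This is a sketch rather than the library's code: EXPECTED_POINTS and check_segment are hypothetical names, and the counts are read off the listing above:

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

# Number of points each path operator takes, as in the checks above
EXPECTED_POINTS: Dict[str, int] = {"h": 0, "m": 1, "l": 1, "v": 2, "y": 2, "c": 3}


def check_segment(op: str, points: List[Point]) -> tuple:
    """Return (op, *points) if the point count fits the operator."""
    expected = EXPECTED_POINTS.get(op)
    if expected is None or expected != len(points):
        raise ValueError(f"Incorrect arguments for {op!r}: {points!r}")
    return (op, *points)


print(check_segment("l", [(1.0, 2.0)]))  # ('l', (1.0, 2.0))
print(check_segment("h", []))  # ('h',)
```

Passing a single point with the "c" operator, which expects three, raises ValueError.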

process_object(obj)

Handle obj according to its type

Source code in playa/miner.py
@singledispatch
def process_object(obj: ContentObject) -> Iterator[LTComponent]:
    """Handle obj according to its type"""
    yield from ()

resolve1(x, default=None)

Resolves an object.

If this is an array or dictionary, it may still contain some indirect objects inside.

Source code in playa/pdftypes.py
def resolve1(x: PDFObject, default: PDFObject = None) -> PDFObject:
    """Resolves an object.

    If this is an array or dictionary, it may still contain
    some indirect objects inside.
    """
    while isinstance(x, ObjRef):
        x = x.resolve(default=default)
    return x

resolve_all(x, default=None)

Resolves all indirect object references inside the given object.

This creates new copies of any lists or dictionaries, so the original object is not modified. However, it will ultimately create circular references if they exist, so beware.

Source code in playa/pdftypes.py
def resolve_all(x: PDFObject, default: PDFObject = None) -> PDFObject:
    """Resolves all indirect object references inside the given object.

    This creates new copies of any lists or dictionaries, so the
    original object is not modified.  However, it will ultimately
    create circular references if they exist, so beware.
    """

    def resolver(
        x: PDFObject, default: PDFObject, seen: Dict[int, PDFObject]
    ) -> PDFObject:
        if isinstance(x, ObjRef):
            ref = x
            while isinstance(x, ObjRef):
                if x.objid in seen:
                    return seen[x.objid]
                x = x.resolve(default=default)
            seen[ref.objid] = x
        if isinstance(x, list):
            return [resolver(v, default, seen) for v in x]
        elif isinstance(x, dict):
            return {k: resolver(v, default, seen) for k, v in x.items()}
        return x

    return resolver(x, default, {})
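
The difference between resolve1 and resolve_all shows up as soon as a reference points at a container holding further references. The sketch below mimics both with a hypothetical Ref class backed by a plain dict. It is not PLAYA's ObjRef, and the default argument is left out:

```python
from typing import Any, Dict, Optional


class Ref:
    """Hypothetical indirect reference that looks itself up in a table."""

    def __init__(self, table: Dict[int, Any], objid: int) -> None:
        self.table = table
        self.objid = objid

    def resolve(self) -> Any:
        return self.table.get(self.objid)


def resolve1_sketch(x: Any) -> Any:
    # Follow references until reaching a non-reference, but go no deeper
    while isinstance(x, Ref):
        x = x.resolve()
    return x


def resolve_all_sketch(x: Any, seen: Optional[Dict[int, Any]] = None) -> Any:
    seen = {} if seen is None else seen
    if isinstance(x, Ref):
        ref = x
        while isinstance(x, Ref):
            if x.objid in seen:
                return seen[x.objid]  # already visited: stop recursing
            x = x.resolve()
        seen[ref.objid] = x
    if isinstance(x, list):
        return [resolve_all_sketch(v, seen) for v in x]
    if isinstance(x, dict):
        return {k: resolve_all_sketch(v, seen) for k, v in x.items()}
    return x


table: Dict[int, Any] = {}
table[1] = {"Kids": [Ref(table, 2)]}
table[2] = {"Name": "leaf"}
top = Ref(table, 1)

shallow = resolve1_sketch(top)  # the dict, still holding a Ref in "Kids"
deep = resolve_all_sketch(top)  # fresh containers with no Ref left inside
print(deep)  # {'Kids': [{'Name': 'leaf'}]}
```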

subpaths(path)

Iterate over "subpaths".

Note: subpaths inherit the values of fill and evenodd from the parent path, but these values are no longer meaningful since the winding rules must be applied to the composite path as a whole (this is not a bug, just don't rely on them to know which regions are filled or not).

Source code in playa/miner.py
def subpaths(path: PathObject) -> Iterator[PathObject]:
    """Iterate over "subpaths".

    Note: subpaths inherit the values of `fill` and `evenodd` from
    the parent path, but these values are no longer meaningful
    since the winding rules must be applied to the composite path
    as a whole (this is not a bug, just don't rely on them to know
    which regions are filled or not).

    """
    # FIXME: Is there an itertool or a more_itertool for this?
    segs: List[PLAYAPathSegment] = []
    for seg in path.raw_segments:
        if seg.operator == "m" and segs:
            yield PathObject(
                _pageref=path._pageref,
                _parentkey=path._parentkey,
                gstate=path.gstate,
                ctm=path.ctm,
                mcstack=path.mcstack,
                raw_segments=segs,
                stroke=path.stroke,
                fill=path.fill,
                evenodd=path.evenodd,
            )
            segs = []
        segs.append(seg)
    if segs:
        yield PathObject(
            _pageref=path._pageref,
            _parentkey=path._parentkey,
            gstate=path.gstate,
            ctm=path.ctm,
            mcstack=path.mcstack,
            raw_segments=segs,
            stroke=path.stroke,
            fill=path.fill,
            evenodd=path.evenodd,
        )
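
Stripped of the PathObject bookkeeping, the splitting rule is simply "start a new group at every m operator". A minimal sketch over plain tuples, where the Segment layout and the split_subpaths name are assumptions for illustration:

```python
from typing import Iterator, List, Tuple

# Hypothetical segment record: (operator, flat tuple of coordinates)
Segment = Tuple[str, Tuple[float, ...]]


def split_subpaths(segments: List[Segment]) -> Iterator[List[Segment]]:
    current: List[Segment] = []
    for seg in segments:
        # A moveto begins a new subpath, unless nothing was collected yet
        if seg[0] == "m" and current:
            yield current
            current = []
        current.append(seg)
    if current:
        yield current


path: List[Segment] = [
    ("m", (0, 0)),
    ("l", (10, 0)),
    ("h", ()),
    ("m", (20, 20)),
    ("l", (30, 30)),
]
parts = list(split_subpaths(path))
print([[op for op, _ in part] for part in parts])  # [['m', 'l', 'h'], ['m', 'l']]
```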

uniq(objs)

Eliminates duplicated elements.

Source code in playa/miner.py
def uniq(objs: Iterable[_T]) -> Iterator[_T]:
    """Eliminates duplicated elements."""
    # Duplicated here means the same object (this horrible code was
    # horribly written without any notion of hashable or non-hashable
    # types, SMH)
    done: Set[int] = set()
    for obj in objs:
        if id(obj) in done:
            continue
        done.add(id(obj))
        yield obj
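
Since "duplicated" means the same object rather than an equal one, two equal lists both survive. A short demonstration, with the function copied here so the snippet runs on its own:

```python
from typing import Iterable, Iterator, Set, TypeVar

_T = TypeVar("_T")


def uniq(objs: Iterable[_T]) -> Iterator[_T]:
    """Eliminates duplicated elements (by identity, not equality)."""
    done: Set[int] = set()
    for obj in objs:
        if id(obj) in done:
            continue
        done.add(id(obj))
        yield obj


a = [1, 2]
b = [1, 2]  # equal to a, but a distinct object
result = list(uniq([a, b, a]))
print(len(result))  # 2
```

This also works for unhashable items such as lists, which a set-based deduplication on the values themselves would reject.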