Esoteric Mysteries of the PDF
by Sarah Warlick, content director
Have you ever copied text from a PDF file and pasted it somewhere else, comfortable in the knowledge that it had already been proofed and approved, only to find that certain letters were missing? If not, it’s time to get a little uncomfortable. That’s because there’s a strong likelihood you have used this finalized copy and never noticed that ‘fi’ had been systematically removed, leaving only a blank space where the letters were intended to appear.
This and other weird phenomena can sometimes happen when copying from PDF documents to Word. The variability stems from the fact that PDF creation tools rely on specific “behind-the-scenes” conventions that have changed over the years. A file made with an older tool may replace ‘fi’ with that sneaky blank space, while a document made with a newer PDF generator will show the words intact.
But what causes the problem to begin with? Here we must dig deep into the mysterious world of codes, glyphs, fonts and metadata. First, let’s look at the nature of PDFs themselves. These documents behave much like electronic printouts that combine graphics and text into a single format (the Portable Document Format, thus PDF). Unlike word processors such as MS Word or Apple Pages, PDF creators don’t concern themselves with the text as an entity made up of words that need to make sense. These tools instead focus on the way each page looks, so that the file appears the same whether it’s viewed online, sent as an attachment, or printed on paper or another medium.
Take a deep breath. We’re going to get a bit technical, so bear with me. Text included in a PDF is broken down into glyphs, or individual symbols. These are identified with codes that can be mapped to Unicode or ASCII in many cases, but not always. The PDF-producing application may or may not include the information that identifies the text as text rather than merely a long series of glyphs to be reproduced. This means that you might not be able to select the text at all, search it for particular words, or copy and paste it accurately.
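For the curious, here is a minimal Python sketch of that mapping step. Everything in it is invented for illustration: the glyph codes, the lookup tables and the `extract_text` helper are assumptions, not the internals of any real PDF tool. It simply shows what happens when one glyph (here, a single symbol standing for the letter pair ‘fi’) has no entry in the lookup table.

```python
# Hypothetical glyph codes for the word "final", as a PDF might store them.
# The first code stands for a single glyph that draws the pair "fi".
GLYPH_CODES = [0x1A, 0x31, 0x24, 0x2F]

# A complete code-to-text table, like one a careful PDF producer might embed.
COMPLETE_MAP = {0x1A: "fi", 0x31: "n", 0x24: "a", 0x2F: "l"}

# The same table with the "fi" entry left out.
INCOMPLETE_MAP = {code: text for code, text in COMPLETE_MAP.items() if code != 0x1A}


def extract_text(codes, code_to_text, missing=" "):
    """Turn glyph codes into text, substituting `missing` for unmapped glyphs."""
    return "".join(code_to_text.get(code, missing) for code in codes)


print(repr(extract_text(GLYPH_CODES, COMPLETE_MAP)))    # 'final'
print(repr(extract_text(GLYPH_CODES, INCOMPLETE_MAP)))  # ' nal'
```

With the complete table the word comes out whole. With the incomplete one, the unmapped glyph turns into a blank, which is the kind of damage described at the start of this article.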
PDF readers have the difficult task of taking the coded data and reproducing it as it appeared in the design tool, including the fonts that were originally embedded. The number of possible fonts is staggering, and files arrive from every kind of source, from desktop publishing programs to typesetting systems such as LaTeX, one of the older tools used to produce PDFs. The readers must also determine where a space should appear, usually by identifying variation in the space between the glyphs in the code.
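The space-guessing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Glyph` record, the `join_with_spaces` helper and the 0.3 threshold are all hypothetical, and real readers use more elaborate rules.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    text: str     # the text this glyph maps to
    x: float      # left edge on the page, in points
    width: float  # how much horizontal room the glyph takes, in points


def join_with_spaces(glyphs, gap_ratio=0.3):
    """Insert a space wherever the gap after a glyph is unusually wide."""
    if not glyphs:
        return ""
    pieces = [glyphs[0].text]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur.x - (prev.x + prev.width)
        # Assumed rule of thumb: a gap wider than 30% of the previous
        # glyph's width is treated as a word break.
        if gap > gap_ratio * prev.width:
            pieces.append(" ")
        pieces.append(cur.text)
    return "".join(pieces)


glyphs = [Glyph("t", 0, 3), Glyph("o", 3, 5), Glyph("g", 12, 5), Glyph("o", 17, 5)]
print(join_with_spaces(glyphs))  # to go
```

The catch is that the reader is guessing. Any glyph that ends up narrower or wider than expected can push a gap over the threshold and produce a space that was never in the original copy.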
There’s a lot of room for miscalculation in this complicated sequence of coding, translating and recoding. Ligatures (sets of joined letters) are particularly prone to trouble, as certain pairs of letters often appear as a ligature in one font but not another. The letters ‘f’ and ‘i’ frequently form a ligature when they appear in that sequence. If the application that creates a PDF handles the two letters as a single ligature and then the reader application does not utilize the same font, the latter tool can identify a spacing anomaly in the glyph series and thus assume it’s time to add a space. It displays the space obediently, in the process wreaking havoc on what was once lovely copy.
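Sometimes the ligature survives the trip, but as a single special character rather than two ordinary letters. Unicode reserves code points for several of these, including U+FB01 for ‘fi’. In that case a small cleanup step can restore plain letters. The sketch below is one possible approach, and the `expand_ligatures` name is an assumption made for this example. It cannot help when the ligature has already been replaced by a blank.

```python
# Unicode's dedicated ligature characters and the plain letters they stand for.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.maketrans accepts a dict of single characters mapped to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace ligature characters with their ordinary letters."""
    return text.translate(_TABLE)


pasted = "approach the \ufb01nish line with con\ufb01dence"
print(expand_ligatures(pasted))  # approach the finish line with confidence
```

Running pasted copy through a step like this makes it searchable and spell-checkable again, since most tools look for the letters ‘f’ and ‘i’ rather than the combined character.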
One way to avoid this type of problem is to encourage the use of current PDF production applications, such as those built on the newer XeLaTeX engine, which are more likely to embed the information needed to turn ligatures back into letters. But since converting data to and from the PDF format involves tons of complex data translation, the only real answer is to carefully re-proof every bit of copy that comes out. Yes, it adds time, but it’s better than unwittingly sharing content that urges readers to approach the nish line with con dence or learn about the rm’s nancial services today, or something just as silly. See how that works?
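A script can’t replace a careful proofreader, but it can point at likely trouble spots. The Python sketch below is a rough aid built on assumptions: `KNOWN_WORDS` is a tiny stand-in for a real dictionary, and `suggest_repairs` is a hypothetical helper that only checks whether putting common ligature letters back would produce a known word.

```python
import re

# Letter groups that commonly form ligatures, longest first.
LIGATURE_LETTERS = ("ffi", "ffl", "fi", "fl", "ff")

# Tiny stand-in vocabulary; a real check would load a full word list.
KNOWN_WORDS = {
    "approach", "the", "finish", "line", "with",
    "confidence", "firm", "financial", "services",
}


def suggest_repairs(text, known_words=KNOWN_WORDS):
    """Return (damaged, suggestion) pairs for likely ligature casualties."""
    tokens = re.findall(r"[a-z]+", text.lower())
    suggestions = []
    skip_next = False
    for i, token in enumerate(tokens):
        if skip_next:
            skip_next = False
            continue
        if token in known_words:
            continue
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        for letters in LIGATURE_LETTERS:
            # Gap in the middle of a word: "con dence" -> "confidence".
            if following and token + letters + following in known_words:
                suggestions.append((f"{token} {following}", token + letters + following))
                skip_next = True
                break
            # Gap at the start of a word: "nish" -> "finish".
            if letters + token in known_words:
                suggestions.append((token, letters + token))
                break
    return suggestions


print(suggest_repairs("approach the nish line with con dence"))
# [('nish', 'finish'), ('con dence', 'confidence')]
```

Treat the output as a list of places to look, not a list of corrections. With a real dictionary the check will also flag some innocent words, so a human still has the final say.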