Text streams in PDF files
[Editor’s Note: This piece originally appeared in Mark Stephens’ Java PDF blog, which I recommend you visit if you’re interested in the technical side of PDF. Mark is the CEO of IDRsolutions, the company behind JPedal, a 100% Java PDF library. Visit his LinkedIn page for more information.]
Inside a PDF is a PostScript-like stream of commands that describes the page: the commands draw the text, images or shapes. You can extract this stream and look at it directly. It looks like this (I have added comments in brackets after each command to explain):
BT (begin a block of text)
/F13 12 Tf (Choose Font F13 and set size to 12)
288 720 Td (Move the text position relative to where it now is)
(ABC) Tj (Draw the Text ABC)
ET (End the text block)
So far so good, but this code is actually rather deceptive. Most people assume from looking at it that Tj takes a String (ABC), but it does not. The operand actually contains a set of binary index values. These are then decoded using the font's built-in encoding, which can be one of the Standard Encodings (WIN, MAC, EXPERT, etc.) defined in Appendix D of the PDF Reference. For subsetted fonts (where only the characters used in the PDF are included) they could be any arbitrary set of values, which have no meaning until you look them up in the font's custom encoding table (the Differences object).
The operands look like text in the BT ... ET example above, and in the examples in the PDF Reference, because the values for WIN encoding happen to be the same as those of the ASCII characters. So the binary value for A shows up as A if it is WIN encoded.
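As a rough illustration of that lookup, here is a minimal Python sketch. It is not how JPedal or any particular library does it. The tables (BASE, SUBSET, GLYPH_TO_UNICODE) and the function names are hypothetical, and the Differences list is a simplified version of the PDF form, in which a number sets the next code and each glyph name that follows takes the next code in sequence.

```python
# Hypothetical glyph-name -> Unicode table; a real decoder needs a full glyph list.
GLYPH_TO_UNICODE = {"A": "A", "B": "B", "C": "C", "bullet": "\u2022"}


def apply_differences(base, differences):
    """Return a copy of base (code -> glyph name) with the overrides applied."""
    table = dict(base)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item  # a number sets the code for the next glyph name
        else:
            if code is None:
                raise ValueError("Differences must start with a code number")
            table[code] = item  # a name is assigned to the current code
            code += 1
    return table


def decode(operand, table):
    """Map each byte of a Tj operand to a character; unknown codes become U+FFFD."""
    return "".join(
        GLYPH_TO_UNICODE.get(table.get(byte, ""), "\ufffd") for byte in operand
    )


# Simplified stand-in for a standard encoding: codes 65..67 are A, B, C.
BASE = {65: "A", 66: "B", 67: "C"}

# Hypothetical subsetted font: code 1 is C, 2 is A, 3 is B, 4 is a bullet.
SUBSET = apply_differences({}, [1, "C", "A", "B", "bullet"])

print(decode(b"ABC", BASE))             # the bytes happen to read as text
print(decode(b"\x02\x03\x01", SUBSET))  # unreadable bytes, same visible text
print(ascii(decode(b"ABC", SUBSET)))    # text-looking bytes, no meaning here
```

With the subset table, the bytes 02 03 01 draw the text ABC, while the bytes for the letters ABC map to nothing at all.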
However, they are not actually text values and should not be treated as such unless you can guarantee that the only PDF files you look at will be WIN encoded. Otherwise you will get a very nasty surprise on some PDF files…
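The surprise is easy to reproduce. The sketch below assumes that Python's cp1252 and mac_roman codecs are close enough stand-ins for the WIN and MAC encodings to make the point. They are not taken from the PDF Reference tables, and the decode_operand helper is hypothetical.

```python
# Assumption: these Python codecs only approximate the PDF WIN and MAC encodings.
CODECS = {"WIN": "cp1252", "MAC": "mac_roman"}


def decode_operand(operand: bytes, encoding: str) -> str:
    """Decode the raw bytes of a Tj operand with the named stand-in encoding."""
    return operand.decode(CODECS[encoding])


ascii_range = b"ABC"  # codes below 128: both encodings agree with ASCII
high_byte = b"\x80"   # a code above 127: the encodings disagree

print(decode_operand(ascii_range, "WIN"), decode_operand(ascii_range, "MAC"))

# ascii() keeps the output printable on any console.
# 0x80 is the Euro sign in cp1252 and A-dieresis in mac_roman.
print(ascii(decode_operand(high_byte, "WIN")), ascii(decode_operand(high_byte, "MAC")))
```

The same byte yields two different characters depending on the encoding, so treating the operand as a plain string only works while every code stays in the ASCII range and the font really is WIN encoded.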