PDF mystery – what is the correct value for a Text Field?
[Editor’s Note: This piece originally appeared on Mark Stephens’ Java PDF blog, which I recommend you visit if you’re interested in the technical side of PDF. Mark is the CEO of IDRsolutions, the company behind JPedal, a 100% Java PDF library. Visit his LinkedIn page for more information.]
I came across an interesting issue with PDF text fields while debugging a file this week. We were sent a two-page document created with iText that contained some text fields. Our viewer displayed the text fields on both pages with identical values, but they appear different in Acrobat. Obviously Acrobat is always right (even when it disagrees with the PDF specification), so we dug deeper to see what was going on…
With PDF forms, form objects can share a common Parent object and inherit values from it. So if a text field does not have a text value of its own, it can inherit its Parent’s value. This is really useful because you can avoid having to repeat common values.
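As a minimal sketch of that inheritance rule, the lookup below walks up the Parent chain until it finds a field that defines the requested key. The `FormField` and `InheritedValue` classes are hypothetical in-memory stand-ins invented for this illustration, not JPedal or iText types. The key `V` mirrors the PDF dictionary entry that holds a field's value.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a PDF form field dictionary (not a real library class). */
class FormField {
    final Map<String, Object> entries = new HashMap<>();
    final FormField parent; // models the /Parent entry; null for a root field

    FormField(FormField parent) {
        this.parent = parent;
    }

    FormField put(String key, Object value) {
        entries.put(key, value);
        return this;
    }
}

class InheritedValue {
    /** Returns the nearest definition of key, starting at the field and moving up through its Parents. */
    static Object resolve(FormField field, String key) {
        for (FormField f = field; f != null; f = f.parent) {
            if (f.entries.containsKey(key)) {
                return f.entries.get(key);
            }
        }
        return null; // nobody in the chain defines the key
    }
}
```

With a Parent that defines `V` and two children that do not, `InheritedValue.resolve(child, "V")` returns the Parent's value for both children, which is the situation described in this file.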
In this PDF, the Text fields on both pages shared the same Parent and because they had no text values, we were inheriting the value from the Parent. So our viewer displayed the same text value on both pages. However, form objects can also have an Appearance Stream which defines the display of the form object. This is what accounts for the different appearance.
So I found out that it is “allowed” to have two form fields with different Appearance Streams and a single Parent that defines the text value for both. They share the same text value, but their appearance is different.
So either the appearance overrides the text value in read-only text fields, or the child is more important than the Parent in defining the display of the form. Either way, in this example the Appearance Streams take precedence over the text value of the form object.
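The precedence described here can be sketched as a small decision rule: show what the widget's own Appearance Stream draws if it has one, and fall back to the (possibly inherited) text value only when it does not. `TextWidget` is a hypothetical class made up for this example, and `appearanceText` assumes the text drawn by the Appearance Stream has already been decoded somehow. It illustrates the observed behaviour and is not the actual JPedal or Acrobat logic.

```java
import java.util.Objects;

/** Hypothetical model of one text-field widget on a page. */
class TextWidget {
    final String textValue;      // value resolved from the field or its Parent
    final String appearanceText; // text drawn by the widget's Appearance Stream, or null if it has none

    TextWidget(String textValue, String appearanceText) {
        this.textValue = textValue;
        this.appearanceText = appearanceText;
    }

    /** What the user sees: the appearance wins whenever one is present. */
    String displayedText() {
        return appearanceText != null ? appearanceText : textValue;
    }

    /** True when software reading the text value would not get what the user sees. */
    boolean valueDiffersFromDisplay() {
        return appearanceText != null && !Objects.equals(appearanceText, textValue);
    }
}
```

Two widgets built with the same `textValue` but different `appearanceText` display different text, and `valueDiffersFromDisplay()` flags at least one of them, which is one way a tool could warn that the file's values and its rendering disagree.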
It’s not an ideal way to work, because any software reading the text value for the form will not get the value that the user sees. For reading text values, the file is essentially broken. But our viewer now displays it as Adobe would (which is all most users care about at the end of the day).
We are working on a way to generate a readable string from the appearance stream so that we can make this file more useful for text extraction. Keep watching this space.
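To give a feel for what that involves, here is a deliberately naive sketch and not the approach we are building. It assumes the appearance stream has already been decompressed into a string, that the text is drawn with literal strings followed by the `Tj` operator, and that the font uses an encoding where the bytes are directly readable. The class name `AppearanceText` is made up for this example. Real streams can also use `TJ` arrays, hex strings, nested parentheses and custom encodings, none of which are handled here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppearanceText {
    // A literal string (allowing backslash escapes) followed by the Tj show-text operator.
    private static final Pattern SHOW_TEXT =
            Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)\\s*Tj");

    /** Concatenates the text of every "(...) Tj" found in a decoded content stream. */
    static String extract(String contentStream) {
        StringBuilder out = new StringBuilder();
        Matcher m = SHOW_TEXT.matcher(contentStream);
        while (m.find()) {
            // Unescape \( \) and \\ only; octal and other escapes are left as-is in this sketch.
            out.append(m.group(1).replaceAll("\\\\([()\\\\])", "$1"));
        }
        return out.toString();
    }
}
```

Given a stream such as `/Tx BMC BT /Helv 12 Tf 2 4 Td (Page one) Tj ET EMC`, `AppearanceText.extract` returns `Page one`.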
So that is another mystery solved for me, and yet another way to interpret the spec. Have you come across any interesting and mysterious PDF files where things are not as they should be?