To: I1002812--IBMMAIL NFB R&D Committee
FROM: Steve Jacobson - IT Order Proc. Mktg. and Dist.
3M Company - 555-01-03 Phone: (612) 733-9780
St. Paul, MN 55144 FAX: (612) 736-6037
Subject: PDF Document Format
In reading the notes from Dr. Raman regarding the PDF Document format
and the conversion of Government documents, I find that I am not
entirely clear on a couple of points. In fact, it seems to me that there
may be two issues here rather than one.
First, what is meant by structure? From Dr. Raman's first note, I take
"structure" to mean information about the structure of the data that is
in addition to the data itself. If my guess is correct, a simple ASCII
version of any given form will also not contain structure information.
For example, IRS tax forms display data in a number of ways, as do most
forms. Most data is displayed on lines with a label at the left and a
value at the right. However, there can be more than one set of labels
and values on one physical line. These forms also contain check boxes
and information boxes where information such as an employer's
name and address occupies multiple lines with a descriptive
label at the top. If one looks at an ASCII version of this form, it is
at the very least confusing. To fill out such a form in ASCII would be
tedious at best. However, it could probably be done, and an ASCII form
could be imported into a form filler program where, with much work, it
could be filled in. This is precisely part of what is done by income tax
software. However, from my experience this year, tax software screens
can be as bad as the forms. Yet, to my knowledge, we have never said
that accessibility to documents must include reformatting them into a
more useful format. Further, I am not certain that we should, since
producers of documents won't know what is useful. As I see it, none of
the foregoing has anything to do with PDF document format, since the
kind of structure information that describes a "check box" is not part
of a standard ASCII representation either.
Then, what about PDF? I don't know enough about it to know how it
compares with other systems with respect to extracting document contents
in ASCII. Does Dr. Raman know of other systems? I would be interested in
gaining a better understanding of how they differ. We should not reject
the possibility, though, that a PDF document converted into, say,
WordPerfect might provide more information about the original document
than would an ASCII file. For example, knowing what was written in a
bold font might indicate the level of importance. This is the kind of
thing that a package like Word for Word can accomplish, but it still
won't make IRS forms all that useable.
To really judge whether a given approach to document exchange is
accessible to us, we need to define two things. What do we want or need
to know about the document appearance, and how much structure
information should we expect to get? The PDF document format may
determine what we can learn about a document's appearance, but the
structure issue goes beyond PDF. Even the proposals of ICAD may not
permit easy completion of an IRS form.
Having said all of this, I would be interested in knowing whether there
is a system, to Dr. Raman's knowledge, that would convey all of the
information necessary to fill out an IRS form. What recommendations
would he make for the storage of government information? I am not asking
these questions to defend PDF or to confront Dr. Raman. Rather, I want to
know what options there are and what we should be encouraging the
government to do.
PDF may not be the problem, at least not the entire problem.
Regards,
Steve Jacobson
IBMMAIL: USMMMXBL
INTERNET: USMMMXBL@IBMMAIL.COM