
Create PDFs from DOC, not DOCX Files

I learned a lesson tonight while trying to submit a script to a production company: PDFs generated from DOC files are much, much smaller than PDFs generated from DOCX files.

Microsoft Word migrated from the familiar ".DOC" format of Word 97-2004 to DOCX with the release of Word 2007/2008 (Windows/OS X). I recall the painful transition from Word 95 to Word 97, but nothing has compared to the nightmare that is the DOCX "Office Open XML" file format. I appreciate the idea of XML-based documents. Unfortunately, Microsoft's DOCX seems to cause a fair amount of pain.

The 101-page script stored as a DOCX refused to convert to a compressed and optimized PDF with Acrobat Distiller, Acrobat Pro, or Apple's built-in PDF driver. This left me able to create only an uncompressed PDF. The file was 62 megabytes! A 184KB document exploded to 62MB, and it couldn't be emailed through our server.

When I saved the document as a DOC file, it grew to 214KB, a bit larger than the DOCX. However, the PDF generated from the DOC was only 800KB. Not that 800KB is great, but it is much better than megabytes of bloat.
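To put those numbers side by side, here is a small arithmetic sketch. It assumes decimal units (1MB = 1000KB), and the function name is only illustrative. With the sizes quoted above, the DOCX route comes out at roughly 337 times the source size, while the DOC route stays under 4 times.

```python
KB_PER_MB = 1000  # assumption: decimal units, not 1024


def growth_factor(source_kb: float, pdf_kb: float) -> float:
    """Return how many times larger the PDF is than the file it came from."""
    return pdf_kb / source_kb


# Sizes quoted in the post.
docx_factor = growth_factor(184, 62 * KB_PER_MB)  # 184KB DOCX -> 62MB PDF
doc_factor = growth_factor(214, 800)              # 214KB DOC  -> 800KB PDF

print(f"DOCX to PDF: about {docx_factor:.0f} times larger")
print(f"DOC to PDF: about {doc_factor:.1f} times larger")
```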

I often tell my students to save documents in DOC format, instead of DOCX, if they intend to email a document. I never considered that the DOC/DOCX differences would affect PDF output.

In trying to "help" the layout, Microsoft's DOCX format includes a lot of redundant font and layout information. Although I didn't have any graphics in my script, the DOCX format also links to higher-resolution images than the DOC format supports. I examined the PDF output from Word 2011 (OS X) and discovered nearly 100 font "embed" occurrences. The problem is that Word styles are assigned multiple times, for no apparent reason.
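One rough way to repeat that kind of check is to count the font-program references in the raw PDF bytes. This is only a sketch, not the method I used: it looks for the /FontFile, /FontFile2, and /FontFile3 keys that PDF font descriptors use to point at embedded font data, and it will undercount if the PDF stores its dictionaries inside compressed object streams. The function name is made up for illustration.

```python
import re

# Keys a PDF font descriptor uses to reference an embedded font program.
FONT_FILE_KEY = re.compile(rb"/FontFile[23]?(?![A-Za-z0-9])")


def count_font_embeds(pdf_bytes: bytes) -> int:
    """Count /FontFile, /FontFile2 and /FontFile3 references in raw PDF data.

    Assumption: the dictionaries are stored uncompressed, as they would be
    in an uncompressed PDF. Compressed object streams hide these keys.
    """
    return len(FONT_FILE_KEY.findall(pdf_bytes))
```

To try it on a real file, read the PDF in binary mode and pass the bytes in, for example count_font_embeds(open("script.pdf", "rb").read()), where script.pdf is a placeholder name.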

My script template uses six major paragraph styles. In DOC, HTML, or RTF files, the styles would be defined once, at the top of the document. But, that's not the DOCX way.

You might imagine "Character Name" would be a single style assigned to every paragraph that marks when a character speaks. But, no, Microsoft's DOCX included two dozen "Character Name" styles, each assigned to a varying number of paragraphs. It makes no sense at all to me. During the PDF creation, it seems fonts are embedded repeatedly with the styles. I'd have to do some forensic work to discover what is happening in greater detail.
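That forensic work could start with something like the sketch below. A DOCX file is a ZIP archive, with style definitions in word/styles.xml and the text in word/document.xml, so Python's standard library is enough to list the paragraph styles and count how many paragraphs use each one. The function names and the "(no explicit style)" label are my own inventions for illustration, and this does not explain the font embedding; it only shows how many near-duplicate styles exist.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace used by word/document.xml and word/styles.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_style_definitions(docx):
    """Map each paragraph style ID to its display name, from word/styles.xml."""
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/styles.xml"))
    definitions = {}
    for style in root.iter(W + "style"):
        if style.get(W + "type") != "paragraph":
            continue  # skip character, table and numbering styles
        name = style.find(W + "name")
        display = name.get(W + "val") if name is not None else ""
        definitions[style.get(W + "styleId")] = display
    return definitions


def paragraph_style_usage(docx):
    """Count how many paragraphs in word/document.xml use each style ID."""
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    usage = Counter()
    for paragraph in root.iter(W + "p"):
        p_style = paragraph.find(W + "pPr/" + W + "pStyle")
        if p_style is None:
            usage["(no explicit style)"] += 1
        else:
            usage[p_style.get(W + "val")] += 1
    return usage
```

Calling paragraph_style_usage("script.docx").most_common() on a copy of the script (script.docx is a placeholder name) would list the style IDs from most to least used, which should make two dozen variations of one style easy to spot.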

No matter what the cause, the best way to create a PDF from Word appears to be saving a document as a "DOC" file first.

I get that hard drives are cheap and broadband is fast, but that's no defense for lousy file formats. More is not always better, as Microsoft's bloated file formats constantly demonstrate. Unfortunately, Microsoft's bloat adds to Adobe's bloat.
