Create PDFs from DOC, not DOCX Files

I learned a lesson tonight while trying to submit a script to a production company: PDFs from DOC files are much, much smaller than PDFs generated from DOCX files.

Microsoft Word moved away from the familiar ".DOC" format of Word 97-2004 with the release of Word 2007/2008 (Windows/OS X). I recall the painful transition from Word 95 to Word 97, but nothing has compared to the nightmare that is the DOCX "Office Open XML" file format. I appreciate the idea of XML-based documents. Unfortunately, Microsoft's DOCX seems to cause a fair amount of pain.

The 101-page script stored as a DOCX refused to convert to a compressed and optimized PDF with Acrobat Distiller, Acrobat Pro, or Apple's built-in PDF driver. This left me able to create only an uncompressed PDF. The file was 62 megabytes! A 184 kilobyte document exploded to 62MB… and it couldn't be emailed through our server.

Saved as a DOC file, the document grew to 214KB, a bit larger than the DOCX. However, the PDF generated from it was only 800KB. Not that 800KB is great, but it is much better than megabytes of bloat.

I often tell my students to save documents in DOC format, instead of DOCX, if they intend to email a document. I never considered that the DOC/DOCX differences would affect PDF output.

In trying to "help" the layout, Microsoft's DOCX format includes a lot of redundant font and layout information. Although I didn't have any graphics in my script, the DOCX format also links to higher-resolution images than the DOC format supports. I examined the PDF output from Word 2011 (OS X) and discovered nearly 100 font "embed" occurrences. The problem is that Word styles are assigned multiple times, for no apparent reason.
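For anyone who wants to repeat that kind of count, here is a rough sketch in Python. It is not the method I used, just one plausible way to do it: it scans the raw bytes of a PDF for font program references (/FontFile, /FontFile2, /FontFile3) and tallies the /BaseFont names. The function name and the file name "script.pdf" are made up for illustration. It also assumes the PDF's objects are stored uncompressed, as in the 62MB file described above; a PDF that packs its objects into compressed streams would show far fewer hits than it really contains.

```python
import re
from collections import Counter

# Matches references to embedded font programs: /FontFile, /FontFile2, /FontFile3.
FONT_FILE_RE = re.compile(rb"/FontFile[23]?\b")
# Captures the font name that follows a /BaseFont key.
BASE_FONT_RE = re.compile(rb"/BaseFont\s*/([^\s/<>\[\]()]+)")


def count_font_entries(pdf_bytes):
    """Return (number of font program references, Counter of /BaseFont names).

    Assumption: the PDF objects are not hidden inside compressed object
    streams, so the keys are visible in the raw bytes.
    """
    embeds = len(FONT_FILE_RE.findall(pdf_bytes))
    names = Counter(
        match.decode("latin-1") for match in BASE_FONT_RE.findall(pdf_bytes)
    )
    return embeds, names


if __name__ == "__main__":
    import sys

    # "script.pdf" is a placeholder; pass a real path on the command line.
    path = sys.argv[1] if len(sys.argv) > 1 else "script.pdf"
    with open(path, "rb") as handle:
        embeds, names = count_font_entries(handle.read())
    print("font program references:", embeds)
    for name, count in names.most_common():
        print(f"{count:4d}  {name}")
```

If the same font name shows up dozens of times, each with its own font program reference, that would be consistent with the repeated embedding described here.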

My script template uses six major paragraph styles. In DOC, HTML, or RTF files, the styles would be defined once, at the top of the document. But, that's not the DOCX way.

You might imagine "Character Name" would be a single style that is assigned to all paragraphs that are used to mark when a character speaks. But, no, Microsoft's DOCX included two dozen "Character Name" styles, each assigned to a varying number of paragraphs. It makes no sense at all to me. During the PDF creation, it seems fonts are embedded repeatedly with the styles. I'd have to do some forensic work to discover what is happening in greater detail.
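A starting point for that forensic work might look like the sketch below. A DOCX file is a ZIP archive, with style definitions in word/styles.xml and the body text in word/document.xml. The Python code counts how many style definitions share each display name and how many paragraphs reference each style ID. The function names and the file name "script.docx" are my own placeholders, and I am assuming duplicates would show up either as repeated names or as near-identical names (which you would spot by reading the printed list).

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace used by DOCX parts.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def count_style_definitions(styles_xml):
    """Count <w:style> definitions in styles.xml, grouped by display name."""
    root = ET.fromstring(styles_xml)
    names = Counter()
    for style in root.iter(W + "style"):
        name_el = style.find(W + "name")
        if name_el is not None:
            names[name_el.get(W + "val")] += 1
    return names


def count_paragraph_style_use(document_xml):
    """Count paragraphs in document.xml, grouped by the style ID they reference."""
    root = ET.fromstring(document_xml)
    used = Counter()
    for paragraph in root.iter(W + "p"):
        p_style = paragraph.find(W + "pPr/" + W + "pStyle")
        # Paragraphs with no explicit style fall back to the document default.
        key = p_style.get(W + "val") if p_style is not None else "(default)"
        used[key] += 1
    return used


def inspect_docx(path):
    """Open a DOCX as a ZIP archive and return (definitions, usage) counters."""
    with zipfile.ZipFile(path) as archive:
        definitions = count_style_definitions(archive.read("word/styles.xml"))
        usage = count_paragraph_style_use(archive.read("word/document.xml"))
    return definitions, usage


if __name__ == "__main__":
    import sys

    # "script.docx" is a placeholder; pass a real path on the command line.
    path = sys.argv[1] if len(sys.argv) > 1 else "script.docx"
    definitions, usage = inspect_docx(path)
    print("Style definitions by name:")
    for name, count in sorted(definitions.items()):
        print(f"{count:4d}  {name}")
    print("Paragraphs by style ID:")
    for style_id, count in usage.most_common():
        print(f"{count:4d}  {style_id}")
```

This only reports what is stored in the file. It does not explain why Word wrote the styles that way or how the PDF driver treats them.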

No matter what the cause, the best way to create a PDF from Word appears to be saving a document as a "DOC" file first.

I get that hard drives are cheap and broadband is fast, but that's no defense for lousy file formats. More is not always better, as Microsoft's bloated file formats constantly demonstrate. Unfortunately, Microsoft's bloat adds to Adobe's bloat.


