
Create PDFs from DOC, not DOCX Files

I learned a lesson tonight while trying to submit a script to a production company: PDFs generated from DOC files are much, much smaller than PDFs generated from DOCX files.

Microsoft Word moved away from the familiar ".DOC" format of Word 97-2004 with the release of Word 2007/2008 (Windows/OS X). I recall the painful transition from Word 95 to Word 97, but nothing has compared to the nightmare that is the DOCX "Office Open XML" file format. I appreciate the idea of XML-based documents. Unfortunately, Microsoft's DOCX seems to cause a fair amount of pain.

The 101-page script stored as a DOCX refused to convert to a compressed and optimized PDF with Acrobat Distiller, Acrobat Pro, or Apple's built-in PDF driver. This left me able to create only an uncompressed PDF. The file was 62 megabytes! A 184 kilobyte document exploded to 62MB… and it couldn't be emailed through our server.

When I saved the document as a DOC file, it grew to 214KB, a bit larger than the DOCX. However, the PDF generated from the DOC file was only 800KB. Not that 800KB is great, but it is much better than megabytes of bloat.
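To put those numbers side by side, here is a quick back-of-the-envelope calculation in Python. It uses only the sizes reported above; treating 1MB as 1024KB is my assumption, and the variable names are just for illustration.

```python
# Sizes reported in the post, in kilobytes.
# Assumption: 1 MB = 1024 KB.
KB_PER_MB = 1024

docx_kb = 184                  # script saved as DOCX
docx_pdf_kb = 62 * KB_PER_MB   # uncompressed PDF made from the DOCX
doc_kb = 214                   # same script saved as DOC
doc_pdf_kb = 800               # PDF made from the DOC


def growth(source_kb, pdf_kb):
    """Return how many times larger the PDF is than its source file."""
    return pdf_kb / source_kb


docx_growth = growth(docx_kb, docx_pdf_kb)
doc_growth = growth(doc_kb, doc_pdf_kb)
pdf_ratio = docx_pdf_kb / doc_pdf_kb

print(f"DOCX -> PDF: about {docx_growth:.0f}x larger")
print(f"DOC  -> PDF: about {doc_growth:.1f}x larger")
print(f"DOCX PDF vs. DOC PDF: about {pdf_ratio:.0f}x")
```

Under that assumption, the DOCX route inflates the file roughly 345 times, while the DOC route inflates it a little under four times.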

I often tell my students to save documents in DOC format, instead of DOCX, if they intend to email a document. I never considered that the DOC/DOCX differences would affect PDF output.

In trying to "help" the layout, Microsoft's DOCX format includes a lot of redundant font and layout information. Although I didn't have any graphics in my script, the DOCX format also links to higher-resolution images than the DOC format supports. I examined the PDF output from Word 2011 (OS X) and discovered nearly 100 font "embed" occurrences. The problem is that Word styles are assigned multiple times, for no apparent reason.
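For anyone who wants to make a similar count, here is a minimal Python sketch. It is not necessarily how I arrived at my number: it simply scans the raw bytes of a PDF for the /FontFile, /FontFile2, and /FontFile3 keys that a font descriptor uses to point at an embedded font program. It assumes the PDF is uncompressed, as mine was; if the objects are packed into compressed streams, the keys will not be visible to a plain byte search. The function name is hypothetical.

```python
import re

# Keys a PDF font descriptor uses to reference an embedded font program.
FONT_FILE_KEY = re.compile(rb"/FontFile[23]?\b")


def count_font_file_keys(pdf_bytes):
    """Count /FontFile, /FontFile2, and /FontFile3 keys in raw PDF bytes.

    Rough estimate only: keys hidden inside compressed object streams
    are not seen by this byte-level search.
    """
    return len(FONT_FILE_KEY.findall(pdf_bytes))
```

To try it on a real file, read the bytes first, for example with pathlib.Path("script.pdf").read_bytes(), where "script.pdf" is a placeholder file name.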

My script template uses six major paragraph styles. In DOC, HTML, or RTF files, the styles would be defined once, at the top of the document. But, that's not the DOCX way.

You might imagine "Character Name" would be a single style assigned to every paragraph that marks when a character speaks. But, no, Microsoft's DOCX included two dozen "Character Name" styles, each assigned to a varying number of paragraphs. It makes no sense at all to me. During the PDF creation, it seems fonts are embedded repeatedly with the styles. I'd have to do some forensic work to discover what is happening in greater detail.
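A first step in that forensic work could look something like the Python sketch below. A DOCX file is a ZIP archive of XML parts, so the standard library is enough to open it. The sketch counts how many style definitions share each display name in word/styles.xml and how many paragraphs in word/document.xml reference each style ID. The function name is hypothetical, and this only inspects the DOCX itself; it does not show what Word does during PDF creation.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace used by word/styles.xml and word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def style_report(docx_file):
    """Count style definitions by name and paragraph uses by style ID.

    docx_file may be a path or a binary file-like object.
    Returns (definitions, uses) as two Counter objects.
    """
    with zipfile.ZipFile(docx_file) as archive:
        styles = ET.fromstring(archive.read("word/styles.xml"))
        body = ET.fromstring(archive.read("word/document.xml"))

    # How many <w:style> definitions carry each display name.
    definitions = Counter()
    for style in styles.findall("w:style", NS):
        name = style.find("w:name", NS)
        if name is not None:
            definitions[name.get(f"{{{W}}}val")] += 1

    # How many paragraphs point at each style ID via <w:pStyle>.
    uses = Counter(
        p_style.get(f"{{{W}}}val") for p_style in body.iter(f"{{{W}}}pStyle")
    )
    return definitions, uses
```

Calling style_report("script.docx") on a copy of the script (the file name is a placeholder) would show whether "Character Name" really is defined more than once in the file, or whether the duplication only appears later, when the PDF is generated.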

No matter what the cause, the best way to create a PDF from Word appears to be saving a document as a "DOC" file first.

I get that hard drives are cheap and broadband is fast, but that's no defense for lousy file formats. More is not always better, as Microsoft's bloated file formats constantly demonstrate. Unfortunately, Microsoft's bloat adds to Adobe's bloat.
