
Create PDFs from DOC, not DOCX Files

We learned a lesson tonight when I was trying to submit a script to a production company: PDFs from DOC files are much, much smaller than PDFs generated from DOCX files.

Microsoft Word migrated away from the familiar ".DOC" format of Word 97-2004 with the release of Word 2007/2008 (Windows/OS X). I recall the painful transition from Word 95 to Word 97, but nothing has compared to the nightmare that is the DOCX "Office Open XML" file format. I appreciate the idea of XML-based documents. Unfortunately, Microsoft's DOCX seems to cause a fair amount of pain.

The 101-page script stored as a DOCX refused to convert to a compressed and optimized PDF with Acrobat Distiller, Acrobat Pro, or Apple's built-in PDF driver. This left me able to create only an uncompressed PDF. The file was 62 megabytes! A 184-kilobyte document exploded to 62MB… and it couldn't be emailed through our server.

Saved as a DOC file, the document grew to 214KB, a bit larger than the DOCX. However, the PDF generated from the DOC file was only 800KB. Not that 800KB is great, but it is much better than megabytes of bloat.

I often tell my students to save documents in DOC format, instead of DOCX, if they intend to email a document. I never considered that the DOC/DOCX differences would affect PDF output.

In trying to "help" the layout, Microsoft's DOCX format includes a lot of redundant font and layout information. Although I didn't have any graphics in my script, the DOCX format also links to higher resolution images than the DOC format supports. I examined the PDF output from Word 2011 (OS X) and discovered nearly 100 font "embed" occurrences. The problem is that Word styles are assigned multiple times, for no apparent reason.
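For anyone who wants to repeat that font count, here is a rough Python sketch. It is not necessarily how I counted, and it rests on an assumption: the PDF is uncompressed (as my 62MB file was), so the font descriptor keys /FontFile, /FontFile2, and /FontFile3 appear as plain text in the file. The function name count_font_embeds is made up for this example.

```python
import re


def count_font_embeds(pdf_bytes):
    """Count /FontFile, /FontFile2 and /FontFile3 keys in raw PDF bytes.

    Assumption: the PDF is uncompressed, so font descriptor dictionaries
    are visible as plain text. Compressed object streams would hide them.
    """
    return len(re.findall(rb"/FontFile[23]?\b", pdf_bytes))


if __name__ == "__main__":
    import sys

    # Usage: python count_fonts.py script.pdf
    with open(sys.argv[1], "rb") as f:
        print(count_font_embeds(f.read()))
```

A count far above the number of typefaces actually used in the document would suggest the same fonts are being embedded over and over.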

My script template uses six major paragraph styles. In DOC, HTML, or RTF files, the styles would be defined once, at the top of the document. But, that's not the DOCX way.

You might imagine "Character Name" would be a single style that is assigned to all paragraphs that mark when a character speaks. But, no, Microsoft's DOCX included two dozen "Character Name" styles, each assigned to a varying number of paragraphs. It makes no sense at all to me. During the PDF creation, it seems fonts are embedded repeatedly with the styles. I'd have to do some forensic work to discover what is happening in greater detail.
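A first pass at that forensic work could look something like the Python sketch below. A DOCX file is a ZIP archive: paragraph styles are defined in word/styles.xml and applied to paragraphs in word/document.xml. The function name style_usage is hypothetical, and the script only lists what is defined and how often each style ID is applied. It does not explain why Word creates the duplicates.

```python
import zipfile
from collections import Counter
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside DOCX files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def style_usage(docx_path):
    """Return (defined, usage) for paragraph styles in a DOCX file.

    defined maps each paragraph style ID to its display name.
    usage counts how many paragraphs reference each style ID.
    """
    with zipfile.ZipFile(docx_path) as z:
        styles_root = ET.fromstring(z.read("word/styles.xml"))
        doc_root = ET.fromstring(z.read("word/document.xml"))

    defined = {}
    for style in styles_root.findall("w:style", NS):
        if style.get(f"{{{W}}}type") != "paragraph":
            continue
        name = style.find("w:name", NS)
        defined[style.get(f"{{{W}}}styleId")] = (
            name.get(f"{{{W}}}val") if name is not None else None
        )

    # Each <w:pStyle w:val="..."/> marks one paragraph using that style.
    usage = Counter(
        p.get(f"{{{W}}}val") for p in doc_root.iter(f"{{{W}}}pStyle")
    )
    return defined, usage


if __name__ == "__main__":
    import sys

    # Usage: python style_usage.py script.docx
    defined, usage = style_usage(sys.argv[1])
    for style_id, name in sorted(defined.items()):
        print(f"{style_id}\t{name}\t{usage.get(style_id, 0)} paragraphs")
```

If two dozen rows share nearly the same name, such as variations on "Character Name", that would confirm the duplication described above.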

No matter what the cause, the best way to create a PDF from Word appears to be saving a document as a "DOC" file first.

I get that hard drives are cheap and broadband is fast, but that's no defense for lousy file formats. More is not always better, as Microsoft's bloated file formats constantly demonstrate. Unfortunately, Microsoft's bloat adds to Adobe's bloat.

