Sunday, June 28, 2009

Importing Microsoft Word Documents Into Blogger

I have a lot of old papers that I decided to post here on my blog. They are all Word 2003 documents, and many of them have quite a bit of formatting that I did not want to replicate by hand in HTML. So I tried a bunch of weird tricks and figured out a system for importing Word documents that preserves the formatting in the post without messing up the CSS styles in the rest of the blog page. The procedure is fairly convoluted. However, once you have tried it a couple of times it goes quickly, at least compared to reformatting a 20-page research paper by hand.

One of the steps in this procedure is made a little easier if you have Adobe DreamWeaver. I use CS3 but I don’t know for sure which older versions have the features you will need. It is not absolutely necessary and I will explain a workaround at the appropriate place. Also, I use Microsoft Windows XP. Part of this procedure depends on the behavior of the Windows clipboard. I cannot guarantee that it will work in Mac OS or Linux. However, some of the CSS editing tricks will work for anyone.

Basic Procedure

  1. Copy and paste the document from Word into the Blogger editor’s Compose tab.
  2. Isolate the imported Word document by surrounding its HTML code with a <span class="UniqueName"> and </span> set of tags in the Blogger editor’s HTML tab.
  3. Fix the Word CSS stylesheet by inserting "span.UniqueName " before all of the CSS selectors (without the quotes but with the trailing space), thereby creating descendant selectors that will only apply within the imported Word document and will not affect the formatting of the rest of your blog.
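
As a rough sketch of what steps 2 and 3 produce, assume Word emitted bare p and h1 rules. The class name UniqueName, the property values, and the sample text are placeholders, not taken from a real document:

```html
<span class="UniqueName">
  <style>
    /* Word originally wrote these as plain "p { ... }" and "h1 { ... }" rules. */
    span.UniqueName p  { margin: 0 0 12pt 0; font-family: "Times New Roman", serif; }
    span.UniqueName h1 { font-size: 16pt; font-weight: bold; }
  </style>

  <h1>Title of the Imported Paper</h1>
  <p>First paragraph of the imported Word document.</p>
</span>
```

Paragraphs and headings outside the span keep whatever the Blogger template gives them. A span is an inline element, so a strict validator may object to paragraphs nested inside it, though browsers generally render the result anyway.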

Detailed Procedure

  1. Open the Word document.
  2. Open the Blogger editor to the HTML tab.
  3. If you want to type some introductory text explaining what the paper is about or why you wrote it, then do it in the Blogger editor now. Surround each paragraph with <p></p> tags.
    • If I do this step then I also put a horizontal rule <hr/> under my initial comments and before the actual paper or Word document I intend to import.
  4. Prepare the Word Document:
    • Often academic papers will have been formatted with double line spacing. Usually this does not read well online so you will want to remove it. If the spacing is set in the styles then you should edit the styles rather than simply selecting all of the text and setting it to single line spacing. The latter results in Word placing a style code in each and every major tag in the document.
    • Text boxes get exported as pictures, so if there are any of these in your document then you need to pull the text out of the text box and place it where it is appropriate within the main body of the document.
    • Remember, all you have to do is not save the modified Word document and these changes will not hurt anything.
  5. Select all of the text in the Word document that you want in your blog post and copy it.
  6. Go to the Blogger post editor and switch to the Compose tab.
  7. Place the cursor under any introductory text you may have entered in Step 3 and paste the Word document there.
    • When you do this paste, either Word or the Windows Clipboard will automatically convert the Word formatting to CSS styles and assign HTML element class names to each of the paragraphs, which will be enclosed in <p> tags.
    • If you now switch to the Blogger editor’s HTML tab you will be able to see quite a lot of extraneous HTML elements. Some we will keep and some we will either get rid of or ignore. Part of this mess is the CSS stylesheet for this particular document. You will notice that it is now contained in what will be the body of your final blog web page rather than within the <head> tag as is normally the case. This is apparently OK. It is also what will allow your post to keep the same formatting as the original document.
    • If you were to publish your post at this point it is entirely possible that some of the CSS styles listed in Word’s CSS stylesheet would conflict with the CSS stylesheet in your Blogger template. The next steps will rectify that situation.
  8. Clean up extraneous HTML tags.
    • If you have DreamWeaver you can use it to modify the HTML that resulted from the previous paste.
      1. Copy all the HTML from the Blogger editor HTML tab and paste it into a new DreamWeaver HTML document between the <body> and </body> tags in the code editor (NOT the design view).
      2. In DreamWeaver choose { Commands / Clean Up Word HTML… ; <basic> } and select the following check boxes:
        • Remove all Word specific markup
        • Clean up <font> tags
        • Fix invalidly nested tags
        • Apply source formatting
        • (Do NOT select “Clean up CSS.”)
      3. Click [OK].
      4. Select all of the HTML code between the <body> and </body> tags in the DreamWeaver code editor.
      5. Copy that and paste it in place of ALL the HTML code that is in the Blogger editor’s HTML tab.
    • If you do not have DreamWeaver you can safely ignore the extraneous HTML tags if you want. Or you can edit out whatever bits that Blogger complains about when you try to post later. I have found that anything between and tags can go. Also all of the meta tags can go. This is what Blogger will complain about the most.
  9. Fix the CSS so that Word’s CSS styles do not conflict with Blogger’s CSS styles.
    • You may not even have to do anything for this step. Most of the HTML element class names and associated CSS selectors created by Word have unique names that are very unlikely to conflict with the CSS in your Blogger template. In fact, if all of the styles in your Word document were ones you created yourself then it is almost certain that none of their names will conflict. However, if you simply used the Normal style or any of the heading styles without creating your own, uniquely named, version of them then you will probably have problems. This is because Word sometimes uses a simple p, h1, or h2 element selector in the CSS styles. Those simple selectors match every paragraph or heading on the page, not just the ones in your post. Most Blogger templates style their text through <div> tags with specific id attributes and let the paragraphs inside inherit that styling, so a p or h1 rule from the Word CSS stylesheet, which applies to those elements directly, overrides what they would otherwise have inherited.
    • There is an easy way to check for these problems. Simply publish your post then look at your blog. Check to see if any of the formatting of the text outside of your post has changed. Check both on your blog’s home page and on the post’s individual page. If nothing is wrong then you are done.
    • If some of the CSS styles have conflicted then follow these steps:
      1. Go to the Blogger editor’s HTML tab and surround all of the HTML code in the post with a <span class="UniqueName"> and </span> set of tags.
        • This sets the post off with its own unique class that will apply to the entire imported Word document but not to any of the rest of the blog page or any of your other posts. If you entered introductory text in Step 3 then you may want to place the first <span> tag just after that text and the optional <hr/> tag. This will cause that introductory text to retain the same styling as the rest of your blog, thereby visually setting it apart from the text imported from Word.
        • You must use a unique class name for each of your posts so that one post won't interfere with another.
      2. Look for Word’s CSS stylesheet near the top of the post.
        • If you were able to use DreamWeaver to clean up the HTML then it will be right at the top of your HTML code (or right under your introductory text from Step 3). If you are choosing to just ignore the extraneous HTML then the part you want may be buried a little bit from the top. It will be between a <style> and </style> tag and NOT within any and tags.
        • You will notice that this CSS stylesheet is not formatted nicely at all. All the CSS rules are just strung together on one line of text.
      3. Insert "span.UniqueName " before each of the CSS selectors.
        • Do not insert the quotes but make sure to include the space between the inserted text and the existing selector.
        • This creates what is called a descendant selector. The simple selector will then only apply to elements buried somewhere within the specified ancestor element. In this case, the simple element selector will only apply to elements that are inside our <span class="UniqueName"> and </span> set of tags, which means only within our imported Word document.
        • Remember to insert the "span.UniqueName " before each selector in grouped selectors separately or it will only apply to the first one. (Grouped selectors are a list of selectors separated by commas.)
        • Word puts HTML comments around its CSS stylesheet. In DreamWeaver this makes it all light gray. It is safe to remove the HTML comments from around the CSS rules within the <style> tags. This allows DreamWeaver to use its syntax highlighting which makes it much easier to find all of the CSS selectors.
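
Here is a sketch of how a finished post might look in Blogger's HTML tab after the steps above, with the HTML comments around the stylesheet already removed. The class name Paper2009, the selectors, and the property values are illustrative assumptions; your Word document will produce its own set of rules:

```html
<p>A few words of my own about why I wrote this paper.</p>
<hr/>

<span class="Paper2009">
<style>
/* Before: p.MsoNormal, li.MsoNormal, div.MsoNormal { ... }
   Each selector in the comma-separated group gets its own prefix. */
span.Paper2009 p.MsoNormal,
span.Paper2009 li.MsoNormal,
span.Paper2009 div.MsoNormal { margin: 0in; font-size: 12pt; }

/* Before: h1 { ... } */
span.Paper2009 h1 { font-size: 16pt; }
</style>

<p class="MsoNormal">Text imported from Word.</p>
</span>
```

If only the first selector in the group were prefixed, the li.MsoNormal and div.MsoNormal parts would still match elements anywhere on the blog page.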

Pictures

If your Word document has any pictures in it then you will have to do “a few” additional steps. Again, it goes pretty quickly once you get used to it.

  1. In Word { File / Save as Web Page… ; File name: = “whatever.htm” ; Save as type: = “Web Page (*.htm, *.html)”[v] ; [OK] }.
    • This is just to get the pictures out of the file quickly and easily. You should not use the resulting HTML code to paste into Blogger. For some reason copying and pasting from Word into Blogger’s Compose tab cleans up a few things that are difficult to clean up any other way. (At least with my limited knowledge.)
    • Word will create a folder and stick all the files in it.
    • You can just save this to the desktop because you will be deleting it when you are through.
  2. Go through the image files that Word created and rename them to something more meaningful.
  3. Upload all these images to your preferred file or image hosting service. You could just use Picasa if you want.
  4. Go through all the HTML that was the result of pasting the document into Blogger and edit the <img> tags one by one.
    1. Locate the uploaded image on your image hosting service.
    2. Right click on it and choose “Copy Image Location” or some similar menu item.
      • You want to copy the image’s URL to the clipboard. Sometimes copying the URL from the browser’s address bar does not give the correct URL. Always get it by right clicking on the image itself.
    3. In the appropriate <img> tag select all of the text between the quotes for the src attribute and paste the new URL there.
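
The edit to each <img> tag might look something like the sketch below. Both src values are made up for illustration: the exact value Word leaves behind will vary, and example.com stands in for whatever image host you use:

```html
<!-- Before: src still refers to a file that only exists on your own machine -->
<img src="whatever_files/image001.jpg" width="400" height="300" alt="Figure 1">

<!-- After: src replaced with the URL copied by right clicking the uploaded image -->
<img src="https://example.com/images/figure-1-survey-results.jpg" width="400" height="300" alt="Figure 1">
```

Only the text between the quotes of the src attribute changes. The width, height, and other attributes can stay as Word wrote them.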

Notes

  • If you look at the source for this page you will see that it was originally created in Word. While I could have created it directly in DreamWeaver, I wanted you to be able to look at the code as an example.
  • I have found that it is easier to do all of this editing in DreamWeaver. I just copy all the HTML from Blogger’s HTML tab and paste it between the <body> tags in a new DreamWeaver HTML file. Then it is easy to resize the images if necessary without guessing at widths and heights. If I have to resize the image appreciably (which may be necessary if your blog’s main <div> is pretty narrow as in many templates) then I will then turn that image into a link to the full size version of the image for the user’s convenience. All you have to do is paste that same URL into the link field in the image properties bar in DreamWeaver and it automatically wraps the image in an <a> tag. When I am finished I just copy all the HTML code between the <body> tags and paste it into Blogger’s HTML tab, replacing the previous code.
  • In some of these instructions I have used Grant's Concise GUI Notation System (GCGUINS) which I have described in a separate blog post.
  • I should warn you that things do not always come out peachy keen.
    • If you have a deeply indented hierarchical list in your Word document then some levels of it will get turned into HTML lists and some of them will simply be turned into paragraphs with <p> tags instead.
    • If you have any sample HTML tags in your document (like I have lots of in this one) then pasting the text to Blogger's Compose tab will not convert those to appropriate HTML entities (&lt; or &gt; etc.). You will have to dig through the HTML code and switch them around. The easiest way to do that is to cut them from the code window in DreamWeaver and then paste them in the Design window. DreamWeaver will then convert them to the appropriate entities in the HTML code so they show up properly in the post.
    • If you subsequently edit your post in the Blogger editor's Compose tab then it may strip out some or all of the formatting. So, once you have done this you should always use Blogger's HTML tab or an external HTML editor to edit the post rather than the Compose tab.
  • I am sure that others will find many other problems with this technique. Sometimes it goes quickly and sometimes, as with this particular document, it takes a lot of additional editing. Only you can decide whether it will be faster to use this technique or to completely reformat your document in HTML. Feel free to post comments here or in the Blogger Help Forum about any other problems you find.
  • Even if you don't copy and paste from Word some of the other tricks here could help.
    • Using the <span class="UniqueName"> and </span> tags around the entire post to isolate it and then using span.UniqueName to create descendant selectors in your CSS will allow you to use different formatting in your post.
    • If you no longer have access to the original pictures used in your document then the technique shown here can get them out quickly and easily.
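
Two of the fixes mentioned in these notes can be sketched in HTML as follows. The sample sentence, image sizes, and example.com URLs are placeholders of my own choosing:

```html
<!-- Sample tags written as entities so they show up as text in the post -->
<p>Put a horizontal rule &lt;hr/&gt; under the introduction.</p>

<!-- A scaled-down image wrapped in a link to the full size version -->
<a href="https://example.com/images/figure-1.jpg">
  <img src="https://example.com/images/figure-1.jpg" width="300" height="225" alt="Figure 1">
</a>
```

Without the entities, the browser would treat the sample <hr/> as a real tag and draw a rule in the middle of the sentence instead of showing the text.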

This post is Copyright © 2009 by Grant Sheridan Robertson.

2 comments:

  1. hi, thanks for the helpful explanation
    though, i didn't understand everything...
    the thing is i'm looking to publicize the document as a template (meaning, when someone opens the blog, he can click on a link and open the document and even edit it
    any idea how to do that?

  2. Dear Anonymous,

    This post is about converting a Word document into an HTML format suitable for posting as a Blogger blog post. What you are trying to do is completely different. You simply want people to be able to click a link and download your file. This should be approached the same as with any other file. A Word document template is no different from any other file in this respect. You should search the Blogger help files for information about "posting a file" or "linking to a file" or "link to document."

    One thing I can tell you is that Blogger is not like a regular web site. You don't get a specific directory on a server somewhere, to which you can upload files and then create links to them. This means you must put the file somewhere else and then create a link to that. There are many posts in the Blogger help files and forums on this topic so I won't repeat it all here.

    If you don't know how to create links to files, it works the exact same way as for any other HTML. Just look for "HTML file link" or something like that.
