Converting Microsoft Word Files (doc, docx) to reStructuredText (rst)

This article describes how to convert Microsoft Word documents to reStructuredText. Everything should be done within a temporary directory with simplified filenames. So let’s assume you want to convert ‘am.docx’ to reStructuredText. The Word document can contain images.
You need:
A few simple steps:
  1. On the command line (either the old cmd or the PowerShell) go to the temporary directory that contains the Word document (e.g. C:\temp):
    cd c:\temp
  2. Convert ‘am.docx’ to ‘am.rst’ using pandoc
    pandoc.exe -f docx am.docx -t rst -o am.rst
  3. Extract the media files (e.g. images) from the Word document
    unzip .\am.docx

    and move it to current working directory

    mv .\word\media .
  4. All image files should be in the same file format, so convert eml and gif files to png.
    cd media

    to jump into the directory

    dir (to list all files)

    a) Either by hand:

    convert .\image2.gif .\image2.png
    convert .\image1.emf .\image1.png

    b) Or automatically by using mogrify (also part of ImageMagick):

    mogrify.exe -format png *.emf
    mogrify.exe -format png *.gif

    And clean up:

  5. rm *.gif
    rm *.emf
  6. Do not forget to search and replace .emf and .gif with .png in the .rst file with the editor of your choice (gvim or notepad++)
  7. Check the build by creating a quick Sphinx:
    run sphinx-quickstart (and follow the instructions)
    copy the file over the main doc in the source dir
    copy the media folder to source
    run “make.bat html” to create the a website and check the result.

Author: mpeintinger

Studied chemistry, practices Taekwondo, plays board gamess, loves science and technology.

Leave a Reply

Your email address will not be published. Required fields are marked *