Converting Microsoft Word Files (doc, docx) to reStructuredText (rst)

This article describes how to convert Microsoft Word documents to reStructuredText. Everything should be done within a temporary directory with simplified filenames. So let’s assume you want to convert ‘am.docx’ to reStructuredText. The Word document can contain images.
You need pandoc, an unzip tool, ImageMagick (for convert and mogrify) and, for the final check, Sphinx.
A few simple steps:
  1. On the command line (either the old cmd or PowerShell), go to the temporary directory that contains the Word document (e.g. C:\temp):
    cd c:\temp
  2. Convert ‘am.docx’ to ‘am.rst’ using pandoc
    pandoc.exe -f docx am.docx -t rst -o am.rst
  3. Extract the media files (e.g. images) from the Word document:
    unzip .\am.docx

    and move the media folder to the current working directory:

    mv .\word\media .
  4. All image files should be in the same file format, so convert emf and gif files to png.
    cd media

    to jump into the directory, then

    dir

    to list all files.

    a) Either by hand:

    convert .\image2.gif .\image2.png
    convert .\image1.emf .\image1.png

    b) Or automatically by using mogrify (also part of ImageMagick):

    mogrify.exe -format png *.emf
    mogrify.exe -format png *.gif

  5. And clean up:
    rm *.gif
    rm *.emf
  6. Do not forget to search and replace .emf and .gif with .png in the .rst file, using the editor of your choice (e.g. gvim or Notepad++).
  7. Check the build by creating a quick Sphinx project:
    run sphinx-quickstart (and follow the instructions)
    copy the .rst file over the main doc in the source dir
    copy the media folder to the source dir
    run “make.bat html” to create a website and check the result.
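
Steps 3 and 6 can also be scripted. The following is a minimal Python sketch, not part of the workflow above. It assumes that the .docx keeps its images under word/media/ (as the unzip step shows) and that the .rst file references images through image or figure directives. The function names extract_media and point_images_to_png are made up for illustration.

```python
import re
import zipfile
from pathlib import Path

# Matches an image/figure directive whose target ends in .emf or .gif.
_IMAGE_REF = re.compile(r"((?:image|figure)::\s+\S+)\.(?:emf|gif)\b", re.IGNORECASE)


def extract_media(docx_path, dest_dir="media"):
    """Copy every file under word/media/ in the .docx (a zip archive) into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(docx_path) as archive:
        for member in archive.namelist():
            # Skip directory entries and everything outside word/media/.
            if member.startswith("word/media/") and not member.endswith("/"):
                target = dest / Path(member).name
                target.write_bytes(archive.read(member))
                extracted.append(target.name)
    return sorted(extracted)


def point_images_to_png(rst_text):
    """Rewrite .emf/.gif image targets in reST text so they end in .png."""
    return _IMAGE_REF.sub(r"\1.png", rst_text)
```

Calling extract_media("am.docx") and then rewriting am.rst with point_images_to_png would stand in for steps 3 and 6. The ImageMagick conversion in step 4 is still needed, because the sketch only renames the references and does not convert any image.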

11 Replies to “Converting Microsoft Word Files (doc, docx) to reStructuredText (rst)”

        1. Well, not a similar procedure, but when you build the reStructuredText you have several build options. I would build an RTF file, open that with MS Word and save it from there as a doc or docx file.

  1. This was helpful so far, but how can I save embedded images into a specific folder instead of the default media folder, and update the inline image URLs to match? THANK YOU!
    It looks like nothing can do that so far.

    1. I was able to do that with some Python (I run pandoc from a Python script I’m using for document conversion).

      Here you go:

      import re

      # 'file' holds the document's base name (no extension); it is set elsewhere in the script
      pattern = re.compile(" media/")
      replacement = " " + file + "/"
      filename = file + ".rst"

      with open(filename, "rb") as f:
          content = f.read().decode("utf-8")

      # replace " media/" with " <file>/"
      content_new = pattern.sub(replacement, content)

      # write content_new back to the file
      with open(filename, "wb") as new_output:
          new_output.write(content_new.encode("utf-8"))

  2. Any idea how to get syntax highlighting from code samples in the word doc to transfer over to the RST? I cannot figure out how to go about this.

  3. Hi,
    I’m trying to convert old e-mails (*.pdf) to *.eml for import into Outlook or similar.

    Originally:
    - someone in my office exported the e-mails from Lotus Notes into a *.pdf portfolio

    Now:
    - I’m able to export the single e-mails (4300!!) from the portfolio in *.pdf format
    - each *.pdf contains metadata
    - I can convert *.pdf to *.docx
    - with RecoveryTools DOCX Migrator I can convert *.docx to *.eml

    So, I have one problem with the last conversion: the software creates the *.eml file with all the text of the *.docx in the body of the e-mail (including date, sender, etc.).

    Is it possible to create a script that extracts the metadata (date, sender, etc.) from the *.pdf / *.docx and creates the *.eml file?

    Thank you for your help,

    Best regards
    Luca

  4. Hi!
    Thanks for that!
    I found out that the current version of pandoc supports the option “--extract-media=DIR”, which extracts the media files directly so you don’t have to do it yourself. https://pandoc.org/MANUAL.html
    It also adapts the file paths in the rst to that directory. Here is a small batch file for Windows that converts all .docx files in the folder it is called from to .rst and extracts the media files into a folder named [filename]_media.

    FOR %%i IN ("%~dp0*.docx") do (
    pandoc.exe -f docx "%%~dpi%%~ni.docx" -t rst -o "%%~dpi%%~ni.rst" --extract-media="./%%~ni_media"
    )

    Maybe someone else needs the same.
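
The batch loop in the last reply can also be written in Python, which may be handy if pandoc is already being driven from a script. This is only a sketch under a few assumptions: pandoc is on the PATH, the command is run from inside the folder that holds the .docx files, and the helper names pandoc_command and convert_folder are invented for this example.

```python
import subprocess
from pathlib import Path


def pandoc_command(docx_name):
    """Build the pandoc argument list for one .docx file name."""
    stem = Path(docx_name).stem
    return [
        "pandoc",
        "-f", "docx", f"{stem}.docx",
        "-t", "rst", "-o", f"{stem}.rst",
        # Same naming scheme as the batch file: [filename]_media
        f"--extract-media=./{stem}_media",
    ]


def convert_folder(folder="."):
    """Run pandoc on every .docx in folder, using folder as the working directory."""
    for docx in sorted(Path(folder).glob("*.docx")):
        subprocess.run(pandoc_command(docx.name), cwd=folder, check=True)
```

Building the argument list in a separate function keeps the command easy to inspect before anything is executed, for example with print(pandoc_command("am.docx")).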
