This article describes how to convert Microsoft Word documents to reStructuredText. Do everything in a temporary directory and use short, simple filenames. The examples below assume you want to convert ‘am.docx’ to reStructuredText. The Word document can contain images.
You need:
- pandoc https://github.com/jgm/pandoc/releases/tag/1.16.0.2
- Microsoft Word
- ImageMagick (for image file conversion) http://www.imagemagick.org/script/binary-releases.php
- To test the rst file: Python (https://www.python.org/downloads/windows/) and Sphinx (pip install -U Sphinx)
A few simple steps:
- On the command line (either the classic cmd or PowerShell; the mv and rm commands used below are PowerShell aliases, in cmd use move and del instead), go to the temporary directory that contains the Word document (e.g. C:\temp):
cd c:\temp
- Convert ‘am.docx’ to ‘am.rst’ using pandoc
pandoc.exe -f docx am.docx -t rst -o am.rst
- Extract the media files (e.g. images) from the Word document
unzip .\am.docx
and move the extracted media folder to the current working directory
mv .\word\media .
- All image files should be in the same file format, so convert emf and gif files to png. Change into the media directory and list its files:
cd media
dir
a) Either by hand:
convert .\image2.gif .\image2.png
convert .\image1.emf .\image1.png
b) Or automatically by using mogrify (also part of ImageMagick):
mogrify.exe -format png *.emf
mogrify.exe -format png *.gif
And clean up:
rm *.gif
rm *.emf
- Do not forget to replace .emf and .gif with .png in the .rst file, using the editor of your choice (e.g. gvim or Notepad++)
- Check the build by creating a quick Sphinx project:
run sphinx-quickstart (and follow the instructions)
copy the .rst file over the main document in the source directory
copy the media folder to the source directory
run “make.bat html” to build the HTML site and check the result.
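
If you would rather script two of the manual steps above (pulling the media folder out of the Word document and rewriting the .emf/.gif references in the .rst file), the following is a minimal Python sketch, not part of the original procedure. It relies on the fact that a .docx file is a zip archive, which is also why unzip works on it. The function names extract_media and retarget_image_refs are made up for this illustration, and the replacement is as blunt as a search and replace in an editor: it also changes ".emf" or ".gif" where they appear in ordinary text. The image files themselves still have to be converted with ImageMagick as described.

```python
import re
import zipfile
from pathlib import Path


def extract_media(docx_path, dest_dir="media"):
    """Copy every file stored under word/media/ in the .docx into dest_dir.

    Hypothetical helper: it stands in for 'unzip' followed by moving
    word\\media to the working directory.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries and everything outside word/media/
            if name.startswith("word/media/") and not name.endswith("/"):
                target = dest / Path(name).name
                target.write_bytes(archive.read(name))
                extracted.append(target.name)
    return sorted(extracted)


# Matches a literal ".emf" or ".gif" at the end of a word, in any letter case
OLD_EXTENSION = re.compile(r"\.(?:emf|gif)\b", re.IGNORECASE)


def retarget_image_refs(rst_text):
    """Return rst_text with every .emf/.gif extension replaced by .png."""
    return OLD_EXTENSION.sub(".png", rst_text)


if __name__ == "__main__":
    # Assumes am.docx and am.rst are in the current directory (e.g. C:\temp)
    print(extract_media("am.docx"))
    rst_file = Path("am.rst")
    updated = retarget_image_refs(rst_file.read_text(encoding="utf-8"))
    rst_file.write_text(updated, encoding="utf-8")
```

Run it from the temporary directory after the pandoc step. Keeping the two helpers separate means you can use only the one you need.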