This article describes how to convert Microsoft Word documents to reStructuredText. Everything should be done within a temporary directory with simplified filenames. So let’s assume you want to convert ‘am.docx’ to reStructuredText. The Word document can contain images.
You need:
- pandoc https://github.com/jgm/pandoc/releases/tag/1.16.0.2
- Microsoft Word
- ImageMagick (for image file conversion) http://www.imagemagick.org/script/binary-releases.php
- To test the rst file: Python (https://www.python.org/downloads/windows/) and Sphinx (pip install -U Sphinx)
A few simple steps:
- On the command line (either the old cmd or the PowerShell) go to the temporary directory that contains the Word document (e.g. C:\temp):
cd c:\temp
- Convert ‘am.docx’ to ‘am.rst’ using pandoc
pandoc.exe -f docx am.docx -t rst -o am.rst
- Extract the media files (e.g. images) from the Word document
unzip .\am.docx
and move the media folder to the current working directory
mv .\word\media .
- All image files should be in the same file format, so convert emf and gif files to png. Change into the media directory:
cd media
and list all files:
dir
a) Either by hand:
convert .\image2.gif .\image2.png
convert .\image1.emf .\image1.png
b) Or automatically by using mogrify (also part of ImageMagick):
mogrify.exe -format png *.emf
mogrify.exe -format png *.gif
And clean up:
rm *.gif
rm *.emf
- Do not forget to search and replace .emf and .gif with .png in the .rst file with the editor of your choice (gvim or notepad++)
- Check the build by creating a quick Sphinx project:
run sphinx-quickstart (and follow the instructions)
copy the rst file over the main doc in the source dir
copy the media folder to the source dir
run “make.bat html” to create a website and check the result.
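The extraction and the search-and-replace steps above can also be scripted. What follows is only a minimal sketch, assuming Python 3 and nothing beyond the standard library. The function names extract_media and use_png_references are made up for illustration, and the image conversion itself is still left to ImageMagick.

```python
import re
import zipfile
from pathlib import Path


def extract_media(docx_path, target_dir="media"):
    """Copy every file under word/media/ in the docx into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    # A docx file is a zip archive; embedded images live in word/media/.
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            if name.startswith("word/media/") and not name.endswith("/"):
                destination = target / Path(name).name
                destination.write_bytes(archive.read(name))
                extracted.append(destination.name)
    return sorted(extracted)


def use_png_references(rst_text):
    """Point references to .emf and .gif files in media/ at .png instead."""
    return re.sub(r"(media/\S+?)\.(?:emf|gif)\b", r"\1.png", rst_text)
```

Calling extract_media("am.docx") would stand in for the unzip and mv steps, and use_png_references could be applied to the contents of am.rst after mogrify has produced the png files.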
This was extremely useful for me at just the right moment when I needed it. I wanted to convey my appreciation. THANKS!
You’re welcome. Let me know if you have questions. Maybe I can help.
Is there a similar procedure for converting rst back to doc/docx format?
Well, not a similar procedure, but when you build the reStructuredText you have several output options. I would build an rtf file, open that with MS Word and save it from there as a doc or docx file.
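Another route that may be worth trying: pandoc can read rst and write docx, so the conversion from the article can be run in the other direction. This is a rough sketch, assuming pandoc is on the PATH; the function names and the output filename are only examples, and how well the result matches the original Word styling is not something this sketch addresses.

```python
import subprocess


def rst_to_docx_command(rst_file, docx_file):
    """Build the pandoc call that reads rst and writes docx."""
    return ["pandoc", "-f", "rst", "-t", "docx", "-o", docx_file, rst_file]


def convert_rst_to_docx(rst_file, docx_file):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(rst_to_docx_command(rst_file, docx_file), check=True)
```

For example, convert_rst_to_docx("am.rst", "am_from_rst.docx") writes a new Word file next to the rst file without touching the original am.docx.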
This was helpful so far, but how can I save the embedded images into a specific folder instead of the default media folder, and update the inline image URLs to match? THANK YOU!
Looks like no one can do that so far.
I haven’t tried that yet. So I don’t know.
I was able to do that with some Python (I run pandoc from a Python script I’m using for document conversion).
Here you go:
import re

file = "am"  # base name of the document; example value taken from the article
pattern = re.compile(" media/")
replacement = " " + file + "/"
filename = file + ".rst"
with open(filename, "rb") as f:
    content = f.read().decode("utf-8")
# replace " media/" with " <file>/"
content_new = re.sub(pattern, replacement, content)
# write content_new back to the same file
with open(filename, "wb") as new_output:
    new_output.write(content_new.encode("utf-8"))
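The snippet above only rewrites the paths in the rst file. To keep the folder on disk and the references in sync, the two steps could be wrapped together. This is a sketch under assumptions: the function name move_media and its parameters are invented here, the media folder is expected to sit next to the rst file, and the new folder must not exist yet.

```python
import re
import shutil
from pathlib import Path


def move_media(rst_file, new_folder, old_folder="media"):
    """Rename the image folder next to rst_file and update references to it."""
    rst_path = Path(rst_file)
    base = rst_path.parent
    # Rename the folder first, so a failure here leaves the rst file untouched.
    shutil.move(str(base / old_folder), str(base / new_folder))
    content = rst_path.read_text(encoding="utf-8")
    # Only touch paths preceded by whitespace, as in ".. image:: media/x.png".
    pattern = re.compile(r"(?<=\s)" + re.escape(old_folder) + "/")
    updated = pattern.sub(lambda match: new_folder + "/", content)
    rst_path.write_text(updated, encoding="utf-8")
    return updated
```

With the files from the article, move_media("am.rst", "am") would rename media to am and rewrite every image path in am.rst accordingly.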
Any idea how to get syntax highlighting from code samples in the word doc to transfer over to the RST? I cannot figure out how to go about this.