This article describes how to convert Microsoft Word documents to reStructuredText. Everything should be done within a temporary directory with simplified filenames. So let’s assume you want to convert ‘am.docx’ to reStructuredText. The Word document can contain images.
- pandoc https://github.com/jgm/pandoc/releases/tag/126.96.36.199
- Microsoft Word
- ImageMagick (for image file conversion) http://www.imagemagick.org/script/binary-releases.php
- To test the rst file: Python (https://www.python.org/downloads/windows/) Sphinx (pip install -U Sphinx)
A few simple steps:
- On the command line (either the old cmd or PowerShell), go to the temporary directory that contains the Word document (e.g. C:\temp):
cd C:\temp
- Convert ‘am.docx’ to ‘am.rst’ using pandoc
pandoc.exe -f docx am.docx -t rst -o am.rst
- Extract the media files (e.g. images) from the Word document (a .docx file is a ZIP archive, so any unzip tool can unpack it)
and move them to the current working directory
mv .\word\media .
- All image files should be in the same file format, so convert emf and gif files to png.
cd media (to jump into the directory)
dir (to list all files)
a) Either by hand:
convert .\image2.gif .\image2.png
convert .\image1.emf .\image1.png
b) Or automatically by using mogrify (also part of ImageMagick):
mogrify.exe -format png *.emf
mogrify.exe -format png *.gif
And clean up:
rm *.gif
rm *.emf
- Do not forget to search and replace .emf and .gif with .png in the .rst file with the editor of your choice (gvim or notepad++)
- Check the build by creating a quick Sphinx project:
run sphinx-quickstart (and follow the instructions)
copy the .rst file over the main doc in the source dir
copy the media folder to source
run “make.bat html” to create the website and check the result.
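The extraction and search-and-replace steps above can also be scripted. The following is a minimal Python sketch, not part of the original workflow. It relies on the fact that a .docx file is a ZIP archive whose images live under word/media/. The function names extract_media and rewrite_image_extensions are made up for illustration. Like a plain search and replace in an editor, the second function rewrites every “.emf” and “.gif” in the text, not only those in image directives, and the images themselves still have to be converted with ImageMagick.

```python
import re
import zipfile
from pathlib import Path


def extract_media(docx_path, target_dir="media"):
    """Copy every file under word/media/ in the .docx into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries and anything outside word/media/.
            if name.startswith("word/media/") and not name.endswith("/"):
                destination = target / Path(name).name
                destination.write_bytes(archive.read(name))
                extracted.append(destination)
    return extracted


def rewrite_image_extensions(rst_text):
    """Point references to .emf and .gif files at .png files instead."""
    return re.sub(r"\.(emf|gif)\b", ".png", rst_text, flags=re.IGNORECASE)
```

With the example names from this article, extract_media("am.docx") would fill a media folder next to am.rst, and rewrite_image_extensions would be applied to the text read from am.rst before writing it back.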
11 Replies to “Converting Microsoft Word Files (doc, docx) to reStructuredText (rst)”
This was extremely useful for me at just the right moment when I needed it. I wanted to convey my appreciation. THANKS!
You’re welcome. Let me know if you have questions. Maybe I can help.
Is there a similar procedure for converting rst back to doc/docx format?
Well, not a similar procedure, but when you build the reStructuredText you have several output options. I would build an rtf file, open that with MS Word and save it from there as a doc or docx file.
This was helpful so far, but how can I save the embedded images into a specific folder instead of the default media folder, and update the inline image URLs accordingly? THANK YOU!
Looks like no one can do that so far.
I haven’t tried that yet. So I don’t know.
I was able to do that with some Python (I run pandoc from a Python script I’m using for document conversion)
Here you go:
import re

file = 'am'  # base name of the converted document (assumed here; set it to your own)
pattern = re.compile(' media/')
replacement = ' ' + file + '/'
filename = file + '.rst'
with open(filename, 'rb') as f:
    content = f.read().decode('utf-8')
# replace ' media/' with ' <file>/'
content_new = pattern.sub(replacement, content)
# write content_new to file
with open(filename, 'wb') as new_output:
    new_output.write(content_new.encode('utf-8'))
Any idea how to get syntax highlighting from code samples in the word doc to transfer over to the RST? I cannot figure out how to go about this.
I’m trying to convert old e-mail (*.pdf) to *.eml for import in Outlook or similar.
– someone in my office exported the e-mails from Lotus Notes as a *.pdf portfolio
– I’m able to export the single e-mails (4300!!) from the portfolio in *.pdf format
– in each *.pdf there are metadata
– I can convert *.pdf to *.docx
– with RecoveryTools DOCX Migrator I can convert *.docx to *.eml
So, I have one problem with the last conversion: the software creates the *.eml file with all the text of the *.docx in the body of the e-mail (including date, sender, etc.).
Is it possible to write a script that extracts the metadata (date, sender, etc.) from the *.pdf / *.docx and creates the *.eml file?
Thank you for your help,
P.S. I can see the metadata of the *.pdf (date, sender, etc.) with the software Advanced Renamer v. 3.88.
Thanks for that!
I found out that the current version of pandoc supports the option “--extract-media=DIR”, which extracts the media files directly so you don’t have to do it yourself. https://pandoc.org/MANUAL.html
It also adapts the file paths in the rst to match that directory. Here is a small batch file for Windows that converts all .docx files in the folder where the batch file is located to .rst and extracts the media files into a folder named [filename]_media.
FOR %%i IN ("%~dp0*.docx") DO (
    pandoc.exe -f docx "%%~dpi%%~ni.docx" -t rst -o "%%~dpi%%~ni.rst" --extract-media="./%%~ni_media"
)
Maybe somebody else needs the same.
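For anyone who already drives pandoc from Python, the batch file has a straightforward counterpart. This is only a sketch under a few assumptions: pandoc is on the PATH, the helper names pandoc_command and convert_folder are invented for illustration, and the media folder is created relative to the current working directory, as in the batch file.

```python
import subprocess
from pathlib import Path


def pandoc_command(docx_path):
    """Build the pandoc call for one .docx file, mirroring the batch file."""
    docx = Path(docx_path)
    rst = docx.with_suffix(".rst")
    media_dir = f"./{docx.stem}_media"
    return [
        "pandoc", "-f", "docx", str(docx),
        "-t", "rst", "-o", str(rst),
        f"--extract-media={media_dir}",
    ]


def convert_folder(folder="."):
    """Run pandoc once for every .docx file in folder."""
    for docx in sorted(Path(folder).glob("*.docx")):
        # check=True raises CalledProcessError if pandoc reports a failure.
        subprocess.run(pandoc_command(docx), check=True)
```

Calling convert_folder() in the directory that holds the Word files should produce one .rst file and one [filename]_media folder per document.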