March 1, 2016 by mpeintinger

Converting Microsoft Word Files (doc, docx) to reStructuredText (rst)

This article describes how to convert Microsoft Word documents to reStructuredText. Everything should be done within a temporary directory with simplified filenames. So let’s assume you want to convert ‘am.docx’ to reStructuredText. The Word document can contain images.

You need:

pandoc https://github.com/jgm/pandoc/releases/tag/1.16.0.2
Microsoft Word
ImageMagick (for image file conversion) http://www.imagemagick.org/script/binary-releases.php
To test the rst file: Python (https://www.python.org/downloads/windows/) and Sphinx (pip install -U Sphinx)

A few simple steps:

On the command line (either the old cmd or the PowerShell) go to the temporary directory that contains the Word document (e.g. C:\temp):
```
cd c:\temp
```
Convert ‘am.docx’ to ‘am.rst’ using pandoc
```
pandoc.exe -f docx am.docx -t rst -o am.rst
```
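If you would rather script this step, the same call can be wrapped in Python. This is only a minimal sketch, assuming `pandoc` is on the PATH; the helper names `build_pandoc_command` and `convert_docx` are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_pandoc_command(docx_path):
    """Return the pandoc argument list for converting a .docx file to .rst."""
    source = Path(docx_path)
    target = source.with_suffix(".rst")
    # Same options as the command above: read docx, write rst.
    return ["pandoc", "-f", "docx", str(source), "-t", "rst", "-o", str(target)]


def convert_docx(docx_path):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_pandoc_command(docx_path), check=True)


# Show the command that would be executed for am.docx.
print(build_pandoc_command("am.docx"))
```

Calling `convert_docx("am.docx")` from the temporary directory would then write `am.rst` next to the Word file.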
Extract the media files (e.g. images) from the Word document (a .docx file is a ZIP archive, so unzip can unpack it):
```
unzip .\am.docx
```
and move the media folder to the current working directory
```
mv .\word\media .
```
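The unzip and move steps above can also be done in one go with Python's standard `zipfile` module. The following is a sketch under the assumption that the images sit under `word/media/` inside the archive, as in the steps above; `extract_media` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_media(docx_path, out_dir="media"):
    """Copy every file under word/media/ in the .docx archive into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip everything outside word/media/ and skip directory entries.
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            target = out / Path(name).name
            target.write_bytes(archive.read(name))
            extracted.append(target)
    return extracted
```

For example, `extract_media("am.docx")` would create a `media` folder in the current directory without unpacking the rest of the document.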
All image files should be in the same file format, so convert emf and gif files to png.
```
cd media
```
to jump into the directory, and list all files:
```
dir
```
a) Either by hand:
```
convert .\image2.gif .\image2.png
convert .\image1.emf .\image1.png
```
b) Or automatically by using mogrify (also part of ImageMagick):
```
mogrify.exe -format png *.emf
mogrify.exe -format png *.gif
```
And clean up:
```
rm *.gif
rm *.emf
```
Do not forget to search and replace .emf and .gif with .png in the .rst file with the editor of your choice (gvim or Notepad++).
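The search and replace can also be scripted. The sketch below assumes that the images appear in the .rst file as `image::` or `figure::` directives ending in the file name, and only touches those lines so that ordinary text mentioning ".gif" stays untouched. The function names are hypothetical.

```python
import re
from pathlib import Path

# ".emf" or ".gif" at the very end of a line (case-insensitive).
_EXT = re.compile(r"\.(?:emf|gif)(?=\s*$)", re.IGNORECASE)


def point_images_to_png(rst_text):
    """Rewrite image/figure directive targets from .emf/.gif to .png."""
    lines = []
    for line in rst_text.splitlines(keepends=True):
        # Assumption: only directive lines carry image paths.
        if "image::" in line or "figure::" in line:
            line = _EXT.sub(".png", line)
        lines.append(line)
    return "".join(lines)


def rewrite_file(rst_path):
    """Apply the replacement to a file in place (UTF-8)."""
    path = Path(rst_path)
    text = path.read_text(encoding="utf-8")
    path.write_text(point_images_to_png(text), encoding="utf-8")
```

Running `rewrite_file("am.rst")` would then update the file in place; it is worth checking the result in an editor afterwards.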
Check the build by creating a quick Sphinx project:
run sphinx-quickstart (and follow the instructions)
copy the .rst file over the main doc in the source dir
copy the media folder to source
run “make.bat html” to create the website and check the result.
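The two copy steps can be sketched in Python as well. This assumes that sphinx-quickstart created a separate source directory and that the main document is called `index.rst` (the quickstart default, but check your own project); `stage_for_sphinx` is a made-up helper name.

```python
import shutil
from pathlib import Path


def stage_for_sphinx(rst_file, media_dir, source_dir, master_name="index.rst"):
    """Copy the converted .rst and its media folder into the Sphinx source dir."""
    source = Path(source_dir)
    # Overwrite the main document with the converted file.
    master = source / master_name
    shutil.copyfile(rst_file, master)
    # Copy the media folder next to it so relative image paths still resolve.
    # dirs_exist_ok requires Python 3.8 or newer.
    target_media = source / Path(media_dir).name
    shutil.copytree(media_dir, target_media, dirs_exist_ok=True)
    return master, target_media
```

After something like `stage_for_sphinx("am.rst", "media", "source")`, the build itself is still started with make.bat as described above.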

11 Replies to “Converting Microsoft Word Files (doc, docx) to reStructuredText (rst)”

GKLOUSE says:

September 14, 2018 at 8:06 am

This was extremely useful for me at just the right moment when I needed it. I wanted to convey my appreciation. THANKS!

1. mpeintinger says:
  
  September 19, 2018 at 5:12 pm
  
  You’re welcome. Let me know if you have questions. Maybe I can help.
  
  1. Yelena says:
    
    September 6, 2019 at 10:58 am
    
    Is there a similar procedure for converting rst back to doc/docx format?
    
    1. mpeintinger says:
      
      September 6, 2019 at 10:10 am
      
      Well, not a similar procedure, but when you build the reStructuredText you have several options for the output format. I would build an RTF file, open that with MS Word, and save it from there as a doc or docx file.
      
Lynndy says:

February 15, 2019 at 8:39 am

This was helpful so far, but how can I save embedded images into a specific folder instead of the default media folder, and update the inline image URLs to match? THANK YOU!
It looks like nothing can do that so far.

1. mpeintinger says:
  
  February 15, 2019 at 10:28 am
  
  I haven’t tried that yet. So I don’t know.
  
2. Dan Hessler says:
  
  March 6, 2020 at 1:18 pm
  
  I was able to do that with some Python (I run pandoc from a Python script I’m using for document conversion).
  
  Here you go:
  
  import re
  
  # `file` is assumed to hold the document's base name (without extension)
  pattern = re.compile(" media/")
  replacement = " " + file + "/"
  filename = file + ".rst"
  
  with open(filename, "rb") as f:
      content = f.read().decode("utf-8")
  
  # replace " media/" with " <file>/"
  content_new = re.sub(pattern, replacement, content)
  
  # write content_new to file
  with open(filename, "wb") as new_output:
      new_output.write(content_new.encode("utf-8"))
  
Dan Hessler says:

March 6, 2020 at 1:19 pm

Any idea how to get syntax highlighting from code samples in the word doc to transfer over to the RST? I cannot figure out how to go about this.

Luca says:

December 9, 2021 at 5:29 am

Hi,
I’m trying to convert old e-mail (*.pdf) to *.eml for import in Outlook or similar.

Originally:
– someone in my office exported e-mails from Lotus Notes to a *.pdf portfolio

Now:
– I’m able to export the single e-mails (4300!!) from the portfolio in *.pdf format
– each *.pdf contains metadata
– I can convert *.pdf to *.docx
– with RecoveryTools DOCX Migrator I can convert *.docx to *.eml

So, I have one problem with the last conversion: the software creates the *.eml file with all the text of the *.docx in the body of the e-mail (including date, sender, etc.).

Is it possible to create a script that extracts the metadata (date, sender, etc.) from the *.pdf / *.docx and creates the *.eml file?

Thank you for your help,

Best regards
Luca

Luca says:

December 9, 2021 at 6:10 am

P.S. I can see the metadata of the *.pdf (date, sender, etc.) with the software Advanced Renamer v. 3.88.

Simon says:

March 4, 2022 at 11:05 am

Hi!
Thanks for that!
I found out that the current version of pandoc supports the option “--extract-media=DIR”. This extracts the media files directly, so you do not have to do it yourself. https://pandoc.org/MANUAL.html
It also adapts the file paths in the rst to match that directory. Here is a small batch file for Windows which converts all .docx files in the folder where the batch file is located to .rst and extracts the media files into a folder named [filename]_media.

FOR %%i IN ("%~dp0*.docx") do (
    pandoc.exe -f docx "%%~dpi%%~ni.docx" -t rst -o "%%~dpi%%~ni.rst" --extract-media="./%%~ni_media"
)

Maybe somebody else needs the same.
