Answered Question

 
M$2 February 20, 2009 01:59 AM

Is there a good open source program for repairing malformed XML?

I'm trying to build an application that extracts the text from the document.xml part of a corrupted Word 2007 (docx) file. XML extractor modules seem to require well-formed XML, and if Word can't extract the text, it usually means the document.xml part is partial and no longer well-formed.

Best Answer  Chosen by Asker

 
February 20, 2009 09:47 PM
Wow... yikes. Hmmm. The unfortunate part is that if an XML document is not "well-formed", then as far as the parser is concerned, it might as well not be an XML document at all. It either is or it isn't; I know of no parser that can say something like "close enough."

On Unix, in a situation like this, I think I would head for the "strings" utility, which scrapes out everything that looks like text (i.e. 7-bit character strings of meaningful length) from a binary (i.e. 8-bit) file. If the XML portion you seek is stored in a fairly unencrypted way inside the docx file, you could scrape the text out that way and see what it takes to do a manual repair.

Sorry, that's the best I've got for now. I've attached a link to a version of strings for Windows if you need it. If I think of anything else I'll come back and comment on myself.

UPDATE: I just tried strings on a typical docx file and I don't think that's what you need. Sorry I could not be of more help. Are you saying that you do in fact have a document.xml file, a text file that is just not well-formed? Have you tried hitting it with a SAX or StAX (streaming) parser? Those only look at elements as they come up, so they might get you further through the file before failing.
Source(s):
http://blog.stevienova.com/2004/07/15/strings-for-windows-xp/
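
To illustrate the streaming-parser idea from the update above, here is a minimal Python sketch using the standard library's expat bindings. It is only a sketch under assumptions: it assumes the damaged document.xml still declares the usual WordprocessingML namespace near the top, and the function name salvage_text is made up for this example. It keeps whatever text the parser reported before the first error and drops the damaged tail.

```python
import xml.parsers.expat

# WordprocessingML namespace used by document.xml in Word 2007 files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def salvage_text(xml_bytes):
    """Return the text seen in w:t elements before the first parse error.

    Hypothetical helper for illustration; not part of any library.
    """
    # With a namespace separator, element names arrive as "uri localname".
    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    pieces = []
    inside_text = False

    def start(name, attrs):
        nonlocal inside_text
        if name == W_NS + " t":
            inside_text = True

    def end(name):
        nonlocal inside_text
        if name == W_NS + " t":
            inside_text = False
        elif name == W_NS + " p":
            pieces.append("\n")  # one line per paragraph

    def chars(data):
        if inside_text:
            pieces.append(data)

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    try:
        parser.Parse(xml_bytes, True)
    except xml.parsers.expat.ExpatError:
        pass  # stop at the first error and keep what was collected
    return "".join(pieces)
```

Anything after the point of damage is lost with this approach, so it only helps when the file is truncated or broken near the end.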

Answered by shakespearegeek
February 21, 2009 10:17 PM
I found a solution:

1. Repair the docx zip archive using the repair function of a zip program, such as the one found in Ccy's HaHa Zip.

2. Extract document.xml from the zip file.

3. Take that document.xml file, which is the part most often damaged, and run it through HTML Tidy with its default options as presented here: http://infohound.net/tidy/

HTML Tidy recovers the text well and puts it into an HTML file. The HTML is not formatted, so the text arrives as a single block in the web page.
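
For anyone who wants to script steps 2 and 3, here is a rough Python sketch. It is not HTML Tidy; it swaps in a plain regular expression as the "close enough" extractor. It assumes the archive has already been repaired (step 1), that the part lives at word/document.xml, and that Word's usual w: prefix is used. The function names are made up for this example.

```python
import html
import re
import zipfile

# Matches the opening <w:t> tag (with optional attributes) and the text
# after it, but not <w:tbl>, <w:tab/>, <w:tc> and similar elements.
TEXT_RUN = re.compile(r"<w:t(?:\s[^>]*)?>([^<]*)")


def read_document_xml(docx):
    """Step 2: pull word/document.xml out of an already repaired archive."""
    with zipfile.ZipFile(docx) as archive:
        raw = archive.read("word/document.xml")
    return raw.decode("utf-8", errors="replace")


def lenient_text(xml):
    """Step 3 stand-in: grab text runs without requiring well-formed XML."""
    paragraphs = []
    for chunk in xml.split("</w:p>"):
        runs = TEXT_RUN.findall(chunk)
        if runs:
            # Turn entities such as &amp; back into plain characters.
            paragraphs.append(html.unescape("".join(runs)))
    return "\n".join(paragraphs)
```

Unlike a real parser, this keeps going past damaged spots, and it returns one line per paragraph instead of a single block of text.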

February 22, 2009 02:08 AM
Also, here is an application that does exactly what we are discussing, but it did not work on my corrupt documents. Perhaps it needs a zip repair step added: http://www.codeproject.com/KB/office/ExtractTextFromDOCXs.aspx


Other Answers (1)

February 20, 2009 03:01 AM
XML Doctor is a smart XML editor for Windows that works with your DTD or schema files to help you edit or repair your document, using an intuitive tree view. Pick child elements and attributes from a list. Errors are flagged in red.

http://sourceforge.net/projects/xmldoctor/
Source(s):
http://sourceforge.net/projects/xmldoctor/


Answered by tunacrust
February 20, 2009 03:11 AM
The two corrupt document.xml files I have won't open in XML Doctor; I get fatal errors. Thanks, though.
