socrtwo 4
3 years, 3 months ago

Is there a good open source program for repairing malformed XML?

I'm trying to build an application that extracts the text from the document.xml part of a corrupted Word 2007 (.docx) file. XML extractor modules seem to require well-formed XML, and if Word can't extract the text, it usually means the document.xml part is incomplete and no longer well-formed.
Tip for best answer: M$2.00


2 Answers

shakespearegeek | 3 years, 3 months ago
Wow... yikes. Hmmm. The unfortunate part is that if an XML document is not "well-formed", then as far as the parser is concerned, it might as well not be an XML document at all. It either is or it isn't; I know of no parsers that can say something like "Close enough."

On Unix, in a situation like this, I think I would head for the "strings" utility, which scrapes out everything that looks like text (i.e. 7-bit character strings of meaningful length) from a binary (i.e. 8-bit) file. If the XML portion you seek is stored in a fairly unencrypted way inside the docx file, you could scrape the text out that way and see what it takes to do a manual repair.

Sorry, that's the best I've got for now. I've attached a link to a version of strings for Windows if you need it. If I think of anything else I'll come back and comment on myself.

UPDATE: Just tried strings on a typical docx file and I don't think that's what you need. Sorry I could not be of more help. Are you saying that you do in fact have a document.xml file, a text file that is just not well-formed? Have you tried hitting it with a SAX parser or a StAX (streaming) parser, which only looks at elements as they come up and might give you a higher-quality parse?
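To make the streaming-parser idea concrete, here is a minimal sketch in Python using the standard library's xml.sax module. It assumes the damaged document.xml has already been pulled out of the docx and that the text lives in w:t elements inside w:p paragraphs, as it does in WordprocessingML. The names TextCollector and salvage_text are made up for this illustration. The parser still raises an error when it reaches the broken part, but whatever the handler collected before that point is kept.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collects the text inside w:t elements as the parser streams past them."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._in_text = False

    def startElement(self, name, attrs):
        # Namespace processing is off by default, so names arrive as "w:t".
        if name == "w:t":
            self._in_text = True

    def endElement(self, name):
        if name == "w:t":
            self._in_text = False
        elif name == "w:p":
            # End of a paragraph: separate it from the next one.
            self.chunks.append("\n")

    def characters(self, content):
        if self._in_text:
            self.chunks.append(content)


def salvage_text(xml_bytes):
    """Return the text seen before the parser hit the malformed part."""
    handler = TextCollector()
    try:
        xml.sax.parseString(xml_bytes, handler)
    except xml.sax.SAXParseException:
        # Expected for a damaged file: keep what was collected so far.
        pass
    return "".join(handler.chunks)
```

Calling salvage_text(open("document.xml", "rb").read()) would return the paragraphs that come before the first error. Anything after the break is not recovered by this approach, so it is most useful when the file is simply cut off near the end.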

socrtwo | 3 years, 3 months ago

I found a solution:

1. Repair the docx zip archive with the zip repair function of a zip program, like the one found in Ccy's HaHa Zip.

2. Extract the document.xml from the zip file.

3. Take that document.xml file, which is most often the damaged part, and run it through HTML Tidy with its default options, as presented here: http://infohound.net/tidy/

HTML Tidy recovers the text well and puts it into an HTML file. The HTML is not formatted, so the text arrives as a single block in the web page.
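For anyone who would rather script steps 2 and 3, here is a rough Python sketch. It does not call Tidy; it swaps in a plain regular-expression scrape of the w:t text runs, which does not need well-formed XML. It assumes the zip itself is readable (step 1 already done) and that the main part sits at word/document.xml, the standard location in a docx. The function names are invented for this example.

```python
import html
import re
import zipfile

# A complete <w:t>...</w:t> run. Attributes such as xml:space are allowed;
# self-closing tags and other elements like <w:tab/> or <w:tbl> do not match.
W_T = re.compile(r"<w:t(?:\s[^>]*)?(?<!/)>(.*?)</w:t>", re.DOTALL)


def read_document_xml(docx_path):
    """Step 2: pull the raw document.xml text out of the (repaired) zip."""
    with zipfile.ZipFile(docx_path) as archive:
        raw = archive.read("word/document.xml")
    # Undecodable bytes are replaced rather than raising an error.
    return raw.decode("utf-8", errors="replace")


def text_from_damaged_xml(xml_text):
    """Step 3 stand-in: scrape text runs without parsing the XML."""
    paragraphs = []
    # Split on paragraph ends so each paragraph becomes its own line.
    for fragment in xml_text.split("</w:p>"):
        runs = W_T.findall(fragment)
        if runs:
            # Turn entities such as &amp; back into plain characters.
            paragraphs.append(html.unescape("".join(runs)))
    return "\n".join(paragraphs)
```

Usage would be something like text_from_damaged_xml(read_document_xml("repaired.docx")). Unlike the single block that comes out of Tidy, this keeps one line per paragraph, but a text run that was cut off before its closing tag is dropped.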

socrtwo | 3 years, 3 months ago

Also, here's an application that does exactly what we are discussing, but it did not work on my corrupt documents. Perhaps it needs added zip repair functionality: http://www.codeproject.com/KB/office/ExtractTextFromDOCXs.aspx

tunacrust | 3 years, 3 months ago
XML Doctor is a smart XML editor for Windows that works with your DTD or schema files to help you edit or repair your document, using an intuitive tree view. You pick child elements and attributes from a list, and errors are flagged in red.

http://sourceforge.net/projects/xmldoctor/

socrtwo | 3 years, 3 months ago

The two corrupt document.xml files I have won't open in XML Doctor; I get fatal errors. Thanks, though.
