Is there a good open source program for repairing malformed XML?
M$2 Answers
On Unix, in a situation like this, I think I would head for the "strings" utility, which scrapes out everything that looks like text (i.e. 7-bit character strings of meaningful length) from a binary (i.e. 8-bit) file. If the XML portion you seek is stored in a fairly unencrypted way inside the docx file, you could scrape the text out that way and see what it takes to do a manual repair.
Sorry, that's the best I've got for now. I've attached a link to a version of strings for Windows if you need it. If I think of anything else I'll come back and comment on myself.
UPDATE: I just tried strings on a typical docx file and I don't think that's what you need. Sorry I could not be of more help. Are you saying that you do in fact have a document.xml file, a text file that is just not well formed? Have you tried hitting it with a SAX or StAX (streaming) parser, which only looks at elements as they come up and might give you a higher quality parse?
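To illustrate the streaming-parser idea, here is a minimal sketch in Python using the standard library's `xml.sax` module. It collects character data as the parser reports it and stops at the first well-formedness error, returning whatever text was seen up to that point. The names `TextCollector` and `salvage_text` are made up for this example, and text sitting immediately before the error may not be reported, so treat it as a starting point rather than a full repair tool.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collects non-blank character data reported by the SAX parser."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def characters(self, content):
        # Skip whitespace-only runs between elements.
        if content.strip():
            self.pieces.append(content)


def salvage_text(xml_bytes):
    """Return (pieces, error): text seen before parsing stopped, and the
    parse error that stopped it (None if the document was well formed)."""
    handler = TextCollector()
    error = None
    try:
        xml.sax.parseString(xml_bytes, handler)
    except xml.sax.SAXParseException as exc:
        # The document is malformed here; keep what was collected so far.
        error = exc
    return handler.pieces, error


if __name__ == "__main__":
    damaged = b"<doc><p>Hello</p><p>World</p><p>oops<b></p></doc>"
    pieces, error = salvage_text(damaged)
    print(pieces)
    print("stopped at:", error)
```

The error object carries the line and column where parsing stopped, which can help locate the damaged spot for a manual fix.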
http://sourceforge.net/projects/xmldoctor/
I found a solution:
1. Repair the docx zip with the zip repair function of a zip program, such as the one found in Ccy's HaHa Zip.
2. Extract the document.xml from the zip file.
3. Take document.xml, which is the file most often damaged, and run it through Tidy HTML with its default options as presented here: http://infohound.net/tidy/
Tidy HTML recovers the text well and puts it into an HTML file. The HTML is not formatted, so the text will arrive as a single block in the web page.
There is also an application that does exactly what we are discussing, but it did not work on my corrupt documents. Perhaps it needs added zip repair functionality: http://www.codeproject.com/KB/office/ExtractTextFromDOCXs.aspx
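For anyone who wants to script steps 2 and 3, here is a rough Python sketch. It is not Tidy HTML: as a stand-in it simply strips tags with a regular expression, which is an assumption on my part about what is "good enough" for text recovery. It also assumes the zip itself is already readable (step 1 done) and that the main part lives at `word/document.xml`, the usual location in a docx. The function names are invented for the example.

```python
import html
import io
import re
import zipfile


def read_document_xml(docx_bytes):
    """Step 2: pull word/document.xml out of an (already repaired) docx zip."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read("word/document.xml")


def strip_tags(xml_bytes):
    """Step 3 stand-in: recover plain text without requiring well-formed XML."""
    text = xml_bytes.decode("utf-8", errors="replace")
    # Treat the end of each Word paragraph as a line break.
    text = re.sub(r"</w:p>", "\n", text)
    # Drop everything that looks like a tag, balanced or not.
    text = re.sub(r"<[^>]*>", "", text)
    # Turn entities such as &amp; back into plain characters.
    return html.unescape(text).strip()


if __name__ == "__main__":
    # Build a tiny docx-like zip in memory with a malformed document.xml.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "word/document.xml",
            "<w:document><w:body><w:p><w:r><w:t>First &amp; second</w:t></w:r></w:p>"
            "<w:p><w:r><w:t>Third</w:t></w:body>",
        )
    print(strip_tags(read_document_xml(buffer.getvalue())))
```

Unlike the single block of text Tidy produced for me, this keeps one line per paragraph, but it loses all other formatting.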