Wow... yikes. Hmmm. The unfortunate part is that if an XML document is not "well-formed", then as far as the parser is concerned, it might as well not be an XML document at all. It either is or it isn't; I know of no parsers that can say something like "Close enough."
In Unix, in a situation like this, I think I would head for the "strings" utility, which scrapes out everything that looks like text (i.e. 7-bit character strings of meaningful length) from a binary (i.e. 8-bit) file. If the XML portion you seek is stored in a fairly unencrypted way inside the docx file, you could scrape the text out that way and see what it takes to do a manual repair.
Sorry, that's the best I've got for now. I've attached a link to a version of strings for Windows if you need it. If I think of anything else I'll come back and comment on myself.
UPDATE: Just tried strings on a typical docx file and I don't think that's what you need (a docx is a ZIP archive, so its XML parts are compressed rather than stored as plain text). Sorry I could not be of more help. Are you saying that you do in fact have a document.xml file, a text file that is just not well-formed? Have you tried hitting it with a SAX parser or a StAX (streaming) parser? Those only look at elements as they come up, so they might hand you everything up to the point where the file breaks.
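As a rough illustration of the streaming idea, here is a minimal Python sketch using the standard library's `xml.sax`. It collects the character data inside `<w:t>` elements and simply keeps whatever was gathered when the parser hits the first well-formedness error. The names `TextCollector` and `salvage_text` are made up for this example, and it assumes the text runs use the usual `w:` prefix; it is not a repair tool, just a way to see how much text survives.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collects character data found inside <w:t> elements (assumed prefix 'w')."""

    def __init__(self):
        super().__init__()
        self.in_text = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "w:t":
            self.in_text = True
        elif name == "w:p" and self.chunks:
            # Start of a later paragraph: separate it from the previous one.
            self.chunks.append("\n")

    def endElement(self, name):
        if name == "w:t":
            self.in_text = False

    def characters(self, content):
        if self.in_text:
            self.chunks.append(content)


def salvage_text(xml_bytes):
    """Return the text collected before the parser gave up (hypothetical helper)."""
    handler = TextCollector()
    try:
        xml.sax.parseString(xml_bytes, handler)
    except xml.sax.SAXParseException:
        # Truncated or malformed input: keep what was collected so far.
        pass
    return "".join(handler.chunks)
```

Calling `salvage_text(open("document.xml", "rb").read())` on a truncated file should return the text of the runs that were complete before the break. Anything after the first error is lost, so this helps most when the damage is at the tail of the file.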
Source(s):
http://blog.stevienova.com/2004/07/15/strings-for-windows-xp/
Answered Question
M$2
February 20, 2009 01:59 AM
Is there a good open source program for repairing malformed XML?
I'm trying to build an application which will extract the text from the document.xml part of a corrupted Word 2007/docx file. XML extractor modules seem to require well-formed XML, and if Word can't extract the text, it usually means the document.xml part is partial and no longer well-formed.
Best Answer Chosen by Asker
February 20, 2009 09:47 PM
Other Answers (1)
February 20, 2009 03:01 AM
XML Doctor is a smart XML editor for Windows that works with your DTD or schema files to help you edit or repair your document, using an intuitive tree view. Pick child elements and attributes from a list. Errors are flagged in red. http://sourceforge.net/projects/xmldoctor/
Source(s):
http://sourceforge.net/projects/xmldoctor/
February 20, 2009 03:11 AM
The two corrupt document.xml files I have won't open in XML Doctor; I get fatal errors. Thanks anyway, though.
1. Repair the docx zip with the zip repair function of a zip program, like the one found in Ccy's HaHa Zip.
2. Extract the document.xml from the zip file.
3. Take that document.xml file, which is most often the damaged part, and run it through HTML Tidy with its default options as presented here: http://infohound.net/tidy/
HTML Tidy recovers the text well and puts it into an HTML file. The HTML is not formatted, so the text will arrive as just one block in the web page.
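For anyone who wants to script steps 2 and 3, here is a rough Python sketch. It pulls `word/document.xml` out of the docx with the standard `zipfile` module and then strips the markup with regular expressions. The regex pass is a crude stand-in for Tidy, not a reimplementation of it, and the helper names `read_document_xml` and `strip_tags` are invented for this example. Step 1 is not covered: if the archive itself is damaged, `zipfile` will most likely raise `zipfile.BadZipFile`, so repair the zip first.

```python
import html
import re
import zipfile


def read_document_xml(docx_file):
    """Return the raw bytes of word/document.xml from a .docx (ZIP) file."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read("word/document.xml")


def strip_tags(xml_bytes):
    """Crude text recovery: drop markup, keep paragraph breaks (hypothetical helper)."""
    text = xml_bytes.decode("utf-8", errors="replace")
    text = re.sub(r"</w:p>", "\n", text)   # paragraph ends become newlines
    text = re.sub(r"<[^>]*>", "", text)    # drop every complete tag
    text = re.sub(r"<[^>]*$", "", text)    # drop a tag cut off at the end of the file
    return html.unescape(text).strip()     # turn &amp; etc. back into characters
```

Usage would be along the lines of `print(strip_tags(read_document_xml("damaged.docx")))`. Unlike a real parser, this never refuses the input, which is the point here, but it also cannot tell body text from any other character data in the file.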