There are very few defensible cases for using a regular expression to parse HTML, as HTML can't be parsed correctly without a context-awareness that's very painful to provide even in a nontraditional regex engine. You can get part of the way there with a regex, but you'll need to do manual verifications. A regular expression may get you mostly what you want most of the time, but it will fail on very common cases. The Html Agility Pack gives you a robust solution that reduces the need to manually fix up the aberrations that result from naively treating HTML as a context-free grammar. If you can find a better or faster parser than the HTML Agility Pack, go for it, but please don't subject the world to more broken HTML hackery.

The question is too broad to be answered definitively. Are you talking about removing all tags from a real-world HTML document, like a web page? If so, you would have to, among other things, remove the <![CDATA[ and ]]> sequences from CDATA sections but leave their contents alone. That's just off the top of my head; I'm sure there's more. Once you've done all that, you'll end up with words, sentences and paragraphs run together in some places, and big chunks of useless whitespace in others.

But, assuming you're working with just a fragment and you can get away with simply removing all tags, a regex built from a few alternatives will do. Matching single- and double-quoted strings in their own alternatives is sufficient to deal with the problem of angle brackets in attribute values. I don't see any need to explicitly match the attribute names and other stuff inside the tag, like the regex in Ryan's answer does; the first alternative handles all of that.
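As a rough illustration of that approach, here is a minimal sketch of a tag-stripping regex with quoted strings in their own alternatives. The pattern, the TagStripper class and the StripTags method are written for this example and are assumptions, not the regex from the original answer. It assumes a small fragment with no stray "<" characters in the text, no comments, and no script or style blocks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical pattern for illustration only.
    //   [^>"']   any character that neither ends the tag nor opens a quoted value
    //   "[^"]*"  a double-quoted attribute value (may contain '>')
    //   '[^']*'  a single-quoted attribute value (may contain '>')
    private static readonly Regex TagPattern = new Regex(
        @"<(?:[^>""']|""[^""]*""|'[^']*')*>",
        RegexOptions.Compiled);

    // Removes everything that looks like a tag and leaves the text in between.
    public static string StripTags(string fragment)
    {
        return TagPattern.Replace(fragment, string.Empty);
    }

    public static void Main()
    {
        // The '>' inside the title attribute does not end the tag early.
        string fragment = "<p title=\"a > b\">Hello, <b>world</b>!</p>";
        Console.WriteLine(StripTags(fragment)); // prints: Hello, world!
    }
}
```

Because the first alternative accepts anything except a closing bracket or a quote, a bare "<" in ordinary text (as in "1 < 2") would be treated as the start of a tag, which is one of the failure cases a real parser avoids.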
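The correct answer is don't do that; use the HTML Agility Pack.

To shamelessly steal from the comment below by jesse, and to avoid being accused of inadequately answering the question after all this time, here's a simple, reliable snippet using the HTML Agility Pack that works with even the most imperfectly formed, capricious bits of HTML:

    // Requires the HtmlAgilityPack package, plus System.Linq, System.Text and System.Web.
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html); // html is a placeholder for the markup to process
    // Select every text node under <body>. SelectNodes returns null when nothing matches.
    var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText);
    StringBuilder output = new StringBuilder();
    foreach (string line in text)
        output.AppendLine(line);
    // Turn entities such as &amp; back into plain characters.
    string textOnly = HttpUtility.HtmlDecode(output.ToString());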
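The final decoding step matters because extracted text nodes still carry entities such as &amp;amp; in their raw form. HttpUtility lives in System.Web, which not every project references. The sketch below shows the same step with System.Net.WebUtility instead. The EntityDecoder class and the sample string are made up for illustration.

```csharp
using System;
using System.Net;

public static class EntityDecoder
{
    // Decodes HTML entities in text that has already had its tags removed.
    public static string Decode(string extractedText)
    {
        return WebUtility.HtmlDecode(extractedText);
    }

    public static void Main()
    {
        // Hypothetical input standing in for the text collected from the document.
        string extracted = "Fish &amp; chips &lt;3";
        Console.WriteLine(Decode(extracted)); // prints: Fish & chips <3
    }
}
```

Decoding should happen after the tags are gone. Decoding first would turn escaped text such as &amp;lt;b&amp;gt; into something that looks like real markup.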