How to Strip Tags in Three Easy Lessons
by Ben Doom
One particular request I've seen on CF-Talk and CF-Regex on numerous occasions is for a method for "stripping tags" from text. So I said to myself, "Self, let's write an article about this so people will stop asking, and know how to do it themselves." And so I shall.
This article will cover the basics of how to use Regular Expressions, one of the coolest and geekiest inventions ever, to rid our text of embedded tags.
As the title of this article implies, there are three steps to this process. The first step is to define our goal as tightly as possible. In this case, we need to define exactly how we want to strip tags. The second step is to look around to see if we are simply re-inventing the wheel, or if we have a problem to which there is no satisfactory solution available. Finally, in step three, we'll write some code.
Before I begin, I want you to understand that this is not meant to be a rigorous or complete study of regular expressions. It's designed to give you a good idea of where to start. Tweaking your code to suit your particular needs will almost certainly be necessary.
Step One: Define the Goal
As programmers, we've all gotten into messes because what seemed obvious to us wasn't necessarily so obvious to others. After all, when you say "strip tags," you know exactly what you mean. The question is, when you ask someone to help you do it, do they know what you want? The phrase "stripping tags" is a deceptively simple one. While it seems clear, people use it to mean all kinds of things. Some of its implied definitions have included:
- removing some or all of the tags in a given piece of text
- removing tag pairs along with enclosed content (i.e., removing links entirely)
- converting tags so that they will not be parsed by a browser
As you can see, it's important to define our goal in specific terms.
For the purpose of this article, we will do three things. First, we will remove all opening and closing anchor tags while leaving the contained text in place. Second, we will remove all script tags and their contents. Finally, we will remove any tag that contains a JavaScript call.
Step Two: Reinventing the Wheel
I'm frequently told not to reinvent the wheel. Most of the time, this is good advice. After all, there are some well constructed and sturdy regular expression "wheels" already available. Let's look at some other possible meanings of the phrase "strip tags" and some tools already available for the task.
One example I gave was to convert tags into content that displays as text rather than being interpreted and processed as code. When this question was asked, a dozen people responded, "htmleditformat()". Enough said about this one.
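For readers who haven't met it, here is a minimal sketch of what that answer looks like in practice. The sample string and variable names are made up for illustration; only HTMLEditFormat() itself comes from the discussion above.

```cfml
<cfscript>
// Hypothetical sample input; any string containing markup will do.
sample = '<b>Bold</b> & <a href="page.cfm">a link</a>';

// HTMLEditFormat() swaps <, >, & and " for HTML entities, so a browser
// displays the tags as text instead of rendering them.
escaped = HTMLEditFormat(sample);

writeOutput(escaped);
// The page source now contains:
// &lt;b&gt;Bold&lt;/b&gt; &amp; &lt;a href=&quot;page.cfm&quot;&gt;a link&lt;/a&gt;
</cfscript>
```

Nothing is removed here. The tags are still in the text; they have simply been made harmless for display.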
More commonly, I am asked about the possibility of stripping (or leaving) specific tags. For this, I generally point to a solution developed by S. Isaac Dealey: a UDF called StripTags() available from the Common Function Library (http://www.cflib.org/udf.cfm?ID=774).
Finally, there's a really simple regular expression to remove all HTML-style tags from a document. I'm sure you could find it by looking in the House of Fusion archives or by Googling. I'll leave this one as an exercise for the student.
However, we've already determined that we need a custom solution. Besides, you wouldn't learn anything if we didn't write some code. So, on to step three.
Step Three: Deciphering the Code
As a reminder, our task is threefold:
- We must remove the anchor tags while leaving the text between them in place.
- We must remove script tags and their contents.
- We must remove any tag that contains a JavaScript call.
- Remove <script> tags and their contents
Since this is the simplest of the three operations, we'll tackle it first by defining a regular expression pattern that describes a script tag.
<script.*?>
Though simple, this is a very powerful piece of code. The .*? section tells the regular expression that we're looking for anything at all (.), as much as necessary (*), but no more than we need to find the first > (?). So, this finds any tag which begins <script and ends in a >.
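To see why the question mark matters, it may help to compare the pattern with and without it. The following is only a sketch: the sample string is hypothetical, and HTMLEditFormat() is used purely so the results show up as text in a browser.

```cfml
<cfscript>
// Hypothetical sample: a script block followed by ordinary markup.
sample = '<script type="text/javascript">x = 1;</script><p>Hello</p>';

// Greedy: .* runs on to the LAST > in the string, so everything matches.
greedy = ReReplaceNoCase(sample, "<script.*>", "", "one");
// greedy is now an empty string

// Lazy: .*? stops at the FIRST >, so only the opening tag matches.
lazy = ReReplaceNoCase(sample, "<script.*?>", "", "one");
// lazy is now: x = 1;</script><p>Hello</p>

writeOutput("[" & HTMLEditFormat(greedy) & "]<br>");
writeOutput("[" & HTMLEditFormat(lazy) & "]");
</cfscript>
```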
Next, we need to represent a closing script tag:
</script>
and what's in between:
.*?
Therefore, the <script> block can be represented by concatenating these three things in order:
<script.*?>.*?</script>
Now, we use ReReplaceNoCase() to perform a case-insensitive search and replace across all the contents of the variable "text" (where we've ostensibly placed our target text):
text = ReReplaceNoCase(text, "<script.*?>.*?</script>", "", "all");
The first task is done.
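Here is a quick way to try it out. The sample text is invented for illustration, and the upper-case SCRIPT tag is there only to show the case-insensitive matching at work.

```cfml
<cfscript>
// Hypothetical target text with an upper-case script block in the middle.
text = '<p>Before</p><SCRIPT type="text/javascript">alert("hi");</SCRIPT><p>After</p>';

// Remove every script block, opening tag through closing tag.
text = ReReplaceNoCase(text, "<script.*?>.*?</script>", "", "all");
// text is now: <p>Before</p><p>After</p>

writeOutput(HTMLEditFormat(text));
</cfscript>
```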
- Remove <a> tags but keep the text between them
Not surprisingly, since we're performing much the same action as we did for <script> tags, our code will begin similarly. Specifically, we can begin by replacing "script" with "a":
text = ReReplaceNoCase(text, "<a.*?>.*?</a>", "", "all");
We've decided we want to keep the information between anchor tags. There are two ways to do that. One way would be to use two ReReplace() calls to remove only the opening and closing tags. However, a more elegant way is to store the text between the anchor tags and re-insert it.
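For comparison, the two-call approach might look something like the sketch below. The sample text is hypothetical, and the opening-tag pattern is simply the one we have been using so far.

```cfml
<cfscript>
// Hypothetical target text containing one link.
text = 'Visit <a href="http://www.houseoffusion.com">House of Fusion</a> today.';

// First call: remove each opening anchor tag, whatever its attributes.
text = ReReplaceNoCase(text, "<a.*?>", "", "all");

// Second call: remove each closing anchor tag.
text = ReReplaceNoCase(text, "</a>", "", "all");
// text is now: Visit House of Fusion today.

writeOutput(HTMLEditFormat(text));
</cfscript>
```

This works, but it scans the text twice and treats the opening and closing tags as unrelated, which is why the single-call version is the one developed here.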
In a regular expression, we use parentheses for (among other things) storing a string in a backreference. You can think of a backreference as a variable internal to a regex call. So we'll stick parentheses around the bit between the tags:
text = ReReplaceNoCase(text, "<a.*?>(.*?)</a>", "", "all");
That stores it, but we still have to re-insert it. Essentially, we want to replace the whole link with just the text. So instead of replacing the substring matched by our regular expression with an empty string, we will replace it with our backreference. Backreferences are numbered from one. Therefore, to call our first backreference, we use a backslash followed by the number one: "\1".
text = ReReplaceNoCase(text, "<a.*?>(.*?)</a>", "\1", "all");
Task two is done as well.
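The sketch below runs the pattern on some made-up text and then illustrates one of the tweaks you may find necessary. Because <a.*?> accepts any characters after the "a", it also matches tags such as <abbr>. The stricter pattern shown is only one possible way to tighten it, not the only one, and all sample strings and variable names are hypothetical.

```cfml
<cfscript>
// Hypothetical target text with two links.
text = 'See <a href="one.cfm">page one</a> and <a href="two.cfm">page two</a>.';
text = ReReplaceNoCase(text, "<a.*?>(.*?)</a>", "\1", "all");
// text is now: See page one and page two.

// A tag that merely starts with "a" trips up the simple pattern.
mixed = '<abbr>CF</abbr> <a href="x.cfm">docs</a>';
loose = ReReplaceNoCase(mixed, "<a.*?>(.*?)</a>", "\1", "all");
// loose is now: CF</abbr> <a href="x.cfm">docs

// Stricter: after the "a" allow only whitespace plus attributes, or nothing.
// That optional group becomes \1, so the link text moves to \2.
strict = ReReplaceNoCase(mixed, "<a(\s[^>]*)?>(.*?)</a>", "\2", "all");
// strict is now: <abbr>CF</abbr> docs

writeOutput(HTMLEditFormat(text) & "<br>");
writeOutput(HTMLEditFormat(loose) & "<br>");
writeOutput(HTMLEditFormat(strict));
</cfscript>
```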
- Remove all tags that call JavaScript
First, we know we want to find every tag that contains the word JavaScript, so let's start with some brackets and text:
< javascript >
We can make our pattern a little more explicit by knowing that JavaScript calls generally look like this:
< ="javascript:" >
Next, we need something to represent "any tag." We could turn to our old friend ".*?" but that pattern could cause some serious problems. For example, it would match
<b>="javascript:"</b>
Since that's not our goal, we can't use ".*?". But we know that the javascript comes before a closing bracket. So we will create a character class that allows anything except a closing bracket:
[^>]
We can use a similar technique to create a placeholder for the JavaScript function itself. We know that it cannot contain a double quote, since that's used to delimit the whole JavaScript call, so we define a character class of "not double quotes" like so:
[^"]
If we follow each of those character classes with "*", to allow zero or more of the characters it describes, we get
<[^>]*="javascript:[^"]*"[^>]*>
To clarify:
| < | Start with the open bracket of the tag containing the call. |
| [^>] | Followed by any character other than a closed bracket |
| * | As many times as necessary |
| ="javascript: | This begins a JavaScript call |
| [^"] | Not including any quote marks |
| * | As many times as necessary |
| " | Followed by a quotation mark |
| [^>] | Followed by any character other than a closed bracket |
| * | As many times as necessary |
| > | Ending with a closed bracket |
So now we simply pull out our delightfully case insensitive, regular expression based replace function:
text = ReReplaceNoCase(text, '<[^>]*="javascript:[^"]*"[^>]*>', '', 'all');
We've finished with our third task.
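As a rough check, here is the pattern run against some invented text that contains both a real JavaScript call and the false positive discussed earlier. The sample string and the doSomething() function name are hypothetical.

```cfml
<cfscript>
// Hypothetical target text: a harmless paragraph, a link that calls
// JavaScript, and bold text that merely contains the same characters.
text = '<p>Keep me.</p><a href="javascript:doSomething()">Click</a> <b>="javascript:"</b>';

// Single quotes delimit the pattern because it contains double quotes.
text = ReReplaceNoCase(text, '<[^>]*="javascript:[^"]*"[^>]*>', '', 'all');
// text is now: <p>Keep me.</p>Click</a> <b>="javascript:"</b>

writeOutput(HTMLEditFormat(text));
</cfscript>
```

Two things are worth noticing. The bold text survives, because [^>] refuses to cross the > that ends the <b> tag. And only the tag containing the call is removed, so in this isolated example its closing </a> is left behind; when the anchor replacement from the second task runs first, that link has already been reduced to its text.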
Where to Go Next
In many ways, regular expressions are like any programming language. It's a different idiom, I'll grant, but the process is the same. Define the problem; look for outside solutions; write or modify code as necessary.
Looking for more help? Here are some resources for the RegEx-challenged:
- Try the CF-RegEx list at the House of Fusion (http://www.houseoffusion.com) where a number of masters of CF and regular expressions answer questions both simple and complex.
- Download the Windows beta version of the free KDE Visual Regular Expression Editor from the CFRegex.com site (http://www.cfregex.com) and experiment with building more regular expressions.
- Want more information about Regular Expressions in general? Mastering Regular Expressions, 2nd Edition, from O'Reilly is great for beginners just starting out, intermediate users looking to expand their knowledge, or experts seeking a good reference.
Benjamin C. Doom is a dashing and debonair young geek from London, KY. He first encountered Regular Expressions in Perl as an Assistant to the Network Administrator (or, as he was more commonly known, Utility Geek) while attending Macalester College in St. Paul, MN. He currently earns his lunch money as a ColdFusion programmer at Moonbow Software, Inc. (http://www.moonbow.com)