Quest: Oracle: Regular expression to remove html tags

Wednesday, May 12, 2010

Oracle: Regular expression to remove html tags

There are two ways to strip HTML tags from a string

1) Try to extract everything between end html tag(>)and next starting html tag (<)

2) First approach becomes complicated when you have nested tags. So the solution is to replace everything that is between (<) and (>)

SQL below replaces all the tags with space.

Select Regexp_Replace('<td>source string <b>bold<\b> with html tags<\td>', '<[^>]+>','') from dual

4 comments:

HarrisMay 19, 2010 at 10:49 AM
This is good, but how do you replace special characters of html tags like &nbsp, &quot and many more.
ReplyDelete
Replies
RManniMay 19, 2010 at 8:39 PM
since &nbsp,&quot are constants, you can either use replace or regexp_replace

regexp_replace(source_string,'[&nbsp]|[&quot]','')
ReplyDelete
Replies
AnonymousJune 19, 2013 at 1:10 AM
Hi There ,

The information you shared is pretty helpful however how should i modify my statement
'SELECT
REGEXP_REPLACE(PPRO.DESCRIPTION, '<[^>]+>', '') "Dup_DESCRIPTION"
FROM WMSPT_PROJECTS_PROJECTS ppro'

to remove html tags Plus constants also in the same statement i.e what should be the pattern '<[^>]+>' ??
ReplyDelete
Replies
AnonymousJune 19, 2013 at 7:22 AM
Ohh I got it now,It should be :
SELECT PPRO.PROJECT_ID,
REGEXP_REPLACE(PPRO.DESCRIPTION, '<[^>]+>|\&(nbsp;)|(amp;)', ' ') "Dup_DESCRIPTION"
FROM WMSPT_PROJECTS_PROJECTS ppro

ReplyDelete
Replies

Add comment