Wednesday, May 12, 2010

Oracle: Regular expression to remove html tags

There are two ways to strip HTML tags from a string



1) Try to extract everything between end html tag(>)and next starting html tag (<)



2) First approach becomes complicated when you have nested tags. So the solution is to replace everything that is between (<) and (>)



SQL below replaces all the tags with space.


Select Regexp_Replace('<td>source string <b>bold<\b> with html tags<\td>', '<[^>]+>','') from dual

4 comments:

  1. This is good, but how do you replace special characters of html tags like &nbsp, &quot and many more.

    ReplyDelete
  2. since &nbsp,&quot are constants, you can either use replace or regexp_replace

    regexp_replace(source_string,'[&nbsp]|[&quot]','')

    ReplyDelete
  3. Hi There ,

    The information you shared is pretty helpful however how should i modify my statement
    'SELECT
    REGEXP_REPLACE(PPRO.DESCRIPTION, '<[^>]+>', '') "Dup_DESCRIPTION"
    FROM WMSPT_PROJECTS_PROJECTS ppro'

    to remove html tags Plus constants also in the same statement i.e what should be the pattern '<[^>]+>' ??

    ReplyDelete
  4. Ohh I got it now,It should be :
    SELECT PPRO.PROJECT_ID,
    REGEXP_REPLACE(PPRO.DESCRIPTION, '<[^>]+>|\&(nbsp;)|(amp;)', ' ') "Dup_DESCRIPTION"
    FROM WMSPT_PROJECTS_PROJECTS ppro

    ReplyDelete