Wednesday, August 3, 2011

Oracle: Regular expression to extract text from rtf string

After few trial and errors, figured out a way to extract text from rtf string format.

with rtf as ( select
'{\rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 MS Sans Serif;}{\f1\fswiss\fcharset0 Arial;}{\f2\froman\fcharset0 Times New Roman;}}
{\colortbl ;\red0\green0\blue255;}
\viewkind4\uc1\pard\lang1033\f0\fs17 This is first line of rtf string with 01/02/1950 and some $10.20 with +ive sign
\par
\par e-mail second line rtf@rtf.com
\par \pard\sb100\sa100\cf1\lang1036\f1\fs24 Yes, Third line of rtf. \cf0\f2
\par \f1\fs20 some text \f2\fs24
\par \f1\fs20 123 456 abc 7890\f2\fs24
\par \pard\lang1033\f0\fs17
\par }' str from dual)
select str, regexp_replace(str, '({\\)(.*)(})|(\\[^ ]*)|(})') from rtf;

Here is the break down of Regular expression. It has 3 parts
({\\)(.*)(}) | (\\[^ ]*) | (})

Replace
1) Anything that starts with {\ followed by any any character(s) and ends with }
or
2) Any thing that starts with \ and not followed by a space
or
3) Replace the last }