CF Function to Clean MS Word HTML Mess

I ran into trouble with clients using our CMS: they would copy their text from MS Word and paste it directly into the CMS editor. As you may already know, that causes problems because it brings along a lot of unwanted HTML and XML tags (what people on the internet call the MS Word mess). I used to paste MS Word content into the Design area of Dreamweaver, because it gives me control and options to clean out unwanted markup while pasting. You can find those options as follows:

Click in the Design area in Dreamweaver, open the Edit menu, and select “Paste Special…”.

Then select the cleaning option you want. Even after that, you may still need to do some RegExp replacements.

But, as you know, you can’t ask your clients to use Dreamweaver to clean up the MS Word mess.

Clean MSWord Mess Function

I decided to create a ColdFusion function that strips the unwanted HTML and XML tags from text pasted from MS Word. Here is the function:

<cffunction name="cleanWordMess" output="no" returntype="string">
 <cfargument name="inString" default="">
 <!--- if nothing was passed, return an empty string --->
 <cfif Not Len(Trim(arguments.inString))><cfreturn "" /></cfif>
 <!--- create a temporary variable to hold the passed text --->
 <cfset local.text = arguments.inString />
 <!--- remove the HTML comments (non-greedy, so text between two separate comments is kept) --->
 <cfset local.text = REReplace(local.text, "<!--.*?-->", "", "ALL") />
 <!--- remove most of the unwanted HTML attributes with their values --->
 <cfset local.text = REReplace(local.text, "[ ]+(style|align|valign|dir|class|id|lang|width|height|nowrap)=""[^""]*""", "", "ALL") />
 <!--- collapse runs of extra spaces and tabs into a single space --->
 <cfset local.text = REReplace(local.text, "\s{2,}", " ", "ALL") />
 <!--- remove extra spaces between tags --->
 <cfset local.text = REReplace(local.text, ">\s{1,}<", "><", "ALL") />
 <!--- remove any &nbsp; spaces between tags --->
 <cfset local.text = REReplace(local.text, ">&nbsp;<", "><", "ALL") />
 <!--- remove empty <b> tags --->
 <cfset local.text = REReplace(local.text, "<b></b>", "", "ALL") />
 <!--- remove empty <p> tags --->
 <cfset local.text = REReplace(local.text, "<p></p>", "", "ALL") />
 <!--- remove all unwanted tags, both opening and closing --->
 <cfset local.text = REReplace(local.text, "</?(span|div|o:p|p)>", "", "ALL") />
 <!--- collapse any repetition of &nbsp; into a single one --->
 <cfset local.text = REReplace(local.text, "(&nbsp;){2,}", "&nbsp;", "ALL") />
 <cfreturn local.text />
</cffunction>

Of course, the order of the replacements is very important. I think you know how to call this function... or do I need to explain? 🙂
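In case it does need explaining, here is a minimal sketch of a call. It assumes the function is defined in the same template (or pulled in with cfinclude); the variable names pastedHtml and cleanHtml and the sample markup are made up for illustration and are not from a real Word document.

```cfm
<!--- hypothetical sample of markup pasted from MS Word --->
<cfset pastedHtml = '<p class="MsoNormal" style="margin:0">Hello <span lang="EN-US">world</span></p>' />

<!--- run the pasted markup through the cleaner --->
<cfset cleanHtml = cleanWordMess(pastedHtml) />

<!--- escape the result so any remaining tags are visible in the browser --->
<cfoutput>#HTMLEditFormat(cleanHtml)#</cfoutput>
```

For this sample the attributes are stripped first and the bare p and span tags are removed afterwards, so the output should be just the text Hello world. That also shows why the order matters: the tag-removal pattern only matches tags that have no attributes left.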

Feel free to use the function personally, commercially .. whatever.