Ever since my first steps in WordPress many years ago, I have been bothered whenever I copied text from emails or from clients' old websites. I never understood why the WordPress editor, aka TinyMCE, doesn’t strip pasted content of all the classes, styles and tag attributes that we certainly don’t want to carry over, where they interfere with our own styling.

TinyMCE did very well when copying directly from Microsoft Word, removing at least most of the junk, but this handy feature was never implemented for other sources. That is especially unfortunate when you want to prevent clients from messing up their sites by pasting text from all kinds of sources. Many people nowadays don’t even use Microsoft Word but free alternatives such as OpenOffice or LibreOffice, which produce horrible HTML chaos behind the scenes. The tricky part is that such dirty code is invisible to the layman’s eye, and most of the time it doesn’t interfere with proper display. Yet it can happen that a client asks why parts of their content have a different color, font size or even font that cannot be reset no matter which buttons they use. Even the eraser button fails to clear some elements, and the “Paste as text” function removes all formatting, including the formatting we want to keep. A look into the HTML source then reveals the impurities: all kinds of classes, styles and attributes sitting in the HTML tags.

My ideal pasting process

I pondered what my ideal cleaning process would look like, and it is quite simple:

keep all formatting-relevant tags such as headings, paragraphs, strong and italic (this is why “paste as text” is not an option)
remove EVERYTHING inside HTML tags directly during the pasting process (catching classes, styles and all other expected or unexpected attributes, which this way never even appear)
with the exception of the A tag, which should not lose its HREF, ID and TARGET information
remove images completely to avoid hotlinking (after all, images should be uploaded to your own WordPress installation)
remove certain elements completely (such as iframe, nav, article, footer etc.; TinyMCE does this by default, but not for all structural elements)
replace br tags with p tags (sometimes we would like to keep line breaks, but much more often a source delivers all paragraphs as br, which we don’t want to correct manually)
replace div tags with p tags (we don’t want divs, but badly programmed websites sometimes use them as paragraphs)
and finally, because we might end up with double or triple empty paragraphs, remove those
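
To make the last three steps concrete, here is a minimal sketch in plain JavaScript that can be tried in a browser console or in Node. It is not the plugin code itself: the function name normalizeParagraphs is made up for illustration, and the explicit removal of empty paragraphs is my assumption about what "remove them" should mean.

```javascript
// Sketch only: normalizeParagraphs is a hypothetical helper, not part of the plugin.
function normalizeParagraphs(html) {
  return html
    // every line break becomes a paragraph boundary
    .replace(/<br[\s\/]*>/gi, '</p><p>')
    // opening and closing divs (with or without attributes) become paragraphs
    .replace(/<(\/?)div[^>]*>/gi, '<$1p>')
    // assumption: paragraphs that contain only whitespace are dropped
    .replace(/<p>\s*<\/p>/gi, '')
    // tidy whitespace left between neighbouring paragraphs
    .replace(/<\/p>\s+<p>/gi, '</p><p>');
}

console.log(normalizeParagraphs('<div class="x">One<br><br />Two</div>'));
```

The double br in the input would otherwise leave an empty paragraph between the two lines of text, which is the situation the last list item is about.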

With the help of two programmers I made my dream a reality, and I am amazed that a few lines of code achieve a result that makes my own life and that of all my clients so much better. Clean code, no unexpected hassle, and we can even copy a whole website into the editor and get just the plain content in perfectly formatted form. Here and there we might still need to repair one or the other format manually, but that is no comparison to how poorly it worked before.

As an example, this was the default result when copying the content from a website:

<h1 data-fontsize="42" data-lineheight="57">Herzlich willkommen bei der Deutschen Akademie für traditionelles Yoga e.V.</h1>
<p><img class="crazy_lazy alignright wp-image-1840 size-medium" src="http://traditionelles-yoga.de/wp-content/uploads/yoga-tantra-freiburg-couple-300x203.jpg" alt="yoga-tantra-freiburg-couple" width="300" height="203" data-src="http://traditionelles-yoga.de/wp-content/uploads/yoga-tantra-freiburg-couple-300x203.jpg" /></p>
<p class="nobottomgap">Nähere Informationen zu den einzelnen Zentren und deren Programm finden sie auf den Städteseiten Augsburg, <a href="http://traditionelles-yoga.de/yoga-in-berlin/">Berlin</a>  und <a href="http://traditionelles-yoga.de/yoga-in-stuttgart/">Stuttgart</a>.</p>
<div class="box">
    <h2 data-fontsize="32" data-lineheight="40">International Shaivism Camp II</h2>
    <h3 data-fontsize="24" data-lineheight="37">with Adinathananda (Nicolae Catrina) – 13.04 – 17.04.2017 – Berlin, Germany<img class="crazy_lazy size-full wp-image-4545 alignright" src="http://traditionelles-yoga.de/wp-content/uploads/event-shivaism-berlin-200x150.jpg" alt="" width="200" height="150" data-src="http://traditionelles-yoga.de/wp-content/uploads/event-shivaism-berlin-200x150.jpg" /></h3>
</div>
<p><iframe id="player_1" src="https://www.youtube.com/embed/BxrJqronVEc?feature=oembed&amp;enablejsapi=1&amp;wmode=opaque" width="1200" height="675" frameborder="0" allowfullscreen="allowfullscreen" data-mce-fragment="1"></iframe></p>

And this is the purified result, which really should be the standard:

<h1>Herzlich willkommen bei der Deutschen Akademie für traditionelles Yoga e.V.</h1>
<p>Nähere Informationen zu den einzelnen Zentren und deren Programm finden sie auf den Städteseiten Augsburg, <a href="http://traditionelles-yoga.de/yoga-in-berlin/">Berlin</a> und <a href="http://traditionelles-yoga.de/yoga-in-stuttgart/">Stuttgart</a>.</p>
<h2>International Shaivism Camp II</h2>
<h3>with Adinathananda (Nicolae Catrina) – 13.04 – 17.04.2017 – Berlin, Germany</h3>
<p>This spiritual retreat, unique due to the supramental approach of the Kashmir Shaivism tradition will offer, for the first time, among other techniques and esoteric tantric methods, the initiation in an exceptional traditional spiritual ceremony of awakening the formidable power of <strong>Shiva lingam</strong> (the essential godly creative power).</p>

Of course, if you don’t share my needs and want to keep iframes, you just remove the iframe tag from the list of processed tags. This is the code that does the magic, using the paste_preprocess hook of TinyMCE:

add_filter('tiny_mce_before_init', 'customize_tinymce');

function customize_tinymce($in) {
  $in['paste_preprocess'] = "function(pl,o){ 
  // remove the following tags completely:
    o.content = o.content.replace(/<\/*(applet|area|article|aside|audio|base|basefont|bdi|bdo|body|button|canvas|command|datalist|details|embed|figcaption|figure|font|footer|frame|frameset|head|header|hgroup|hr|html|iframe|img|keygen|link|map|mark|menu|meta|meter|nav|noframes|noscript|object|optgroup|output|param|progress|rp|rt|ruby|script|section|source|span|style|summary|time|title|track|video|wbr)[^>]*>/gi,'');
  // remove all attributes from these tags:
    o.content = o.content.replace(/<(div|table|tbody|tr|td|th|p|b|font|strong|i|em|h1|h2|h3|h4|h5|h6|hr|ul|li|ol|code|blockquote|address|dir|dt|dd|dl|big|cite|del|dfn|ins|kbd|q|samp|small|s|strike|sub|sup|tt|u|var|caption) [^>]*>/gi,'<$1>');
  // keep only href in the a tag (needs to be refined to also keep _target and ID):
  // o.content = o.content.replace(/<a [^>]*href=(\"|')(.*?)(\"|')[^>]*>/gi,'<a href=\"$2\">');
  // replace br tag with p tag:
    if (o.content.match(/<br[\/\s]*>/gi)) {
      o.content = o.content.replace(/<br[\s\/]*>/gi,'</p><p>');
    }
  // replace div tag with p tag:
    o.content = o.content.replace(/<(\/)*div[^>]*>/gi,'<$1p>');
  // remove double paragraphs:
    o.content = o.content.replace(/<\/p>[\s\\r\\n]+<\/p>/gi,'</p></p>');
    o.content = o.content.replace(/<p>[\s\\r\\n]+<p>/gi,'<p><p>');
    o.content = o.content.replace(/<\/p>[\s\\r\\n]+<\/p>/gi,'</p></p>');
    o.content = o.content.replace(/<p>[\s\\r\\n]+<p>/gi,'<p><p>');
    o.content = o.content.replace(/(<\/p>)+/gi,'</p>');
    o.content = o.content.replace(/(<p>)+/gi,'<p>');
  }";
  return $in;
}
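
The line for the a tag is commented out in the snippet because it keeps only href and still needs refining. One possible refinement, sketched here as standalone JavaScript, keeps href, id and target and drops everything else. The names ALLOWED_LINK_ATTRS and cleanAnchors are hypothetical, and this is an untested-in-WordPress idea rather than part of the plugin.

```javascript
// Sketch only: ALLOWED_LINK_ATTRS and cleanAnchors are hypothetical names.
const ALLOWED_LINK_ATTRS = ['href', 'id', 'target'];

function cleanAnchors(html) {
  // match opening <a ...> tags only; "<a" must be followed by whitespace,
  // so <abbr> or <address> are left alone
  return html.replace(/<a\s+([^>]*)>/gi, function (match, attrs) {
    const kept = [];
    // quoted attributes of the form name="value" or name='value'
    const attrPattern = /([a-z][\w-]*)\s*=\s*("([^"]*)"|'([^']*)')/gi;
    let m;
    while ((m = attrPattern.exec(attrs)) !== null) {
      const name = m[1].toLowerCase();
      const value = m[3] !== undefined ? m[3] : m[4];
      if (ALLOWED_LINK_ATTRS.indexOf(name) !== -1) {
        // re-emit with double quotes, escaping any double quote in the value
        kept.push(name + '="' + value.replace(/"/g, '&quot;') + '"');
      }
    }
    return kept.length ? '<a ' + kept.join(' ') + '>' : '<a>';
  });
}

console.log(cleanAnchors('<a class="x" href="http://example.com/" target="_blank" style="color:red">Link</a>'));
```

Unquoted attribute values are dropped by this sketch, and a literal > inside an attribute value would confuse it. To use the idea inside the PHP string above, the double quotes and backslashes would need escaping in the same way as in the existing lines.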

Just copy it into the functions.php of your child theme, or download it directly as a plugin (no warranty, no update guarantee, but very enjoyable as it is). Let me know if you have further improvements.
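
If you would rather adapt the list of removed tags than edit the long regular expression by hand (for example to keep iframes), the pattern could be built from an array. This is a sketch in standalone JavaScript with a shortened list. REMOVE_TAGS, buildRemovePattern and stripTags are hypothetical names, and the added word boundary is a small deviation from the original pattern.

```javascript
// Sketch only: hypothetical names, shortened tag list.
const REMOVE_TAGS = ['article', 'footer', 'iframe', 'img', 'nav', 'script', 'span', 'style'];

function buildRemovePattern(tags) {
  // \b prevents e.g. "nav" from matching a longer name such as "navigation"
  return new RegExp('<\\/?(?:' + tags.join('|') + ')\\b[^>]*>', 'gi');
}

function stripTags(html, tags) {
  // removes opening and closing tags; the text between them stays in place
  return html.replace(buildRemovePattern(tags), '');
}

// to keep iframes, take them out of the list before building the pattern
const keepIframes = REMOVE_TAGS.filter(function (tag) { return tag !== 'iframe'; });

const sample = '<p>Hi <span class="x">there</span><iframe src="a"></iframe></p>';
console.log(stripTags(sample, REMOVE_TAGS));
console.log(stripTags(sample, keepIframes));
```

As with the original pattern, only the tags themselves are removed, so any text that sat between an opening and closing tag remains in the pasted content.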

Gutenberg Editor

One caveat looms on the horizon: the newly developed Gutenberg editor, which will soon become the new core editor in WordPress, offers a “Classical Text” module that preserves the familiar experience for everyone who is not keen on the block-based approach and just wants to keep working with proper HTML code. So far, however, in dev version 0.8 of Gutenberg, pasting stops working completely when this plugin is activated. I hope the developers share my dream of a clean pasting experience and either keep this approach working or build it into core, so that whether users prefer blocks or the classic TinyMCE, pasting text from any source becomes a professional experience for web designers and less tech-savvy clients alike.

7 Comments

Paul on December 14, 2017 at 5:04 pm

This is awesome! Took me a while to find it. Thanks for sharing!
Febby on August 14, 2018 at 6:34 am

Hi, thank you clean copy plugin is very useful for me. I want to keep in touch with you about this plugin if in the future it doesn’t work, I really need this plugin. Thanks
AB on February 17, 2020 at 7:45 pm

This is really cool!!! Thank you for sharing!
Mara on August 22, 2020 at 5:05 am

Thank you so much for sharing this! This is incredibly helpful–are you still using this code? Any drawbacks or vulnerabilities you’ve noticed? Any additions you’ve made? If so, would love to know!
- Sofian on August 20, 2022 at 7:28 pm
  
  Hey Mara, I haven’t used it for quite a while, but recently a colleague gave me a page to import that was full of classes. So I might consider putting it back to work again. But for now, I don’t know if it works with current WP versions.
Chelle on May 13, 2021 at 5:53 am

In your code snippet, I noticed that you have the line below “// keep only href in the a tag (needs to be refined to also keep _target and ID):” also commented out–is that a mistake? That line shouldn’t be commented out, right?
- Sofian on August 20, 2022 at 7:20 pm
  
  Hey Chelle, has been a while since I’ve used that code but I remember now, that line is commented most likely because I wanted to first find a way that it does keep _target and ID. But never got to move further on this.

WordPress: Removing classes, styles and all unwanted tag attributes during paste process
