
Processing HTML using PHP's DomDocument class

Published May 16, 2018

In the previous tutorial I briefly presented regular expressions in PHP. In response, I received a number of comments all making the same point: parsing HTML with regular expressions is not good practice. As I answered one of the commenters, regular expressions are still useful for parsing when the structure of the string is known in advance. In this tutorial I offer a complementary approach based on DOMDocument, a built-in PHP class that can parse HTML, find elements, and change parts of the markup without the need for regular expressions.

This tutorial consists of 4 case studies:

  1. Case 1: How to automagically make responsive images from all the images in the page
  2. Case 2: How to make responsive videos from all the YouTube videos in the page
  3. Case 3: How to remove the style attributes from all the HTML elements in the page
  4. Case 4: How to automatically add rel="nofollow" to all the links

# Case 1: How to automagically make responsive images from all the images in the page

To make the images responsive, we use DOMDocument to wrap each image in a div that has the class 'responsive-img'.

function makeResponsiveImages($html = '')
{
  // Nothing to do for empty input (loadHTML() rejects an empty string)
  if (trim($html) === '') {
    return $html;
  }

  // Create a DOMDocument
  $dom = new DOMDocument();

  // Load the HTML as UTF-8 so that non-Latin text, like Hebrew, survives
  // (the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
  $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

  // Create the div wrapper
  $div = $dom->createElement('div');
  $div->setAttribute('class', 'responsive-img');

  // Get all the images
  $images = $dom->getElementsByTagName('img');

  // Loop over a snapshot of the images, because the node list is live
  // and the loop body moves nodes around in the tree
  foreach (iterator_to_array($images) as $image)
  {
    // Clone our created div
    $new_div_clone = $div->cloneNode();

    // Replace image with wrapper div
    $image->parentNode->replaceChild($new_div_clone, $image);

    // Append image to wrapper div
    $new_div_clone->appendChild($image);
  }

  // Save the HTML
  $html = $dom->saveHTML();

  return $html;
}

To use the DOMDocument class we first need to instantiate it:

$dom = new DOMDocument();

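Unless the markup declares its own encoding, loadHTML() treats the input as ISO-8859-1. For text in languages other than English, we therefore convert the UTF-8 input to HTML entities before loading it: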

$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8')); 
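
The 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated as of PHP 8.2. As a minimal sketch of one alternative, the helper below converts non-ASCII characters to numeric entities with mb_encode_numericentity() before loading, and collects libxml's parse warnings instead of printing them. The name loadUtf8Html is made up for this example and is not part of the original code.

```php
<?php
// Hypothetical helper, shown only as an illustration.
function loadUtf8Html(string $html): DOMDocument
{
    // Encode every code point from U+0080 upward as a numeric entity
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $encoded = mb_encode_numericentity($html, $map, 'UTF-8');

    // Collect parse warnings (for example about HTML5 tags) instead of emitting them
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($encoded);

    // Discard the collected warnings and restore the previous setting
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$dom = loadUtf8Html('<p>שלום</p>');

// The DOM stores text as UTF-8, so the Hebrew word comes back intact
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
```

The rest of the functions in this tutorial could call such a helper in place of the loadHTML(mb_convert_encoding(...)) line.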

We create a wrapping div for the images that has the class 'responsive-img':

$div = $dom->createElement('div');
$div->setAttribute('class', 'responsive-img');

To collect all the images in the HTML:

$images = $dom->getElementsByTagName('img');

Next, we loop through the images. For each one, we put a clone of the wrapping div in the image's place and then append the image to that clone.

At the end, we save the changes with:

$html = $dom->saveHTML();
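
Note that loadHTML() wraps a fragment in html and body elements, and saveHTML() without an argument serializes the whole document, doctype included. If only the processed fragment is wanted, one option is to serialize the children of the body one by one. A sketch, with bodyInnerHtml as a hypothetical helper name:

```php
<?php
// Hypothetical helper: returns the markup inside <body> only.
function bodyInnerHtml(DOMDocument $dom): string
{
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        // Passing a node to saveHTML() serializes just that node
        $html .= $dom->saveHTML($child);
    }

    return $html;
}

$dom = new DOMDocument();
$dom->loadHTML('<p>Hello</p><img src="cat.jpg" alt="A cat">');

$full = $dom->saveHTML();        // whole document, with the added wrappers
$fragment = bodyInnerHtml($dom); // only the paragraph and the image
```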

# Case 2: How to make responsive videos from all the YouTube videos in the page

The following code identifies YouTube iframes that are embedded in a given HTML input and changes them: it adds the autoplay parameter to the video URL, changes the dimensions of the video, and adds a class.

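function betterEmbeddedYoutubeVideos($html = '')
{
  // Nothing to do for empty input (loadHTML() rejects an empty string)
  if (trim($html) === '') {
    return $html;
  }

  $doc = new DOMDocument();

  $doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

  $docVideos = $doc->getElementsByTagName('iframe');

  foreach ($docVideos as $docVideo)
  {
    // Get the 'src' attribute of the iframe
    $docVideoSrc = $docVideo->getAttribute('src');

    // Parse the src url
    $docVideoSrcParts = parse_url($docVideoSrc);

    // Skip iframes that are not YouTube embeds
    if (empty($docVideoSrcParts['host']) || stripos($docVideoSrcParts['host'], 'youtube') === false) {
      continue;
    }

    // Protocol-relative URLs (//www.youtube.com/...) have no scheme
    $scheme = isset($docVideoSrcParts['scheme']) ? $docVideoSrcParts['scheme'] : 'https';
    $path = isset($docVideoSrcParts['path']) ? $docVideoSrcParts['path'] : '/';

    // Add the autoplay parameter to the src. The path already starts with '/'.
    // Any existing query string is dropped.
    $newVideoSrc = $scheme . '://' . $docVideoSrcParts['host'] . $path . '?autoplay=1';

    // Set the source
    $docVideo->setAttribute('src', $newVideoSrc);

    // Set the dimensions
    $docVideo->setAttribute('height', '433');
    $docVideo->setAttribute('width', '719');

    // Set the class
    $docVideo->setAttribute('class', 'embed-responsive-item');
  }

  $html = $doc->saveHTML();

  return $html;
}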
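
Since the loop visits every iframe in the input, it can help to check strictly that the src points at YouTube and to keep any query parameters the embed already has. A sketch of such a URL helper follows. The function name withAutoplay and the list of hosts are assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns the src with autoplay=1 added,
// or null when the URL is not a recognized YouTube embed.
function withAutoplay(string $src): ?string
{
    $parts = parse_url($src);
    if ($parts === false || empty($parts['host'])) {
        return null;
    }

    // Assumed list of hosts used by YouTube embeds
    $youtubeHosts = ['www.youtube.com', 'youtube.com', 'www.youtube-nocookie.com'];
    if (!in_array(strtolower($parts['host']), $youtubeHosts, true)) {
        return null;
    }

    // Keep existing query parameters and add (or overwrite) autoplay
    $query = [];
    parse_str($parts['query'] ?? '', $query);
    $query['autoplay'] = '1';

    // Protocol-relative URLs (//www.youtube.com/...) have no scheme
    $scheme = $parts['scheme'] ?? 'https';
    $path = $parts['path'] ?? '/';

    return $scheme . '://' . $parts['host'] . $path . '?' . http_build_query($query);
}

$a = withAutoplay('https://www.youtube.com/embed/abc123?rel=0');
$b = withAutoplay('//www.youtube.com/embed/abc123');
$c = withAutoplay('https://example.com/widget');
```

Inside the loop, a null result would mean "leave this iframe untouched".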

# Case 3: How to remove the style attributes from all the HTML elements in the page

My customers like to use WYSIWYG text-editing plugins that make data entry easier for them. Editing text this way inserts inline style attributes in all kinds of places, which makes it harder for search engines and screen readers to process the content of the page. The problem is therefore twofold: it hurts the promotion of the site in search engines, and it makes the site less accessible to people with disabilities who use screen readers to interact with it.

I address the problem with the following function, which removes the style attributes using the DOMDocument class.

function stripStyleTags($html = '')
{
  // Nothing to do for empty input (loadHTML() rejects an empty string)
  if (trim($html) === '') {
    return $html;
  }

  $dom = new DOMDocument();

  $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

  $xpath = new DOMXPath($dom);

  // Find any element with the style attribute
  $nodes = $xpath->query('//*[@style]');

  // Loop the elements
  foreach ($nodes as $node)
  {
    // Remove style attribute
    $node->removeAttribute('style');
  }

  $html = $dom->saveHTML();

  return $html;
}
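
The function above removes inline style attributes. If the editor also inserts whole style elements, they can be removed in a similar way. A sketch, assuming that every style element in the input should go; the function name stripStyleElements is made up for this example:

```php
<?php
// Hypothetical companion function, shown only as an illustration.
function stripStyleElements(string $html): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Copy the live node list to an array before removing nodes from the tree
    $styles = iterator_to_array($dom->getElementsByTagName('style'));

    foreach ($styles as $style) {
        $style->parentNode->removeChild($style);
    }

    return $dom->saveHTML();
}

$result = stripStyleElements(
    '<div><style>p { color: red; }</style><p>Text</p></div>'
);
```

Copying the node list first matters here, because removing nodes while iterating the live list can cause elements to be skipped.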

# Case 4: How to automatically add rel="nofollow" to all the links

The rel="nofollow" attribute is added to a link tag and asks search engine spiders not to follow the link away from the page. The technique is customarily used to avoid reducing the page's ranking in search results.

The code below automatically adds the attribute to all the links in a given HTML input.

function addRelNofollowToLinks($html) 
{
  $dom = new DOMDocument;                
            
  $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
            
  // Find any element which is a link
  $nodes = $dom->getElementsByTagName('a');  
            
  // Loop the elements
  foreach ($nodes as $node)               
  {             
    // Add the rel attribute
    $node->setAttribute('rel', 'nofollow');
  }
  
  $html = $dom->saveHTML(); 
            
  return $html;
}
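
Calling setAttribute('rel', 'nofollow') overwrites any rel value a link already has, and it marks internal links as well. A sketch of a more selective variant follows, assuming that a link counts as external when its href has a host different from a site host that you pass in. The function name addNofollowToExternalLinks and the $siteHost parameter are hypothetical.

```php
<?php
// Hypothetical variant, shown only as an illustration.
function addNofollowToExternalLinks(string $html, string $siteHost): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    foreach ($dom->getElementsByTagName('a') as $link) {
        $host = parse_url($link->getAttribute('href'), PHP_URL_HOST);

        // Relative links have no host; links to our own host are internal
        if (!is_string($host) || strcasecmp($host, $siteHost) === 0) {
            continue;
        }

        // Keep existing rel tokens such as "noopener" and add nofollow once
        $rel = strtolower($link->getAttribute('rel'));
        $tokens = preg_split('/\s+/', $rel, -1, PREG_SPLIT_NO_EMPTY);
        if (!in_array('nofollow', $tokens, true)) {
            $tokens[] = 'nofollow';
        }

        $link->setAttribute('rel', implode(' ', $tokens));
    }

    return $dom->saveHTML();
}

$result = addNofollowToExternalLinks(
    '<a href="/about">About</a> <a href="https://example.org/" rel="noopener">Out</a>',
    'example.com'
);
```

With this input, the internal /about link is left alone and the external link keeps its noopener token next to the new nofollow token.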

# Conclusion

Now that you know the basics of using the DOMDocument class to parse HTML, you can invest further in your professional skills and buy The Essentials of Object Oriented PHP, the easiest eBook in the field to learn from.
