PHP: Parsing HTML to find Links
Categories: PHP on Sep.03, 2008
Parsing HTML, finding links
With the default flag (PREG_PATTERN_ORDER), preg_match_all groups its results by capture: $matches[0] holds every full match, $matches[1] every first capture, $matches[2] every second capture, and so forth. A capture is the text matched by a part of the pattern enclosed in ():
# Original PHP code by Chirp Internet: www.chirp.com.au
# Please acknowledge use of this code by including this header.

$url = "http://www.example.net/somepage.html";
# double quotes are needed here so that $url is interpolated into the message
$input = @file_get_contents($url) or die("Could not access file: $url");
# \\1 is doubled so that PHP passes the backreference \1 through to the regex engine
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $input, $matches)) {
  # $matches[2] = array of link addresses
  # $matches[3] = array of link text - including HTML code
}
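To make the layout of $matches concrete, here is a minimal sketch that runs the same pattern against an inline HTML string rather than a downloaded page. The sample markup and the $html and $count variables are assumptions for illustration only. Note that the backreference is written \\1 inside the double-quoted string so that the regex engine receives \1.

```php
<?php
// Sample markup: an assumption for illustration, not taken from a real page.
$html = '<p><a href="/about.html">About <b>us</b></a> and '
      . '<a class="ext" href=http://www.example.net/>Example</a></p>';

// Same pattern as above; \\1 becomes the backreference \1 once PHP has parsed the string.
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

// Default flag is PREG_PATTERN_ORDER: one sub-array per capture group.
$count = preg_match_all("/$regexp/siU", $html, $matches);

// $matches[0] = full matches
// $matches[1] = the optional opening quote (empty for the unquoted href)
// $matches[2] = link addresses
// $matches[3] = link text, including any nested HTML
print_r($matches[2]); // '/about.html' and 'http://www.example.net/'
print_r($matches[3]); // 'About <b>us</b>' and 'Example'
```

Both the quoted and the unquoted href are picked up because the first capture, the opening quote, is optional and the backreference requires the same character (or nothing) to close the value.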
Using PREG_SET_ORDER, each matched link has its own array in the return value:
# Original PHP code by Chirp Internet: www.chirp.com.au
# Please acknowledge use of this code by including this header.

$url = "http://www.example.net/somepage.html";
# double quotes are needed here so that $url is interpolated into the message
$input = @file_get_contents($url) or die("Could not access file: $url");
# \\1 is doubled so that PHP passes the backreference \1 through to the regex engine
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
  foreach($matches as $match) {
    # $match[2] = link address
    # $match[3] = link text
  }
}
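As a rough sketch of how the PREG_SET_ORDER result might be consumed, the loop below collects each link into a small array of address and plain text. The sample markup, the $links array and the use of strip_tags to drop nested HTML from the link text are assumptions added for illustration; the original code leaves the loop body empty.

```php
<?php
// Sample markup: an assumption for illustration.
$html = '<ul><li><a href="one.html">One</a></li>'
      . '<li><a href="two.html"><em>Two</em></a></li></ul>';

// Same pattern as above; \\1 becomes the backreference \1 once PHP has parsed the string.
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

$links = [];
if (preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // $match[2] = link address, $match[3] = link text (may contain HTML)
        $links[] = [
            'href' => $match[2],
            'text' => strip_tags($match[3]), // keep only the visible text
        ];
    }
}

print_r($links);
```

With PREG_SET_ORDER the address and text of one link sit side by side in $match, so there is no need to index two parallel arrays.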
If you find any cases where this code falls down, let us know using the Feedback link below.
First checking robots.txt
As mentioned above, before using a script to download files you should always check the relevant robots.txt file. Here we’re making use of the robots_allowed function from the article linked above to determine whether we’re allowed to access the file:
# Original PHP code by Chirp Internet: www.chirp.com.au
# Please acknowledge use of this code by including this header.

ini_set('user_agent', 'NameOfAgent (http://www.example.net)');

$url = "http://www.example.net/somepage.html";
if(robots_allowed($url, "NameOfAgent")) {
  # double quotes are needed here so that $url is interpolated into the message
  $input = @file_get_contents($url) or die("Could not access file: $url");
  # \\1 is doubled so that PHP passes the backreference \1 through to the regex engine
  $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
  if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach($matches as $match) {
      # $match[2] = link address
      # $match[3] = link text
    }
  }
} else {
  die('Access denied by robots.txt');
}