Parsing HTML, finding links
With the default flag for preg_match_all (PREG_PATTERN_ORDER), the returned array contains an array of the full matches, then an array of the first ‘capture’, then an array of the second capture, and so forth. By capture we mean the patterns contained in ():

# Original PHP code by Chirp Internet: www.chirp.com.au # Please acknowledge use of this code by including this header.

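$url = "http://www.example.net/somepage.html";
# Double quotes are needed so that $url is interpolated into the error message
$input = @file_get_contents($url) or die("Could not access file: $url");
# \\1 is a back-reference to the optional opening quote; it must be escaped inside a double-quoted string
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

if(preg_match_all("/$regexp/siU", $input, $matches))
{
  # $matches[2] = array of link addresses
  # $matches[3] = array of link text - including HTML code
}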
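
To make the shape of the returned array concrete, here is a minimal sketch that runs the same pattern against a small inline HTML string instead of a fetched page. The HTML fragment is made up for illustration; only the regular expression comes from the code above.

```php
<?php
// Hypothetical inline HTML, used here instead of fetching a URL.
$input = '<p><a href="/one.html">First <b>link</b></a> and '
       . '<a class="x" href=two.html>second</a></p>';

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

// Default ordering (PREG_PATTERN_ORDER): one array per capture group.
$count = preg_match_all("/$regexp/siU", $input, $matches);

// $matches[0] = full matches
// $matches[1] = the optional opening quote ('"' or '')
// $matches[2] = link addresses: '/one.html', 'two.html'
// $matches[3] = link text:      'First <b>link</b>', 'second'
print_r($matches[2]);
print_r($matches[3]);
```

Both the quoted and the unquoted href are picked up, because the first capture group makes the opening quote optional and the back-reference requires the same character (or nothing) to close the value.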

Using PREG_SET_ORDER, each matched link has its own array in the return value:

# Original PHP code by Chirp Internet: www.chirp.com.au # Please acknowledge use of this code by including this header.

$url = "http://www.example.net/somepage.html";
$input = @file_get_contents($url) or die("Could not access file: $url");
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER))
{
  foreach($matches as $match)
  {
    # $match[2] = link address
    # $match[3] = link text
  }
}
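
As a rough illustration of why PREG_SET_ORDER is convenient, the sketch below collects each link into its own record. The HTML fragment, the $links array and the use of strip_tags to drop markup from the link text are assumptions made for this example, not part of the original code.

```php
<?php
// Hypothetical inline HTML, used here instead of fetching a URL.
$input = '<ul><li><a href="http://www.example.net/a.html" title="A">Page <em>A</em></a></li>'
       . '<li><a href="b.html">Page B</a></li></ul>';

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

$links = [];
if (preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // With PREG_SET_ORDER each $match holds all groups for one link.
        $links[] = [
            'href' => $match[2],
            'text' => strip_tags($match[3]), // remove HTML inside the link text
        ];
    }
}

print_r($links);
```

Each element of $links then pairs an address with its plain text, which is harder to get wrong than indexing two parallel arrays.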

If you find any cases where this code falls down, let us know using the Feedback link below.

First checking robots.txt

As mentioned above, before using a script to download files you should always check the relevant robots.txt file. Here we’re making use of the robots_allowed function from the article linked above to determine whether we’re allowed to access the file:

# Original PHP code by Chirp Internet: www.chirp.com.au # Please acknowledge use of this code by including this header.

ini_set('user_agent', 'NameOfAgent (http://www.example.net)');
$url = "http://www.example.net/somepage.html";
if(robots_allowed($url, "NameOfAgent"))
{
  $input = @file_get_contents($url) or die("Could not access file: $url");
  $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
  if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER))
  {
    foreach($matches as $match)
    {
      # $match[2] = link address
      # $match[3] = link text
    }
  }
}
else
{
  die('Access denied by robots.txt');
}
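
The robots_allowed function itself is not reproduced here. For readers who want to see the general idea, the following is a simplified, hypothetical stand-in, not the function from the linked article. It assumes the robots.txt content has already been downloaded into a string, handles only User-agent and Disallow lines with plain prefix matching, and applies both the * group and any group naming the agent, which is stricter than what real crawlers do.

```php
<?php
// Hypothetical helper: NOT the robots_allowed function from the linked article.
// $robotsTxt is the text of robots.txt, $path the URL path to check.
function robots_allowed_sketch(string $robotsTxt, string $path, string $agent): bool
{
    $applies  = false; // does the current group apply to our agent?
    $inAgents = false; // are we reading consecutive User-agent lines?

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            $hit = ($value === '*')
                || ($value !== '' && stripos($agent, $value) !== false);
            // Several User-agent lines in a row share the same rules.
            $applies  = $inAgents ? ($applies || $hit) : $hit;
            $inAgents = true;
        } else {
            $inAgents = false;
            // An empty Disallow value means nothing is blocked.
            if ($field === 'disallow' && $applies && $value !== ''
                && strpos($path, $value) === 0) {
                return false;
            }
        }
    }
    return true;
}

// Made-up robots.txt content for illustration.
$robots = "User-agent: *\nDisallow: /private/\n\nUser-agent: BadBot\nDisallow: /\n";

var_dump(robots_allowed_sketch($robots, '/somepage.html', 'NameOfAgent'));   // bool(true)
var_dump(robots_allowed_sketch($robots, '/private/x.html', 'NameOfAgent')); // bool(false)
var_dump(robots_allowed_sketch($robots, '/somepage.html', 'BadBot'));       // bool(false)
```

In a real script the path could be obtained with parse_url($url, PHP_URL_PATH). Allow lines and wildcard patterns are deliberately left out of this sketch.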