
Html Agility Pack Url Scraping-- Getting Full Html Link

Hi, I am using Html Agility Pack from NuGet to scrape a web page and get all of the URLs on the page. The code is shown below. However the way it returns to me

Solution 1:

You can check whether each href value is a relative or an absolute URL. Load the link into a Uri and test whether it is relative; if it is, convert it to an absolute URL by resolving it against the URL of the page you crawled.

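using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using HtmlAgilityPack;

static void Main(string[] args)
{
    List<string> linksToVisit = ParseLinks("https://www.facebook.com");
}

public static List<string> ParseLinks(string urlToCrawl)
{
    byte[] data;
    using (var webClient = new WebClient())
    {
        data = webClient.DownloadData(urlToCrawl);
    }

    // Assumes the page is UTF-8 encoded; ASCII would mangle non-ASCII characters.
    string download = Encoding.UTF8.GetString(data);

    // A HashSet drops duplicate links.
    HashSet<string> list = new HashSet<string>();

    var doc = new HtmlDocument();
    doc.LoadHtml(download);
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");

    // SelectNodes returns null (not an empty collection) when nothing matches.
    if (nodes != null)
    {
        foreach (var n in nodes)
        {
            string href = n.Attributes["href"].Value;
            list.Add(GetAbsoluteUrlString(urlToCrawl, href));
        }
    }
    return list.ToList();
}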

Function to convert a relative URL to an absolute one:

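static string GetAbsoluteUrlString(string baseUrl, string url)
{
    // Parse the href without assuming whether it is relative or absolute.
    var uri = new Uri(url, UriKind.RelativeOrAbsolute);
    // A relative href is resolved against the URL of the page it came from.
    if (!uri.IsAbsoluteUri)
        uri = new Uri(new Uri(baseUrl), uri);
    return uri.ToString();
}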
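
To make the effect of resolving against the page URL concrete, here is a minimal self-contained sketch of the same idea. It is not the code from the answer above: it uses Uri.TryCreate so that an href .NET cannot parse yields null rather than an exception. The names HrefResolver and TryResolveHref, and the example.com URLs in the note below, are made up for illustration.

```csharp
using System;

static class HrefResolver
{
    // Hypothetical helper: returns the absolute URL for href as seen from pageUrl,
    // or null when href cannot be parsed as a URL at all.
    public static string TryResolveHref(string pageUrl, string href)
    {
        var baseUri = new Uri(pageUrl, UriKind.Absolute);

        // Uri.TryCreate(Uri, string, out Uri) resolves a relative reference
        // against baseUri and leaves an already absolute href unchanged.
        return Uri.TryCreate(baseUri, href, out Uri resolved)
            ? resolved.AbsoluteUri
            : null;
    }
}
```

With the page URL https://example.com/docs/index.html, the href /foobar should resolve to https://example.com/foobar, page2.html to https://example.com/docs/page2.html, and https://other.example/x should come back unchanged.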

Solution 2:

You can't get the complete URL because the href attribute doesn't contain it. Example: <a href="/foobar"></a>. In your case the page contains relative URLs, so you need to prepend the base address yourself:

string href = email + n.Attributes["href"].Value;

This way you will have the full URL. A better solution is to check whether each URL is relative or absolute and to prepend email only when it is relative.
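
The check described above could look roughly like the following sketch. It assumes the prefix is the site root (for example https://example.com) and that relative links start from that root. The names HrefPrefixer, PrefixIfRelative and siteRoot are hypothetical and do not come from the question's code.

```csharp
using System;

static class HrefPrefixer
{
    // Hypothetical helper: prepend siteRoot only when href is not already
    // a full URL with a scheme.
    public static string PrefixIfRelative(string siteRoot, string href)
    {
        // True only for a well-formed absolute URL such as "https://host/path".
        if (Uri.IsWellFormedUriString(href, UriKind.Absolute))
            return href;

        // Join with exactly one slash so it is neither doubled nor missing.
        return siteRoot.TrimEnd('/') + "/" + href.TrimStart('/');
    }
}
```

Plain prefixing like this is only right for root-relative links such as /foobar. An href like ../other.html or page2.html depends on the path of the page it was found on, so it needs to be resolved against the full page URL rather than concatenated.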
