Playwright URL Scraping

Sample Code to get all URLs

December 10, 2024 by cryan.com

While experimenting with Playwright this week, I put together a script that grabs every absolute URL linked from a page and writes the list to a file. Here's the code I finally came up with:


import { test, expect } from '@playwright/test';
import * as fs from 'fs';

test('Extract and save my URLs from cryan.com', async ({ page }) => {
  // Navigate to the target URL
  await page.goto('https://www.cryan.com');
  // Extract the href attribute of every <a> tag that has one
  const links = await page.$$eval('a[href]', (anchors) => anchors.map((a) => a.getAttribute('href')));
  // Keep only absolute http(s) URLs; this drops relative links and empty strings
  const filteredLinks = links.filter((link) => link?.startsWith('http') && link.trim() !== '');
  // De-duplicate, then save the URLs to a file, one per line
  const uniqueLinks = [...new Set(filteredLinks)];
  await fs.promises.writeFile('/Users/cryan/Desktop/url.txt', uniqueLinks.join('\n'), 'utf8');
  // Assertions to validate extracted URLs
  expect(uniqueLinks.length).toBeGreaterThan(0); // Assert at least one URL found
});

This approach is particularly useful when you need to confirm that the anchor tags on the homepage are present and carry usable links. Checking the anchor tags in their own test isolates problems with missing or misconfigured links, which makes them easier to pinpoint and fix.
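One thing to keep in mind: the filter in the script throws away relative links (a path like /about, for example). If you would rather keep them, one option is to resolve each href against the page URL first. Here's a minimal sketch of that idea as a standalone function. The name collectAbsoluteLinks and its parameters are hypothetical, made up for illustration; they are not part of the script above or of Playwright.

```typescript
// Hypothetical helper (not from the original script): resolve hrefs against a
// base URL, keep only http(s) links, and de-duplicate in first-seen order.
function collectAbsoluteLinks(hrefs: Array<string | null>, baseUrl: string): string[] {
  const seen = new Set<string>();
  for (const href of hrefs) {
    // Skip missing or empty href values
    if (!href || href.trim() === '') continue;
    let resolved: URL;
    try {
      // The URL constructor resolves relative paths against baseUrl
      resolved = new URL(href.trim(), baseUrl);
    } catch {
      // Skip values that cannot be parsed as a URL
      continue;
    }
    // Drop schemes such as mailto: or javascript:
    if (resolved.protocol === 'http:' || resolved.protocol === 'https:') {
      seen.add(resolved.href);
    }
  }
  return Array.from(seen);
}
```

Inside the test, you could pass it the links array along with page.url() as the base, then write the result to the file as before.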

Additionally, I'll create another test specifically to validate that the URLs associated with these anchor tags are correct. This two-pronged strategy ensures that both the structure and the destinations of your links are accurate.
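As a rough idea of what that second test could be built on, here's a minimal sketch of a link checker. Everything in it is an assumption for illustration: the names checkLinks, StatusChecker, and LinkResult are hypothetical, and the status lookup is passed in as a function so the logic can run without touching the network.

```typescript
// Hypothetical types and names; none of this comes from Playwright itself.
type StatusChecker = (url: string) => Promise<number>;

interface LinkResult {
  url: string;
  status: number | null; // null when the request itself failed
  ok: boolean;
}

// Check each URL in turn and record whether it answered with a 2xx or 3xx status.
async function checkLinks(urls: string[], getStatus: StatusChecker): Promise<LinkResult[]> {
  const results: LinkResult[] = [];
  for (const url of urls) {
    try {
      const status = await getStatus(url);
      results.push({ url, status, ok: status >= 200 && status < 400 });
    } catch {
      // The lookup threw (for example a network or DNS failure)
      results.push({ url, status: null, ok: false });
    }
  }
  return results;
}
```

In a Playwright test, the built-in request fixture could serve as the checker, for example (url) => request.get(url).then((r) => r.status()), with the URL list read back from the file the first test wrote. The test would then assert that no result has ok set to false.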

Pro Tip: The reason for separating these tasks, instead of validating the URLs while scraping the homepage, is to enhance the efficiency of your test execution. By dividing the workload into smaller, targeted tests, you can leverage parallel execution to speed up the overall testing process. This approach not only reduces the total runtime of your test suite but also provides clearer insights into potential issues, allowing you to debug faster and more effectively.
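For reference, Playwright runs test files in parallel by default, while tests inside a single file run in order. A minimal playwright.config.ts sketch that also parallelizes tests within a file might look like the following. The values are assumptions, not my actual settings; the worker count of 4 in particular is arbitrary, so tune it for your machine.

```typescript
// playwright.config.ts (sketch; the values here are example assumptions)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Run tests within each file in parallel, not just separate files
  fullyParallel: true,
  // Number of parallel worker processes; 4 is an arbitrary example value
  workers: 4,
});
```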
