There are several good reasons you might need to locate every URL on a website, but your precise aim will determine what you're searching for. For instance, you may want to:
Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, particularly for website migrations
Find all 404 URLs to recover from post-migration problems
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.
In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.
Outdated sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.
Archive.org
Archive.org is a useful tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.
That said, there are a few constraints:
URL limit: You can only retrieve up to 10,000 URLs, which may be insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To work around the lack of an export button, use a browser scraping plugin like Dataminer.io. Still, these constraints mean Archive.org may not offer a complete solution for larger websites. Also, Archive.org doesn't reveal whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
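If you'd rather not scrape the UI at all, Archive.org's Wayback Machine also exposes a public CDX API you can query directly. Below is a minimal Python sketch of that approach; the domain and limit are placeholders, so treat it as a starting point rather than a polished tool.

```python
# Pull captured URLs for a domain from the Wayback Machine's CDX API.
import requests

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def wayback_urls(domain, limit=10000):
    """Return URLs the Wayback Machine has captured for the given domain."""
    params = {
        "url": f"{domain}/*",     # every path under the domain
        "output": "json",
        "fl": "original",         # only return the original URL column
        "collapse": "urlkey",     # collapse repeated captures of the same URL
        "limit": limit,
    }
    rows = requests.get(CDX_ENDPOINT, params=params, timeout=60).json()
    # The first row is the header (["original"]); the rest are one-column rows.
    return [row[0] for row in rows[1:]]

if __name__ == "__main__":
    for url in wayback_urls("example.com"):
        print(url)
```

Like the URLs view in the web interface, this list will include resource files and long-dead pages, so expect to filter it afterwards.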
Moz Pro
Although you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.
How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs on your site. If you're managing a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.
It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most websites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.
Links reports:
Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you may need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.
Performance → Search results:
This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
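If you do go the API route, the Search Analytics endpoint can page through far more rows than the UI export allows. Here's a minimal sketch assuming a service account JSON key that has been granted access to the property; the key file name, site URL, and dates are placeholders.

```python
# List every page with search impressions via the Search Console API.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]

def gsc_pages(site_url, start_date, end_date, key_file="service-account.json"):
    creds = service_account.Credentials.from_service_account_file(key_file, scopes=SCOPES)
    gsc = build("searchconsole", "v1", credentials=creds)
    pages, start_row = [], 0
    while True:
        body = {
            "startDate": start_date,
            "endDate": end_date,
            "dimensions": ["page"],
            "rowLimit": 25000,   # current per-request maximum
            "startRow": start_row,
        }
        resp = gsc.searchanalytics().query(siteUrl=site_url, body=body).execute()
        rows = resp.get("rows", [])
        pages.extend(row["keys"][0] for row in rows)
        if len(rows) < 25000:
            break
        start_row += 25000
    return pages

if __name__ == "__main__":
    print(len(gsc_pages("https://example.com/", "2024-01-01", "2024-03-31")), "pages")
```

The same query works for a domain property if you pass `sc-domain:example.com` as the site URL.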
Indexing → Pages report:
This section offers exports filtered by issue type, though these are also limited in scope.
Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.
Even better, you can apply filters to create distinct URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:
Step 1: Add a segment to your report
Step 2: Click "Create a new segment."
Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/
Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they still provide valuable insights.
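If clicking through segments for every URL pattern gets tedious, the GA4 Data API can pull page paths programmatically. The sketch below uses the google-analytics-data Python package and assumes application-default credentials with access to the property; the property ID and date range are placeholders.

```python
# Pull page paths seen by GA4 over a date range via the Data API.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

def ga4_page_paths(property_id, start_date="90daysAgo", end_date="today"):
    client = BetaAnalyticsDataClient()  # reads GOOGLE_APPLICATION_CREDENTIALS
    request = RunReportRequest(
        property=f"properties/{property_id}",
        dimensions=[Dimension(name="pagePath")],
        metrics=[Metric(name="screenPageViews")],
        date_ranges=[DateRange(start_date=start_date, end_date=end_date)],
        limit=100000,
    )
    response = client.run_report(request)
    return [row.dimension_values[0].value for row in response.rows]

if __name__ == "__main__":
    for path in ga4_page_paths("123456789"):
        print(path)
```

Remember that these are paths, not full URLs, so prefix them with your hostname before combining them with the other lists.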
Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive record of every URL path requested by users, Googlebot, or other bots during the recorded period.
Things to consider:
Data size: Log files can be huge, so many sites only retain the last two months of data.
Complexity: Analyzing log files can be challenging, but many tools are available to simplify the process.
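If you'd rather not reach for a dedicated log analyzer, even a short script can extract the unique paths. This is a minimal sketch for a log in the common/combined Apache-style format; the file name and hostname are placeholders, and CDN formats vary, so adjust the regex to match your logs.

```python
# Extract unique requested paths from an access log.
import re

# Matches the request line, e.g. "GET /blog/post-1 HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

def paths_from_log(log_file):
    paths = set()
    with open(log_file, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = REQUEST_RE.search(line)
            if match:
                # Drop query strings so /page?utm=x and /page collapse together
                paths.add(match.group(1).split("?")[0])
    return sorted(paths)

if __name__ == "__main__":
    for path in paths_from_log("access.log"):
        print("https://example.com" + path)
```

Filtering the same log by user agent also lets you build a Googlebot-only list, which is handy when you care about what Google actually requested.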
Combine, and good luck
Once you've collected URLs from all these sources, it's time to combine them. If your site is small enough, use Excel or Google Sheets; for larger datasets, use a tool like a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
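If you take the notebook route, the combine-and-dedupe step only takes a few lines. Here's a minimal pandas sketch; the input file names are placeholders for whatever exports you collected above, each assumed to be a one-column file of URLs.

```python
# Combine URL exports, normalize formatting, and deduplicate.
import pandas as pd

sources = ["wayback.csv", "gsc_pages.csv", "ga4_paths.csv", "log_paths.csv"]

frames = [pd.read_csv(path, header=None, names=["url"]) for path in sources]
urls = pd.concat(frames, ignore_index=True)

# Consistent formatting before deduplication: strip whitespace and
# trailing slashes so near-duplicates collapse into one row.
urls["url"] = urls["url"].astype(str).str.strip().str.rstrip("/")

urls = urls.drop_duplicates().sort_values("url")
urls.to_csv("all_urls.csv", index=False)
print(len(urls), "unique URLs")
```

Depending on your sources, you may also want to normalize protocol (http vs. https) and lowercase hostnames before deduplicating.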
And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!