Extract & mirror cache URLs from Google search pages
Saved search pages go in, cache links come out.
It’s handy for mirroring a dead site: use site:domain.com as the search query, save the result pages, and feed them in.
Notes: Without rate limiting I was blocked after request #169. However, there were no issues when using the limits below. The wait time can probably go much lower though. The empty user-agent is required for wget to work.
pcregrep -hoM 'http://webcache\.googleusercontent\.com/search\?q=cache:(.+?)(?=[+](.+)"(.*)>Cached)' searc*.html > cachelist.txt
wget --wait 15s --random-wait --user-agent="" -i cachelist.txt
To match the junk part of the filename (search?q=cache:4Ip_t8yQ-rL2:) use this:
search\?q=cache:............:
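That pattern can be used to clean up the downloaded file names. Here is a minimal sketch, assuming wget saved the files in the current directory under names that start with search?q=cache: followed by a 12-character ID and a colon. The function names strip_cache_prefix and rename_cached_files are made up for illustration, and I have not checked what wget does with the rest of the name.

```shell
# Strip the "search?q=cache:<12 chars>:" prefix from a single file name.
# Names without that prefix are printed unchanged.
strip_cache_prefix() {
    printf '%s\n' "$1" | sed -E 's/^search\?q=cache:.{12}://'
}

# Rename every file in the current directory that carries the prefix.
# Assumption: the files were saved here by the wget command above.
rename_cached_files() {
    for f in 'search?q=cache:'*; do
        [ -e "$f" ] || continue          # glob matched nothing
        new=$(strip_cache_prefix "$f")
        # Skip empty or unchanged names, and never overwrite an existing file.
        if [ -n "$new" ] && [ "$new" != "$f" ] && [ ! -e "$new" ]; then
            mv -- "$f" "$new"
        fi
    done
}
```

Source the functions and run rename_cached_files in the download directory. Try it on a copy first, since it renames files in place.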
edit: updated for new search output & lowered wait to 15s