I recently worked on an issue for a popular online publisher whose content was not being indexed as frequently as it should have been. They suspected a problem with their crawling or indexing. On closer inspection, it appeared that google.com was dropping a high volume of connections coming from their web server.
I noticed a pattern in their system logs that looked like the following:
Connection timed out: google.com
Connection timed out: google.com
Connection timed out: google.com
This was happening constantly, over both HTTP and HTTPS, which was strange. After some more digging, it turned out that the outbound connections came from processes responsible for reCAPTCHA verification requests (HTTPS) and automated sitemap submission (HTTP).
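We couldn’t see inside the application, but a server-side reCAPTCHA check is typically just an outbound HTTPS POST to Google. Here’s a minimal Python sketch of what that call probably looks like (the siteverify endpoint is the one Google documents for reCAPTCHA v2; the secret key is a placeholder, and the commercial system may well use an older API):

import requests

RECAPTCHA_SECRET = "your-secret-key"  # placeholder

def verify_recaptcha(response_token, remote_ip):
    # The outbound HTTPS call that was timing out in the logs above.
    r = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={
            "secret": RECAPTCHA_SECRET,
            "response": response_token,
            "remoteip": remote_ip,
        },
        timeout=5,  # without a timeout, a dropped connection ties up a worker
    )
    return r.json().get("success", False)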
So it appears that if google.com starts dropping a server’s connections because of one service, other services can be affected as well. The application in question is a commercial system (so we’re stuck with how it works), but I could see this happening on other platforms.
The part worth sharing is that I came across the following page, which explains how to find the IP addresses used by reCAPTCHA; it might come in handy for anyone needing this information:
If you have firewall ACLs, you must allow access to all Google IP addresses. We strongly recommend that you either a) allow outbound access to all IPs on port 80 or b) use a proxy server to do access control based on host name.
The reCAPTCHA servers can be located on any IP address owned by Google. While we can not provide official support for IP Address-based ACLs, Google’s public IP space can be found by issuing the following command from a Linux/Unix box:
dig -t TXT _netblocks.google.com
The result right now is:
ip4:216.239.32.0/19 ip4:64.233.160.0/19 ip4:66.249.80.0/20 ip4:72.14.192.0/18 ip4:209.85.128.0/17 ip4:66.102.0.0/20 ip4:74.125.0.0/16 ip4:64.18.0.0/20 ip4:207.126.144.0/20 ip4:173.194.0.0/16
but you should periodically check this, as these blocks may occasionally change.
Source: https://code.google.com/archive/p/recaptcha/wikis/FirewallsAndRecaptcha.wiki
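Since the blocks do change, it’s worth scripting the lookup rather than hard-coding the list. Here’s a small Python sketch that wraps the same dig command, so you can run it periodically (from cron, say) and diff the output against your firewall ACLs:

import subprocess

def google_netblocks():
    # Run the dig command from the wiki page and pull out the ip4: CIDR blocks.
    out = subprocess.run(
        ["dig", "+short", "-t", "TXT", "_netblocks.google.com"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [tok[len("ip4:"):] for tok in out.replace('"', "").split()
            if tok.startswith("ip4:")]

if __name__ == "__main__":
    for cidr in google_netblocks():
        print(cidr)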
As it turned out, the server in question was constantly being hit by comment spam, so my guess is that the overuse of reCAPTCHA (or the high volume of failing reCAPTCHA verifications) caused the connections to start getting dropped. Adding some basic anti-spam features (preventing the use of reCAPTCHA for obvious spammers) and tweaking some of the timings resolved the issue. The connections stopped timing out and things started flowing straight away.
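I can’t share the commercial system’s internals, but the general idea is to run cheap local checks first, so that an obvious spammer never triggers an outbound call to google.com at all. A rough Python sketch (the honeypot field name and rate limit are made up for illustration, and verify_recaptcha is the sketch from earlier):

import time
from collections import defaultdict

_recent = defaultdict(list)  # per-IP submission timestamps (illustrative only)

def obviously_spam(form, remote_ip, max_per_minute=5):
    # Honeypot: real users never fill in a field hidden by CSS; bots often do.
    if form.get("website_hp"):
        return True
    # Rate limit: a burst of comments from one IP is almost certainly a bot.
    now = time.time()
    _recent[remote_ip] = [t for t in _recent[remote_ip] if now - t < 60]
    _recent[remote_ip].append(now)
    return len(_recent[remote_ip]) > max_per_minute

def handle_comment(form, remote_ip):
    if obviously_spam(form, remote_ip):
        return "rejected"  # no reCAPTCHA round-trip to google.com at all
    if not verify_recaptcha(form.get("g-recaptcha-response"), remote_ip):
        return "rejected"
    return "accepted"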
The automated sitemap submission is now working, and fresh content is indexed almost immediately.
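For reference, automated sitemap submission is the simplest of these calls: a single HTTP GET against Google’s ping endpoint. A minimal sketch (the sitemap URL is a placeholder):

import urllib.parse
import urllib.request

def ping_google_sitemap(sitemap_url):
    # Tell Google the sitemap has changed; this is the outbound HTTP call
    # that was also being dropped.
    ping = "http://www.google.com/ping?sitemap=" + urllib.parse.quote(sitemap_url, safe="")
    with urllib.request.urlopen(ping, timeout=5) as resp:
        return resp.status == 200

# e.g. ping_google_sitemap("https://example.com/sitemap.xml")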
It’s a good reminder of how important crawling and indexing are for online publishers. It’s often taken for granted that it all “just works”, but that’s not always the case!