

Re: [LUG] default ubuntu mirrors

 

On 18/04/2020 16:06, Michael Everitt wrote:

Hmm, but how is that affecting routing? MAC is physical layer, that should
theoretically make (virtually!) no difference? What problem is that causing
higher up the stack??

I'll be honest chief, I wasted about 6 hours on this last night before - for the first time in years - giving up before I went crazy. Sometimes you just have to know when to move on to more pressing issues. Fortunately this is one of those things that got caught early in staging, so nothing in production is affected - I am going to have to fix it before long though, just got to smash the job queue a bit first.

More eyes on this is going to be the best way to figure out what's wrong, and I've already got a few other people looking it over (not all the configs are mine, so it's entirely possible it's nothing I've personally done). So I'll outline the issue as briefly as I can - it's replicated easily enough on my two main home workstations, which is also where I first noticed the problem. I'll omit a lot of stuff for brevity.

The stuff:

Ubuntu 19.10 + 20.04 host machines
Both have LACP ethernet bonds across multiple interfaces + separate management LAN, etc (rough sketch after the list)
Enterprise switch has matching LAG group + VLANs tagged through
VirtualBox hypervisor on both machines, lots of different VMs (all are bridged to the LACP bond0 but can be switched to other interfaces or tunnels)
Both machines are multihomed with multiple gateways/routes/DNS available
Also lots of egress SSH and VPN tunnels in use on both
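For a feel of the host-side layout, it's something along these lines in netplan terms (interface names, addresses and VLAN IDs here are made up for illustration, not my literal config):

  network:
    version: 2
    renderer: networkd
    ethernets:
      enp3s0: {}
      enp4s0: {}
    bonds:
      bond0:
        interfaces: [enp3s0, enp4s0]
        parameters:
          mode: 802.3ad              # LACP, matching the LAG on the switch
          transmit-hash-policy: layer3+4
    vlans:
      vlan10:                        # production VLAN the VMs bridge onto
        id: 10
        link: bond0
        addresses: [192.168.10.2/24]
      vlan99:                        # management/admin VLAN
        id: 99
        link: bond0
        addresses: [192.168.99.2/24]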

The symptoms:

Fire up a random VM on either. _Everything_ network related works fine in the VMs: internet browsing, wget files, git clone, etc. What does NOT work is the built-in package management tooling on _some_ distros. That's literally the only thing that doesn't work - DNS timeouts everywhere, but only for some of the configured repos. Seen so far on Arch, Mint, Ubuntu and Debian. Slackware, Fedora, RHEL and Windows are 'immune' (as is, specifically, Ubuntu 20.04); haven't tested Macs or BSD or all my Linux VMs yet (too many of them just for a start).
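To make that concrete, the pattern inside an affected VM looks roughly like this (generic commands, not a transcript):

  wget https://example.com/               # fine
  git clone https://github.com/foo/bar    # fine (placeholder repo)
  sudo apt update                         # hangs with DNS timeouts against some of the
                                          # configured mirrors (pacman -Syu on the Arch VM)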

Digging:

Switching between my egress methods whilst keeping the VMs in bridged mode has no discernible effect, despite this changing their VLAN ID, network/subnet, gateway, DNS and route. Sometimes a slightly different apt mirror times out, but the general result is still failure. _Some_ repos still work instantly every time - including all PPAs and random little personal repos (including my local ones). Flipping _any_ affected VM from bridged to NAT mode, however, instantly fixes them.
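(If anyone wants to reproduce the flip, it's just the standard VirtualBox switch, done with the VM powered off - the VM name is a placeholder:

  VBoxManage modifyvm "test-vm" --nic1 bridged --bridgeadapter1 bond0   # failing case
  VBoxManage modifyvm "test-vm" --nic1 nat                              # instantly 'fixed'
)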

Smoking gun?:

On my system any traffic not specifically guided out otherwise "falls through" into my admin VLAN, which is automatically dropped into a permanent Wireguard tunnel to my VPN provider. So any "naive" traffic - which includes the NAT'd VMs - gets routed out through my VPN provider rather than my local ISP connection; this obviously comes with its own separate gateway and physically emerges elsewhere (but still in the UK). Any affected VM works perfectly egressing like this! At least I can rule out that all these mirrors are somehow down - I didn't think that could be possible after all.
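Conceptually the fall-through is nothing clever - just policy routing into the tunnel, roughly this shape (table number and subnet are illustrative, not my exact rules):

  # admin-VLAN traffic, and anything else that lands there by default,
  # gets its default route via the permanent Wireguard interface:
  ip route add default dev wg0 table 100
  ip rule add from 192.168.99.0/24 table 100 priority 200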

DNS Trouble:

I feel the bug is here specifically, but I may be wrong. Generally, every single network segment and VLAN is handled by my own local DNS server. There is some complexity in how it handles its upstream DNS, which varies depending on the VLAN feeding it requests (split-horizon DNS, obviously, so internal resources are available to everything but without necessarily leaking DNS requests through my ISP for private/work systems). Obviously I can change the VM client's DNS at will - and have - but this doesn't change anything!
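By split horizon I just mean the usual per-VLAN view/forwarder arrangement - in BIND-ish terms something shaped like this (subnets, zone names and forwarder addresses are examples, and it isn't necessarily BIND doing the job here):

  view "work" {
      match-clients { 192.168.20.0/24; };
      zone "internal.example" { type master; file "internal.zone"; };
      forwarders { 10.8.0.1; };       // upstream via the VPN, not the ISP
  };
  view "everything-else" {
      match-clients { any; };
      zone "internal.example" { type master; file "internal.zone"; };
      forwarders { 192.0.2.1; };      // ISP-side upstream
  };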

Sanity checking:

I've remoted into some client networks and double checked similar setups in testing there - same results. But then they are nearly all pretty geographically local. Speaking of which...

Tentative diagnosis:

I *think* what is happening is something like this: the load balancers at the other end are messing me up somehow. It looks like DNS but clearly isn't that simple. Ubuntu/Debian/Mint all use geographically distributed load balancers on their big mirrors, which not only pick a round-robin DNS mirror to hand back to the client but then serve the content from a specific box situated "close by" in network-cost terms. THAT is the bit that somehow triggers the... fault? Bug? I'm not even quite sure how to class it. I'm not ruling out a configuration misstep either - it could even be a freakish combination of all of them.

If I leave my VMs where they should be - on the production LACP bond and VLAN, behind my DNS server and routed through my normal gateway via my ISP - I can make them work; it's just a pain, and obviously it shouldn't need any adjustment whatsoever. Some VMs continue working flawlessly on default settings, pulling from the _same damn mirrors_ (Ubuntu 20.04 is the stand-out weird one here). On affected Ubuntu/Debian/Mint/Arch VMs, if I manually edit their repositories away from the top-level load-balanced mirrors and choose specific servers, they immediately start working (example below).
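The workaround really is as blunt as pointing the sources at one concrete server, e.g. on an Ubuntu VM (swap in gb.archive.ubuntu.com or whatever your default actually is; the replacement hostname below is a placeholder for whichever specific mirror you pick):

  sudo sed -i 's|http://archive.ubuntu.com/ubuntu|http://some.specific.mirror/ubuntu|g' /etc/apt/sources.list
  sudo apt update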

So to be clear, there doesn't seem to be _anything_ wrong with my setup, even though I've changed a lot in the last week or so (the enterprise switches are new, as are the bonds and some VLANs). I've also lived and breathed this stuff for years - I can configure all of it in my sleep. I also log _everything_, and the logs say nothing.

The big fat glitch seems to be how certain open source top-level mirrors firstly feed back a different round-robin DNS answer to the client, guiding it towards, say, Bytemark or mirrors.ac.uk. Then, depending on the IP of that client, a specific server instance is chosen via geolocation. All of my normal VM traffic egresses through my ISP, so it can be geolocated to down here in the South West pretty accurately. That hand-off seems to be what is breaking in certain circumstances, for certain VMs. If I NAT a VM out it goes through the VPN tunnel and emerges in a London datacenter somewhere - and that gets perfect results. That seems to be the core of the problem. Geolocated by a top-level load-balancing FLOSS mirror to the South West, as a VM bridged through a LACP VLAN behind a local DNS instance? FAIL. Same VM geolocated by the same mirrors to a London datacenter? PASS.
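Which also suggests the quickest comparison: resolve and poke the same mirror name from both vantage points and see whether the answers and results differ (hostname here is just the obvious example):

  dig +short archive.ubuntu.com
  curl -m 10 -I http://archive.ubuntu.com/ubuntu/dists/focal/Release

run once from a bridged VM (South West egress) and once from a NAT'd one (VPN/London egress).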

Is it the package manager's logic that is somehow wrong? After all, any affected VM can still access Google/Amazon/everything else on the entire internet just fine, and they're most definitely geographically load-balanced systems too! It's just the damned package managers. I'd nearly narrowed it down to blaming some weird Debian-specific glitch (note the trashed VMs are all Debian flavours whilst Windows and Red Hat-derivative Linux continue without issues... until the Arch VM turned up broken as well). I really wanted to blame the newest stuff first - LACP specifically. But literally everything else is working perfectly. And the workaround VMs NAT'd out over the VPN that do resume normal behaviour? That traffic is passing over the very same LACP bond. Doh!
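For anyone else poking at it, the simplest way I know to make apt show its working on an affected VM is its acquire debug switch, e.g.:

  sudo apt -o Debug::Acquire::http=true update

which at least shows which host it's actually talking to (or failing to).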

Honestly mindblown by this one ¯\_(ツ)_/¯

It's on pause for now either way - please bear in mind that I've actually left out vast amounts of technical detail here as well; it would take months to explain the whole lot fully! My intuition is that something like this *has* to be my fault really. There's an unforeseen glitch or loop or rogue cache somewhere, and eventually I'll find it and curse myself.

On with some easier jobs for now. Like solving world hunger or proving P=NP.

--
The Mailing List for the Devon & Cornwall LUG
https://mailman.dcglug.org.uk/listinfo/list
FAQ: http://www.dcglug.org.uk/listfaq