D&C GLug - Home Page

[ Date Index ] [ Thread Index ] [ <= Previous by date / thread ] [ Next by date / thread => ]

Re: [LUG] default ubuntu mirrors

 

On 18/04/20 18:12, comrade meowski wrote:
> On 18/04/2020 16:06, Michael Everitt wrote:
>
>> Hmm, but how is that affecting routing . MAC is physical layer, that should
>> theoretically make (virtually!) no difference? What problem is that causing
>> higher up the stack??
>
> I'll be honest chief, I wasted about 6 hours on this last night before -
> for the first time in years - giving up before I went crazy. Sometimes
> you just have to know when to move on to more pressing issues.
> Fortunately this is one of those things that got caught early in staging
> and so nothing in production is effected - I am going to have to fix it
> before long though, just got to smash the job queue a bit first.
>
> More eyes on is going to be the best way to figure out what's wrong and
> I've already got a few other people looking it over (not all the configs
> are mine so it's entirely possible it's nothing I've personally done) so
> I'll outline the issue as briefly as I can as replicated easily enough on
> my two main home workstations which is where I also first noticed the
> problem. I'll omit a lot of stuff for brevity.
>
> The stuff:
>
> Ubuntu 19.10 + 20.04 host machines
> Both have LACP ethernet bonds across multiple interfaces + separate
> management LAN, etc
> Enterprise switch has matching LAG group + VLANs tagged through
> Virtualbox hypervisor on both machines, lots of different VMs (all are
> bridged to the LACP bond0 but can be switched to other interfaces or
> tunnels)
> Both machines are multihomed with multiple gateways/routes/DNS available
> Also lots of egress SSH and VPN tunnels in use on both
>
> The symptoms:
>
> Fire up random VM on either. _Everything_ network related works fine in
> VMs: internet browsing, wget files, git clone, etc. What does NOT work is
> the built in operating system package management tools on _some_ distros.
> That's literally the only thing that doesn't work - DNS timeouts
> everywhere but only for some of the configured repos. Seen so far on
> Arch, Mint, Ubuntu and Debian. Slack, Fedora, RHEL and Ubuntu 20.04
> specifically and Windows are 'immune', haven't tested Macs or BSD or all
> my Linux VMs yet (too many of them just for a start).
>
> Digging:
>
> Switching between my egress methods whilst keeping the VMs in bridged
> mode has almost no discernible effect that can be presumed statistically
> significant despite this changing their VLAN ID, network/subnet, gateway,
> DNS and route. Sometimes a slightly different apt mirror times out but
> still results in general failure. _Some_ repos still work instantly every
> time - including all PPAs and random little personal repos (including my
> local ones). Flipping _any_ effected VM from bridged to NAT mode however
> instantly fixes them.
>
> Smoking gun?:
>
> On my system any traffic not specifically guided out otherwise "falls
> through" into my admin VLAN which is automatically dropped into a
> permanent Wireguard tunnel to my VPN provider. So any "naive" traffic,
> which the NAT'd VMs are included in, gets routed out through my VPN
> provider and not my local ISP connection - this comes obviously enough
> with it's own separate gateway and physically emerges elsewhere (but
> still in the UK). Any effected VM works perfectly egressing like this! At
> least I can rule out that all these mirrors are somehow down, I didn't
> think that could be possible after all.
>
> DNS Trouble:
>
> I feel the bug is here specifically, but may be wrong. Generally, every
> single network segment and VLAN is handled specifically by my own local
> DNS server. There is some complexity handling with it's upstream DNS
> which varies depending on the VLAN feeding it requests (split horizon DNS
> obviously so internal resources are available to everything but without
> necessarily leaking DNS requests through my ISP for private/work
> systems). Obviously I can change the VM client's DNS at will - and have -
> but this doesn't change anything!
>
> Sanity checking:
>
> I've remoted into some client networks and double checked similar setups
> in testing there - same results. But then they are nearly all pretty
> geographically local. Speaking of which...
>
> Tentative diagnosis:
>
> I *think* what is happening is something like this: the load balancers at
> the other end are messing me up somehow. It looks like DNS but clearly
> isn't that simple. Ubuntu/Debian/Mint all use geographically distributed
> load balancers on their big mirrors that not only pick a round robin DNS
> mirror to send back to the client but then that content is served from a
> specific box situated "close by" in network cost terms. THAT is the bit
> that somehow triggers the... fault? Bug? I'm not even quite sure how to
> class it. I'm not ruling out a configuration misstep either - it could
> even be a freakish combination of all of them.
>
> If I leave my VMs where they should be: on the production LACP bond and
> VLAN, behind my DNS server and routed through my normal gateway via my
> ISP I can make them work, it's just a pain and obviously it shouldn't
> need any adjustment whatsoever. Some VMs continue working flawlessly on
> default settings pulling from the _same damn mirrors_ (Ubuntu 20.04 is
> the stand out weird one here). On effected Ubuntu/Debian/Mint/Arch VMs if
> I manually edit their repositories away from top level mirrors with load
> balancers and choose specific servers, they immediately start working.
>
> So to be clear, there doesn't seem to be _anything_ wrong with my setup,
> even though I've changed a lot of stuff in the last week or so (the
> enterprise switches are new as are the bonds and some VLANs). I've also
> lived and breathed this stuff for years - I know how to configure all
> this stuff in my sleep. I also log _everything_ and the logs say nothing.
>
> The big fat glitch seems to be how certain open source top level mirrors
> firstly feed back a different round robin DNS to the client, guiding it
> towards say bytemark or mirrors.ac.uk. Then,_depending on the IP of that
> client, a specific server instance is chosen via geolocation. All of my
> normal VM traffic egresses out through my ISP so can be geolocated to
> down here in the South West pretty accurately. That hand off seems to be
> what is breaking in certain circumstances, for certain VMs. If I NAT my
> VM out it goes through the VPN tunnel and emerges in a London datacenter
> somewhere - and that gets perfect results. That seems to be the core of
> the problem. Geolocated by a top level load balancing FLOSS mirror to the
> South West and are a VM bridged through a LACP VLAN behind a local DNS
> instance? FAIL. Same VM geolocated by same mirrors to a London
> datacenter? PASS.
>
> Is it the package managers logic that is somehow wrong? After all, any
> effected VM can still access Google/Amazon/everything else on the entire
> internet just fine and they're most definitely geographically load
> balanced systems too! It's just the damned package managers. I'd nearly
> narrowed it down to blaming some weird Debian specific glitch (note the
> trashed VMs are all Debian flavours whilst Windows and Redhat derivative
> Linux continue without issues... until the Arch VM turned up broken as
> well). I really wanted to blame the newest stuff first - LACP
> specifically. But literally everything else is working perfectly. And the
> workaround VMs NAT'd out over the VPN that do resume normal behaviour?
> That traffic is passing over the very same LACP bond. Doh!
>
> Honestly mindblown by this one ¯\_(ツ)_/¯
>
> It's on pause for now either way - please bear in mind that I've actually
> left out vast amounts of technical detail here as well, it would take
> months to explain the whole lot fully! My intuition is that something
> like this *has* to be my fault really. There's an unforeseen glitch or
> loop or rogue cache somewhere and eventually I'll find it and curse myself.
>
> On with some easier jobs for now. Like solving world hunger or proving P=NP.
>
Yeah, I can see all the signs point to something automatic that's supposed
to be "smart" being "dumb" somehow .. and this is where the KISS principle
always wins out, as I'm sure you know. Each layer of complexity *will* work
*flawlessly* in isolation, but as you add the layers up, you add potential
for a new "edge case" to emerge at each new 'stage'. I don't need to tell
you (if even I could) how to 'drill down' as you seem to have a good idea
where you're going (where many of us would struggle to know where to START)
.. so all I can do is wish you 'Good luck' on your investigations.... :p

What happens, out of interest, if you dump all the geo- and round-robin
crap, and hard-code in some known-good stuff? Is that gonna help/hinder any
stable configs?

Attachment: signature.asc
Description: OpenPGP digital signature

-- 
The Mailing List for the Devon & Cornwall LUG
https://mailman.dcglug.org.uk/listinfo/list
FAQ: http://www.dcglug.org.uk/listfaq