Per-zone DNS resolution for homelabs

One of the problems I’ve had with my homelab is the fact that logging into my corporate VPN every day changes my DNS servers, so I cannot resolve homelab DNS. For the past 4+ years I’ve gotten past this using hostfile entries, which is quite annoying when you’re spinning up workloads dynamically.

I posted this question the VMware homelab Slack channel and Steve Tilkens came back with /private/etc/resolver for the Mac. He wrote:

Just create a file in that directory named whatever your lab domain name is (i.e. – “lab.local”) and the contents should contain the following:
nameserver 192.168.0.1
nameserver 192.168.0.2

This didn’t help me on Windows, but immediately helped another employee.

But then I started Googling around for things like ‘/private/etc/resolver for Windows’ and somewhere I found somebody suggesting the Windows NRPT. The first hit on my search was a Scott Lowe blog talking about using the resolver trick on a Mac so if you want a detailed explanation of the Mac stuff, check it out.

Anyway it took me like 10 seconds to open up the Local Group Policy editor (gpedit.msc) on my laptop and configure my laptop to resolve my AD domain via my homelab domain controllers. Years of searching over!

VMC on AWS – VPN, DNS Zones, TTLs

My customer reported an issue with DNS zones in VMC, so I needed it up in the lab to check the behavior. The DNS service in VMC allows you to specify DNS resolvers to forward requests. By default DNS is directed to 8.8.8.8 and 8.8.4.4. If you’re using Active Directory, you generally will set the forwarders to your domain controllers. But some customers need more granular control over DNS forwarding. For example – you could set the default forwarders to domain controllers for vmware.com, but maybe you just acquired Pivotal, and their domain controllers are at pivotal.com. DNS zones allow you direct any request for *.pivotal.com to a different set of DNS servers.

Step 1

First, I needed an IPSEC VPN from my homelab into VMC. I run Ubiquiti gear at home. I decided to go with a policy-based VPN because my team’s VMC lab has many different subnets with lots of overlap on my home network. I went to the Networking & Security tab, Overview screen which gave me the VPN public IP, as well as my appliance subnet. All of the management components, including the DNS resolvers, sit in the appliance subnet. So I will need those available across the VPN. Not shown here is another subnet 10.47.159.0/24, which contains jump hosts for our VMC lab.

I set up the VPN in the Networks section of the UniFi controller – you add a VPN network just like you add any other wired or wireless network. I add the appliance subnet 10.46.192.0/18 and my jump host subnet 10.47.159.0/24. Peer IP is the VPN public IP, and local WAN IP is the public IP at home.

I could not get SHA2 working and never figured out why. Since this was just a temporary lab scenario, I went with SHA1

On the VMC side I created a policy based VPN. I selected the Public Local IP address, which matches the 34.x.x.x VMC VPN IP shown above. The remote Public IP is my public IP at home. Remote networks – 192.168.75.0/24 which contains my laptop, and 192.168.203.0/24 which contains my domain controllers. Then for local networks I added the Appliance subnet 10.46.192.0/18 and 10.47.159.0/24. The show up as their friendly name in the UI.

The VPN comes up.

Now I need to open an inbound firewall rule on the management gateway to allow my on-prem subnets to communicate with vCenter. I populate the Sources object with subnet 192.168.75.0/24 so my laptop can hit vCenter. I also set up a reverse rule (not shown) outbound from vCenter to that same group. This isn’t specifically necessary to get DNS zones to work since we can only do that for the compute gateway – it’s the compute VMs that need the DNS zone. But I wanted to reach vCenter over the VPN.

I create similar rules on the compute gateway to allow communication between my on-prem subnets and anything behind the compute gateway – best practice would be to lock down specific subnets and ports.

I try to ping vCenter 10.46.224.4 from my laptop on 192.168.75.0/24 and it fails. I run a traceroute and I see my first hop is my VPN connection into VMware corporate. I run ‘route print’ on my laptop and see the entire 10.0.0.0/8 is routed to the VPN connection.

This means I will either have to disconnect from the corporate VPN to hit 10.x IP addresses in VMC, or I have to route around the VPN with a static route

At an elevated command prompt, I run these commands

route add 10.46.192.0 mask 255.255.192.0 192.168.75.1 metric 1 -p
route add 10.47.159.0 mask 255.255.255.0 192.168.75.1 metric 1 -p

This inserts two routes into my laptop’s route table. The -p means persistent, so the route will persist across reboots.

Now when I run a route print I can see that for my VMC appliance subnet and jump host subnet.

I can now ping vCenter by private IP from my laptop. I can also ping servers in the jump host subnet.

Now to create a DNS zone – I point to one of my domain controllers on-prem – in production you would of course point to multiple domain controllers.

I flip back to DNS services and edit the Compute Gateway forwarder. The existing DNS forwarders point to our own internal lab domain, and we don’t want to break that communication. What we do want is to have queries destined for my homelab AD redirected to my homelab DNS servers. We add the zone to the FQDN Zones box and click save.

Now we run a test – you can use nslookup, but I downloaded the BIND tools so I can use dig on Windows.

First dig against my homelab domain controller

dig @192.168.203.10 vmctest88.ad.patrickkremer.com

Then against the VMC DNS server 10.46.192.12

dig @10.46.192.12 vmctest88.ad.patrickkremer.com

The correct record appears. You can see the TTL next to the DNS entry at 60 seconds – the VMC DNS server will cache the entry for the TTL that I have configured on-prem. If I dig again, you can see the TTL counting down toward 0.

I do another lookup after the remaining 21 seconds expire and you can see a fresh record was pulled with a new TTL of 60 seconds.

Let’s make a change. I update vmctest88 to point to 192.168.203.88 instead of .188, and I update the TTL to 1 hour.

On-prem results:

VMC results:

This will be cached for 1 hour in VMC.

I switch it back to .188 with a TTL of 60 seconds on-prem, which is reflected instantly

But in VMC, the query still returns the wrong .88 IP, with the TTL timer counting down from 3600 seconds (1 hour)

My customer had the same caching problem problem, except their cached TTL was 3 days and we couldn’t wait for it to fix itself. We needed to clear the DNS resolver cache. In order to do that, we go to the API. A big thank you to my coworker Matty Courtney for helping me get this part working.

You could, of course, do this programmatically. But if consuming APIs in Python isn’t your thing, you can do it from the UI. Go to the Developer Center in the VMC console, then API explorer. Pick your Org and SDDC from the dropdowns.

Click on the NSX VMC Policy API

In the NSX VMC Policy API, find Policy, Networking IP, Management, DNS, Forwarder, then this POST operation on the tier-1 DNS forwarder

Fill out the parameter values:
tier-1-id: cgw
action: clear_cache
enforcement_point: /infra/sites/default/enforcement-points/vmc-enforcementpoint

Click Execute

We see Status: 200, OK – success on the clear cache operation. We do another dig against the VMC DNS server – even though we were still within the old 1 hour cache period, the cache has been cleared. The VMC DNS server pulls the latest record from my homelab, we see the correct .188 IP with a new TTL of 60 seconds.

Windows 2008 primary IP – DNS registration

Update 2011-07-16
We filed a request to have the 2008 R2 solution listed below backported to Windows 2008 SP2, but the request came back denied yesterday. The escalation engineer’s comment: “Considering that Windows Vista and 2008 SP1 hit End of Life this month, the resources were not available for a fix.” That will close the book on this issue – if you need the SkipAsSource functionality in Windows 2008, you have to use 2008 R2.

Update 2011-04-15
Microsoft support advises that there are no plans to bring the ‘netsh int ipv4 show ipaddresses level=verbose’ command to Windows 2008. It will only be available for 2008 R2.

Update 2011-04-10
I can confirm that KB2386184 does work on Windows 2008 R2, the netsh int ip command with the SkipAsSource flag works, and ‘netsh int ipv4 show ipaddresses level=verbose’ does display the flags on all IPs on the box. The issue I ran into was because of how the Hotfix downloader works. We don’t download patches directly from servers, so I was downloading from my workstation which is Windows 7 32-bit. I didn’t notice at the time, but it pushed a 32-bit version of the hotfix down because that’s what it detected on my workstation. When I tried installing the 64-bit version, it (not suprprisingly) worked.

I have our MS rep looking into why the same show ipaddress command doesn’t exist for 2008 SP2, KB975808.

Update 2011-04-06
My change request was closed. MS says they addressed the 2 biggest concerns – setting the SkipAsSource flag and being able to view the flag in the hotfix available at KB2386184. They opened a new ticket to address why I am unable to successfully install the hotfix – when I attempt the installation on 2008 R2, I get an error saying that the hotfix doesn’t apply to my system.

Update 2011-04-02
KB2386184 references a way to at least dump the existing SkipAsSource flag on Windows 7 and Windows 2008 R2, but I was unable to install the hotfix, it said the hotfix didnt apply to my 2008 R2 server. I have a meeting next week with a product manager, hopefully we can come to some resolution.

Update 2010-02-08
Heard back from Microsoft and option “a” below is being considered for backporting into Windows 2008. All of the features listed below are part of Windows 8.

Update 2009-12-17

We have filled out a design change request for Microsoft – we’re not holding our breath on implementation though.

a) There doesn’t appear to be a command to list out the IPs on the box and determine which ones are flagged SkipAsSource. When we have a server that’s reporting intermittent failures to connect to an SMTP server, it is not possible to figure out which IP is the offending IP.
b) The SkipAsSource flag should be in the GUI instead of a command line netsh.
c) You can only set the flag on an IP add, not on an existing IP.
d) We should be able to set the default behavior so new IPs that are added are automatically flagged SkipAsSource.

Update 2009-11-16
Microsoft advised the instructions in their hotfix were not 100% accurate, the syntax of the netsh command was incorrect. It has since been updated. I did get the hotfix to work with one caveat: the name of the interface can not start with a digit – the command fails unless the interface name starts with a letter i.e. interface 10.0.0.x won’t work, but if you name it i10.0.0.x it will work.

It’s not a great solution because 1) You can only use the netsh command to add an IP with the SkipAsSource flag, there’s no way to set it after it’s on the box, you have to delete and add it again 2) You have no command to display which IPs have the SkipAsSourceFlag set, 3) There’s no GUI option, 4) There’s no option for SkipAsSource to be the default setting for all IPs added to the box. But it’s the best we have for now.

2009-11-02
We’ve run into issues with Windows 2008. The first is that every IP on the box gets auto-registered in DNS. This causes issues with some of our monitoring software being unable to handle multiple IPs returned in the DNS query. The second is that there is no concept of “primary” IP on the box anymore. In Windows 2003, the first IP on the box is the primary IP and it’s the default IP for all outbound communication. In what appears to be a random fashion, 2008 picks different IPs for outbound traffic. This is causing problems as we’re having to add firewall rules for every new IP that we add to a box – we never know which IP is going to be used to hit a SQL box or SMTP server.

I found this MS article that explains how an IP is selected. This does seem to explain the behavior; however, it isn’t consistent. I also found a hotfix KB975808 that claims to fix the DNS registration and primary IP issue, but was unable to get it to work. The fix is a “SkipAsSource” flag addition to the netsh command, adding an IP with the flag enabled will force windows to skip the IP for DNS registration and outbound IP selection.

I opened a ticket with Microsoft and we’ll see what happens.