VMware Event Broker Appliance – Part XII – Advanced Function Troubleshooting

In Part XI, we discussed changing configuration options in the OVA UI. In this post, we will discuss advanced function troubleshooting.

A VEBA bug report came in saying “Hostmaint-alarms fails on hosts where the cluster name contains a hyphen with a preceding space”. For example, if you have a cluster named CL-2, the function works. But if your cluster is named CL - 2, the function fails. The bug report states the error message is “/root/function/script.ps1 : Cannot process argument because the value of argument “name” is not valid. Change the value of the “name” argument and run the operation again.”

One way to test this is to actually create a cluster with the name “CL - 2” and enter/exit maintenance mode. This grows tiresome when you’re actively debugging a problem, because you don’t want to be continually putting a host in and out of maintenance mode.

Another way to do this is to first use the echo function to intercept the payload from the event router in VEBA. The echo function will drop the JSON into the pod’s logfile.

First, deploy the echo function to your VEBA. You will need to change the gateway to match your VEBA. If you open up the current stack.yml file for the host maintenance function, you find the topics are EnteredMaintenanceModeEvent and ExitMaintenanceModeEvent. These are the topics we want in our echo function.

Deploy the echo function with:

faas-cli deploy -f stack.yml --tls-no-verify

Look at Part IV if you need a refresher on deploying functions.

Once the function is deployed, we need to tail (a.k.a. follow in Kubernetes) the echo function logfile. Look at Part VIII for details on tailing logfiles in Kubernetes.

kubectl get pods -A
kubectl logs -n openfaas-fn veba-echo-5dcf46b9c-k7v2s --follow

Then do a maintenance mode operation on a host. In this example, I take a host out of maintenance mode. The logfile contains JSON that I can easily copy and paste into VS Code.

2020/09/09 23:29:36 Forking fprocess.
2020/09/09 23:29:36 Query
2020/09/09 23:29:36 Path  /
2020/09/09 23:29:36 Duration: 0.054708 seconds
{"id":"3dd3d87f-9d10-408f-8a0f-41bb74faad08","source":"https://vc01.ad.patrickkremer.com/sdk","specversion":"1.0","type":"com.vmware.event.router/event","subject":"ExitMaintenanceModeEvent","time":"2020-09-09T23:29:34.941977022Z","data":{"Key":334033,"ChainId":334030,"CreatedTime":"2020-09-09T23:29:34.911981Z","UserName":"VSPHERE.LOCAL\\Administrator","Datacenter":{"Name":"HomeLab","Datacenter":{"Type":"Datacenter","Value":"datacenter-2"}},"ComputeResource":{"Name":"CL - 2","ComputeResource":{"Type":"ClusterComputeResource","Value":"domain-c6240"}},"Host":{"Name":"esx04.ad.patrickkremer.com","Host":{"Type":"HostSystem","Value":"host-6244"}},"Vm":null,"Ds":null,"Net":null,"Dvs":null,"FullFormattedMessage":"Host esx04.ad.patrickkremer.com in HomeLab has exited maintenance mode","ChangeTag":""},"datacontenttype":"application/json"}

It is unformatted and messy when you paste it, but save it as a JSON file anyway. I named mine test.json.

Then you can run the Document Formatter with Shift-Alt-F. The result is a much more readable JSON file for you!
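If you would rather not open an editor, the same cleanup can be scripted. This is only a minimal sketch: it uses an abbreviated copy of the event shown above (most fields are left out), and the helper names pretty and cluster_name are my own, not part of VEBA.

```python
import json

# Abbreviated copy of the event from the echo log; only a few fields are kept.
raw_event = (
    '{"subject":"ExitMaintenanceModeEvent",'
    '"data":{"ComputeResource":{"Name":"CL - 2"},'
    '"Host":{"Name":"esx04.ad.patrickkremer.com"}}}'
)


def pretty(raw: str) -> str:
    """Return the payload re-indented so it is readable."""
    return json.dumps(json.loads(raw), indent=2)


def cluster_name(raw: str) -> str:
    """Pull the cluster name out of the event's data section."""
    return json.loads(raw)["data"]["ComputeResource"]["Name"]


if __name__ == "__main__":
    print(pretty(raw_event))
    print("Cluster:", cluster_name(raw_event))
```

Pulling the cluster name out this way is a quick check that the payload you captured really contains the space-hyphen name before you start testing with it.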

Once you have your JSON, you can tail the logfile for the hostmaint function. This obviously assumes you already have the hostmaint function installed. If not, check out Part VII.

kubectl get pods -A
kubectl logs -n openfaas-fn powercli-entermaint-74bfbd7688-w6t6x --follow

Now log into the OpenFaaS UI at https://your-veba-URL

Click on the powercli-entermaint function, then click on JSON, and paste the JSON into the Request Body. I am pasting the JSON that includes the space-hyphen problem. Click Invoke.

You should get a success 200 code message at the bottom.

Now look at the logfile that we tailed above: the error message appears. We can now easily troubleshoot the JSON by making changes right in the OpenFaaS UI.

All I had to do to get the function working is change the cluster name CL - 2 to CL-2 in the JSON.

When I click Invoke again, the function succeeds. Now I know I can reproduce the issue by adding space-hyphen, and I can fix the issue by removing it.
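The same experiment can be scripted instead of clicked. The sketch below is an assumption-heavy illustration, not how VEBA itself does it: with_cluster_name and invoke are hypothetical helpers, the gateway URL is whatever your VEBA address is, and I am assuming the function is reachable at the standard OpenFaaS /function/<name> path without extra authentication. Certificate checking is switched off to mirror --tls-no-verify, which only makes sense in a lab.

```python
import copy
import json
import ssl
import urllib.request


def with_cluster_name(event: dict, name: str) -> dict:
    """Return a copy of the event with the cluster name replaced."""
    changed = copy.deepcopy(event)
    changed["data"]["ComputeResource"]["Name"] = name
    return changed


def invoke(gateway: str, function: str, event: dict) -> int:
    """POST the event to an OpenFaaS function and return the HTTP status.

    Assumption: lab appliance with a self-signed certificate, so
    certificate verification is disabled (like --tls-no-verify).
    """
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    request = urllib.request.Request(
        f"{gateway}/function/{function}",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, context=context) as response:
        return response.status


# Abbreviated event; the real payload has many more fields.
event = {"data": {"ComputeResource": {"Name": "CL - 2"}}}
fixed = with_cluster_name(event, "CL-2")
```

Calling invoke("https://your-veba-URL", "powercli-entermaint", event) and then the same with fixed would give you the failing and passing cases back to back while you watch the logfile.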

The issue is that PowerShell is interpreting the space-hyphen as a command line argument. The JSON needs to be treated as a single object and not parsed. PowerShell handles this by enclosing arguments in double quotes, so we know we somehow need to get the JSON argument passed in double quotes.

To figure out why it’s currently failing, we look at the function’s Dockerfile.

It tells us that PowerShell is invoked as ‘pwsh -command’.

The fprocess variable tells us how the arguments are invoked – xargs will take whatever gets piped to it and pass it as an argument to the script.

Thank you to PK for showing me another file, template.yml, which also contains an fprocess variable. This overrides the fprocess variable in the Dockerfile and is the location where we actually need to make a change.

But how do we figure out what to change it to?

This command gets us a Bash shell on the pod running the maintenance function:

kubectl get pods -A
kubectl exec -i -t -n openfaas-fn powercli-entermaint-74bfbd7688-w6t6x -- /bin/bash

You can see the first few lines of the maintenance script running inside the pod.

Now we want to try manually executing the function inside the pod. First I need the JSON file inside the pod. Oops, no text editor in this pod. curl is there, so if you can put the JSON file on a webserver accessible to the appliance, you can save it inside the pod.

In my case I decided I wanted vi so I could easily edit the file.

tdnf install vim

I open a new file named 1.json in /root and paste my JSON file into it, then save. This version of the file contains the space-hyphen problem.

I directly execute the function in the pod and get the same error that was reported in the bug.

cat 1.json | xargs pwsh ./function/script.ps1

I copy 1.json to 2.json and change the cluster name from CL - 2 to CL-2. This time it succeeds.

cat 2.json | xargs pwsh ./function/script.ps1

Now I have reproduced the problem inside the pod and can work on a fix, which involved a bit of trial and error working with all of the options in xargs. I arrived at this command:

xargs -0 -I '{}' pwsh ./function/script.ps1 "'{}'"

The -0 switch tells xargs to treat only the NUL character as a separator, so blank spaces, quotes, and newlines in the input are passed through literally instead of being interpreted. The -I switch gives me the ability to dictate exactly how my argument gets passed. By default xargs just drops the argument in at the end of the line. But in this case, it substitutes the JSON for the {} placeholder, which I have wrapped in quotes. And we get what we want: the JSON passed to the script as a single quoted argument.
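To get a feel for what default xargs parsing does to a JSON payload, here is a rough illustration. It is not xargs itself: I am using Python's shlex module, whose POSIX-mode quoting rules are similar to (but not identical to) the way xargs treats quotes when -0 is not given. The payload is an abbreviated stand-in for the real event.

```python
import shlex

# Abbreviated stand-in for the event payload.
payload = '{"ComputeResource":{"Name":"CL - 2"}}'

# Quote-aware splitting, roughly what xargs does without -0:
# the double quotes are consumed, so the result is no longer valid JSON.
quote_aware = shlex.split(payload)

# NUL-delimited splitting, roughly what xargs -0 does:
# only NUL is special, so the payload survives character for character.
nul_delimited = [item for item in payload.split("\0") if item]

print(quote_aware)    # ['{ComputeResource:{Name:CL - 2}}']
print(nul_delimited)  # ['{"ComputeResource":{"Name":"CL - 2"}}']
```

The first result shows why the script never sees clean JSON by default, and the second shows why -0 was the switch that made the difference.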

Both versions of the JSON file succeed in the pod.

Then I make the change in template.yml. I have to rebuild the function to test it; look at Part VII for details on how to do this. But it works and I can then submit a pull request to fix this in the main VEBA repo.

Changing the SA Lifetime interval for a VMC on AWS VPN connection

The Security Association (SA) lifetime defines how long a VPN tunnel stays up before swapping out encryption keys. In VMC, the default lifetimes are 86,400 seconds (1 day) for Phase 1 and 3,600 seconds (1 hour) for Phase 2.

I had a customer that needed Phase 2 set to 86,400 seconds. Actually, they were using IKEv2 and there really isn’t a Phase 2 with IKEv2, but IKE is a discussion for a different day. Regardless, if you need the tunnel up for 86,400 seconds, you need to configure the setting as shown in this post. This can only be done via API call. I will show you how to do it through the VMC API explorer – you will not need any programming ability to execute the API calls using API explorer.

Log into VMC and go to Developer Center>API Explorer, pick your SDDC from the dropdown in the Environment section, then click on the NSX VMC Policy API.

Search for VPN in the search box, then find Policy, Networking, Network, Services, VPN, Ipsec, Ipsec, Profiles.

Expand the first GET call for /infra/ipsec-vpn-tunnel-profiles

Scroll down until you see the Execute button, then click it to execute the API call. You should get a response of type IPSecVpnTunnelProfileListResult.

Click on the result list to expand the list. The number of profiles will vary by customer – in my lab, we have 11 profiles.

I click on the first one and see my name in it, so I can identify it as the one I created and the one I want to change. I find the key sa_life_time set to 3600. This is the value that needs to change to 86,400.

Click on the down arrow next to the tunnel profile to download the JSON for this tunnel profile. Open it in a text editor and change the value from 3600 to 86400 (no commas in the number).

Now we need to push our changes back to VMC via a PATCH API call. Find the PATCH call under the GET call and expand it.

Paste the entirety of the JSON from your text editor into the TunnelProfile box. You can see that the 86400 is visible. Paste the tunnel ID into the tunnel-profile-id section – you can find the ID shown as “id” in the JSON file. Click execute. If successful, you will get a “Status 200, OK” response.
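If you ever want to script the edit instead of using a text editor, the change itself is tiny. This is a sketch under assumptions: the sample profile below is made up (a real one has more keys and a different id), and with_sa_lifetime and patch_path are hypothetical helper names. The path format simply appends the tunnel-profile-id to the /infra/ipsec-vpn-tunnel-profiles path used above.

```python
import copy

SECONDS_PER_DAY = 24 * 60 * 60  # 86400, written without commas


def with_sa_lifetime(profile: dict, seconds: int) -> dict:
    """Return a copy of the tunnel profile with a new SA lifetime."""
    changed = copy.deepcopy(profile)
    changed["sa_life_time"] = seconds
    return changed


def patch_path(profile: dict) -> str:
    """Build the path for the PATCH call from the profile's own id."""
    return f"/infra/ipsec-vpn-tunnel-profiles/{profile['id']}"


# Made-up example profile; only the two keys used here are shown.
profile = {"id": "example-profile-id", "sa_life_time": 3600}
updated = with_sa_lifetime(profile, SECONDS_PER_DAY)

print(patch_path(updated), updated["sa_life_time"])
```

Working on a copy keeps the downloaded JSON untouched, which is handy if you want to compare before and after.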

Now to verify. Find the GET request that takes a tunnel profile ID – this will return just a single tunnel profile instead of all of them.

Pass it the tunnel ID and click Execute. You should get a response with a single tunnel profile object.

Click on the response object and you should find an sa_life_time value of 86400.

Homelab – 2012 to 2019 AD upgrade

I foolishly thought that I would quickly swap out my 2012 domain controllers with 2019 domain controllers, thus beginning a weeks-long saga. I have 2 DCs in my homelab, DC1 and DC2.

Built a new DC, joined to the domain, promoted to a DC (it ran AD prep for me, nice!), transferred FSMO roles (all were on DC1), all looked great! Demoted DC1, all logins failed with ‘Domain Unavailable’.

Uh-oh.

Thankfully I had my Synology backing up my FSMO role holder DC. So I restored it from scratch. I figured I might have missed something obvious so I did it again. Same result.

Ran through all sorts of crazy debugging, ntdsutil commands looking for old metadata to clean up, found some old artifacts that I thought might have been causing the issue, and repeated the process. Same result.

Several weeks later I realized what happened – I had a failing UPS take down my Synology multiple times until I replaced it a few days ago. Guess which VM I never restarted? The Enterprise CA. The CA caused all of this. Or at least most of it. Even after I powered up the CA, I was unable to cleanly transfer all FSMO roles. Everything but the Schema Master transferred cleanly, even though they all transferred cleanly while the CA was down. I had to seize the schema master role and manually delete DC1 from ADUC – thankfully, current versions of AD do the metadata cleanup for you when you delete a DC from ADUC.

In hilarious irony, I specifically built the CA on a member server and not a domain controller to avoid upgrade problems.

In summary:

  1. When you don’t administer AD every day, you forget lots of things
  2. No AD upgrade is easy
  3. Make sure you have a domain controller backup before you start
  4. Turn on your CA
  5. Run repadmin /showrepl and dcdiag BEFORE you start messing with the domain
  6. Run repadmin /showrepl and dcdiag AFTER you add a domain controller and BEFORE you remove old domain controllers
  7. ntdsutil is like COBOL – old and verbose

The Power of Community

I’ve both given to and received from the virtualization community over my career. I passed my first VCAP with help from the vBrownBags. I’ve delivered vBrownBag Tech Talks at VMworld. I’ve been part of the Chicago VMUG as a participant or VMware employee for as long as I can remember. I’ve lost count of how many times I’ve presented content to VMUGs. Community matters to me. The impact of community is immense, and you can’t predict what kind of a positive impact community has in the moment – you can only look back and connect the dots.

This year was different from all others for me in terms of community. I was awarded vExpert status in 2020, primarily because of the body of work I have had the privilege to generate this year, highlighted in my VEBA series. I’ve learned and done things I never imagined I’d be able to do. I wrote code that is running in a VMware open source product. That’s a crazy thought for a presales person. It wouldn’t have been possible without community.

Without William Lam’s encouragement in December 2019, I would have allowed my lack of development skills and impostor syndrome to stop me from even considering contributing to an open source project. Without his guidance, I would never have been able to make some of the code changes I made.

I would have been forever lost in an ocean of Git and Kubernetes without Michael Gasch’s willingness to spend time teaching me.

PK spent hours teaching me how to write a modern website, enabling me to contribute heavily to the VEBA website.

I wouldn’t have applied for vExpert without encouragement from Robert Guske and Vladimir Velikov.

I doubt she remembers, because it was just an honest comment, but Frankie Gold wrote in Slack “That [file format is] so confusing–which makes your blog post that much more valuable…” That really stuck with me – if this Golang wizard thinks I have a valuable contribution, then I must have something valuable to contribute.

I’ve been using Martin Dekov’s MTU fixer function to help myself learn Python.

Find your community. Contribute to your community. You help it grow as it helps you grow.

My remotely proctored VMware exam experience

After taking my AWS SAA certification via remote Pearson Vue proctoring, I wanted to audit the experience from the VMware perspective to help validate that our customers are getting a good experience. I wasn’t sure which exam to register for, but since I’m a VMC on AWS SE, I decided to give the VMware Cloud on AWS Master Services Competency Specialist a shot. Fortunately, working with VMC on AWS every day for a year was very good prep for this exam, and I passed, earning the Master Specialist badge for VMware Cloud on AWS 2020.

For the most part any Pearson Vue exam is the same – same testing engine, same blue background. I expected a similar experience to the AWS exam. 

  • Prior to talking to a registration person, you have to complete a check-in process. After you log into Vue and indicate you’re ready to start your exam, you are taken to a check-in screen. You can input your cellphone number into the screen and Vue will text you a link, or you can just type the URL manually. You have to take front and back photos of your ID, a photo of your face, and a view of all 4 sides of your seating area. Once you’ve submitted those photos from your phone, you can continue checking in on the website. You could use your webcam and avoid the cellphone process altogether, but it would be tough to get all of the photos you need with a webcam.
  • You get checked in by a registration person, who is not the exam proctor. The registration person can see you on your webcam and gives you instructions either by chat or verbally.

  • They want your desk empty – pens, pencils, headsets, everything. The only thing you should have on your desk is a mouse, keyboard, laptop if necessary, monitor, and webcam. Unlike last time, the staffer didn’t care about my laptop being on the desk and didn’t question what my docking station was.

  • You’re going to have to use your webcam to show them around the room, so be prepared to take down a monitor-mounted one or to spin your laptop around.

  • If they see or hear somebody in the room with you, it’s an instant fail. Make sure you are in a room with a locked door.

  • Unplug any dual-or-greater monitor configuration before you get into the exam. Only a single monitor is allowed. Also unplug any other monitor in the room – my homelab rack and its console monitor are in the office, so it was flagged by Pearson as a problem.

  • There is no scratch paper and no dry-erase board; your only option is an electronic notepad built into the testing software. It wasn’t a big deal for this exam but I could see this being a problem for calculations and larger design problems, at least for me – I like to write things down.

  • Unlike my last exam, the proctors were immediately responsive to chat requests for help. I tested this multiple times with quick responses.

  • The process was quite a bit smoother this time around, and it surely beats driving to an exam center.

  • Once you’re in the exam it feels pretty much like taking an exam at any other test center.

VMware Event Broker Appliance – Part XI – Changing options in the OVA installer

In Part X, we talked about building the VEBA OVA from source code. In this post, I will explain the change I made that required me to rebuild the appliance.

It was a relatively simple change – although it’s best practice to keep SSH turned off, I deploy a LOT of VEBA appliances. I’m always doing some kind of testing to do my part to contribute to this open source project. I usually have to turn on SSH to do what I need to do with the appliance, so I wanted a way to have SSH automatically enabled.

This is a screenshot from the v0.4 appliance that has my change included – just a simple “Enable SSH” checkbox.

If you would like to check out the pull request, you can find it here. There were five files that needed to be changed. I am pasting screenshots from the PR on GitHub; the PR shows you all changes made to the code.

manual/photon.xml.template. This file defines all available properties in the OVA. I have named my property ‘enable_ssh’.

test/deploy_veba_eventbridge_processor.sh
test/deploy_veba_openfaas_processor.sh

– These files are used for automated deployments of the appliance in either EventBridge or OpenFaaS mode. You can see the VEBA_NOPROXY line in the EventBridge file where I deleted some inadvertent spacing that I introduced in a prior PR. The change for the SSH feature included adding the default value of False to enable SSH, then adding a line of code to push the value into the OVF for deployment.

files/setup.sh – This file extracts the values input by the user into the OVA and places them into variables for use during the rest of the appliance setup scripts.

files/setup-01-os.sh – There are 9 different shell scripts in the files folder that perform various configuration tasks when the appliance is deployed.

In the OS setup file, I removed the default code that stopped and disabled SSHD. Instead, I perform an if check on the ENABLE_SSH variable and start SSHD if the box is checked.

After I made all of the code changes, I then built the appliance as shown in Part X to test. Once everything worked, I filed the PR to get my changes incorporated into the product. Special thanks to William Lam for teaching me how this process works.

That’s all for this post. In Part XII, we will learn about advanced function troubleshooting.

My remotely proctored AWS Certified Solution Architect – Associate exam experience

I recertified my AWS SAA certification this past Friday. It was the older SAA-C01 exam, which expires on June 30th and has been replaced by the SAA-C02 exam. The SAA-C01 seemed harder than the SAA-C00 exam that I took 3 years ago; there seemed to be more speeds-and-feeds memorization, which I don’t recall the original exam focusing on.

Thank you to A Cloud Guru which is the only tool I used to pass the exam.

This is the first time I ever took a remotely proctored exam. A couple of points from this experience:

  • You get checked in by a registration person, who is not the exam proctor. The registration person can see your webcam, talks to you, and gives you instructions.
  • They’re not kidding when they say they want your desk empty – literally nothing but a keyboard and mouse. They even complained about my laptop being on the desk – it was docked and it took a fair amount of convincing to let me continue.
  • You’re going to have to use your webcam to show them around the room, so be prepared to take down a monitor-mounted one or to spin your laptop around.
  • If they see or hear somebody in the room with you, it’s an instant fail. Make sure you are in a room with a locked door.
  • Do yourself a favor and disconnect and unplug any dual-or-greater monitor configuration before you get into the exam.
  • There is no scratch paper and no dry-erase board; your only option is an electronic notepad built into the testing software. It wasn’t a big deal for this exam but I could see this being a big problem for subnetting on a CCNA exam, or calculations on a VCAP design exam.
  • I got the feeling that nobody was really minding the store. I had an issue and clicked on the “chat” for help. It took somebody like 15 minutes to respond, and they said they would call my computer but they never did.
  • Overall it wasn’t all that bad, and it surely was nice to not have to drive back and forth to a testing center. I would do it for another exam where I wasn’t expecting to have to write notes.

Changing a VMC segment type from the Developer Center

Here’s a way to use the API explorer to test out API calls. This is one technique to figure out how the API works before you start writing code in your favorite language.

First I create a disconnected segment in my SDDC

Then I go to the Developer Center in the VMC console, pick API Explorer, NSX VMC Policy API, and pick my SDDC from the dropdown.

Now I need a list of all segments – I find this in /infra/tier-1s/{tier-1-id}/segments

I provide the value ‘cgw’ for tier-1-id and click Execute

It’s easiest to view the results by clicking the down arrow to download the resulting JSON file.

Inside the file I find the section containing my test segment ‘pkremer-api-test’.

        {
            "type": "DISCONNECTED",
            "subnets": [
                {
                    "gateway_address": "192.168.209.1/24",
                    "network": "192.168.209.0/24"
                }
            ],
            "connectivity_path": "/infra/tier-1s/cgw",
            "advanced_config": {
                "hybrid": false,
                "local_egress": false,
                "connectivity": "OFF"
            },
            "resource_type": "Segment",
            "id": "15d1e170-af67-11ea-9b05-2bf145bf35c8",
            "display_name": "pkremer-api-test",
            "path": "/infra/tier-1s/cgw/segments/15d1e170-af67-11ea-9b05-2bf145bf35c8",
            "relative_path": "15d1e170-af67-11ea-9b05-2bf145bf35c8",
            "parent_path": "/infra/tier-1s/cgw",
            "marked_for_delete": false,
            "_create_user": "pkremer@vmware.com",
            "_create_time": 1592266812761,
            "_last_modified_user": "pkremer@vmware.com",
            "_last_modified_time": 1592266812761,
            "_system_owned": false,
            "_protection": "NOT_PROTECTED",
            "_revision": 0
        }
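Scrolling through a downloaded file works fine for a handful of segments, but the lookup is easy to script too. This is a sketch that assumes the list response keeps its entries under a "results" key, which is how the list I downloaded was laid out; find_segment is a hypothetical helper, and the sample data is cut down to the three fields used here.

```python
def find_segment(list_result: dict, display_name: str) -> dict:
    """Return the first segment whose display_name matches."""
    for segment in list_result.get("results", []):
        if segment.get("display_name") == display_name:
            return segment
    raise LookupError(f"no segment named {display_name!r}")


# Cut-down sample of the segment list; real entries have many more keys.
list_result = {
    "results": [
        {
            "type": "DISCONNECTED",
            "id": "15d1e170-af67-11ea-9b05-2bf145bf35c8",
            "display_name": "pkremer-api-test",
        }
    ]
}

segment = find_segment(list_result, "pkremer-api-test")
print(segment["id"], segment["type"])
```

The id it returns is the value that goes into segment-id for any follow-up call against this segment.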

Now I need to update the segment to routed, which I do by finding PATCH /infra/tier-1s/{tier-1-id}/segments/{segment-id}. I fill in the tier-1-id and segment-id values as shown (the segment-id came from the JSON output above as the “id”).

This is the code that I pasted in the Segment box:

{
    "type": "ROUTED",
    "subnets": [
        {
            "gateway_address": "192.168.209.1/24",
            "network": "192.168.209.0/24"
        }
    ],
    "advanced_config": {
        "connectivity": "ON"
    },
    "display_name": "pkremer-api-test"
}

The segment is now a routed segment.
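Once you move from API Explorer to code, the PATCH body can be derived from the segment you already fetched rather than typed by hand. This is a minimal sketch, assuming the four fields I pasted are all the call needs, as they were in my test; routed_patch_body is a hypothetical helper name and the sample segment is abbreviated.

```python
import copy


def routed_patch_body(segment: dict) -> dict:
    """Build a PATCH body that switches a segment to ROUTED."""
    return {
        "type": "ROUTED",
        "subnets": copy.deepcopy(segment["subnets"]),
        "advanced_config": {"connectivity": "ON"},
        "display_name": segment["display_name"],
    }


# Abbreviated copy of the disconnected segment from the GET call.
segment = {
    "type": "DISCONNECTED",
    "subnets": [
        {"gateway_address": "192.168.209.1/24", "network": "192.168.209.0/24"}
    ],
    "advanced_config": {"hybrid": False, "local_egress": False, "connectivity": "OFF"},
    "display_name": "pkremer-api-test",
}

body = routed_patch_body(segment)
print(body)
```

Reusing the subnets and display name from the GET result avoids typos in the gateway address, which is the easiest thing to get wrong when retyping.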

Per-zone DNS resolution for homelabs

One of the problems I’ve had with my homelab is the fact that logging into my corporate VPN every day changes my DNS servers, so I cannot resolve homelab DNS. For the past 4+ years I’ve gotten past this using hostfile entries, which is quite annoying when you’re spinning up workloads dynamically.

I posted this question to the VMware homelab Slack channel and Steve Tilkens came back with /private/etc/resolver for the Mac. He wrote:

Just create a file in that directory named whatever your lab domain name is (i.e. – “lab.local”) and the contents should contain the following:
nameserver 192.168.0.1
nameserver 192.168.0.2

This didn’t help me on Windows, but immediately helped another employee.

But then I started Googling around for things like ‘/private/etc/resolver for Windows’ and somewhere I found somebody suggesting the Windows NRPT. The first hit on my search was a Scott Lowe blog talking about using the resolver trick on a Mac so if you want a detailed explanation of the Mac stuff, check it out.

Anyway it took me like 10 seconds to open up the Local Group Policy editor (gpedit.msc) on my laptop and configure my laptop to resolve my AD domain via my homelab domain controllers. Years of searching over!

VMC on AWS – VPN, DNS Zones, TTLs

My customer reported an issue with DNS zones in VMC, so I needed to set it up in the lab to check the behavior. The DNS service in VMC allows you to specify DNS resolvers to forward requests. By default DNS is directed to 8.8.8.8 and 8.8.4.4. If you’re using Active Directory, you generally will set the forwarders to your domain controllers. But some customers need more granular control over DNS forwarding. For example – you could set the default forwarders to domain controllers for vmware.com, but maybe you just acquired Pivotal, and their domain controllers are at pivotal.com. DNS zones allow you to direct any request for *.pivotal.com to a different set of DNS servers.

Step 1

First, I needed an IPSEC VPN from my homelab into VMC. I run Ubiquiti gear at home. I decided to go with a policy-based VPN because my team’s VMC lab has many different subnets with lots of overlap on my home network. I went to the Networking & Security tab, Overview screen which gave me the VPN public IP, as well as my appliance subnet. All of the management components, including the DNS resolvers, sit in the appliance subnet. So I will need those available across the VPN. Not shown here is another subnet 10.47.159.0/24, which contains jump hosts for our VMC lab.

I set up the VPN in the Networks section of the UniFi controller – you add a VPN network just like you add any other wired or wireless network. I add the appliance subnet 10.46.192.0/18 and my jump host subnet 10.47.159.0/24. Peer IP is the VPN public IP, and local WAN IP is the public IP at home.

I could not get SHA2 working and never figured out why. Since this was just a temporary lab scenario, I went with SHA1.

On the VMC side I created a policy based VPN. I selected the Public Local IP address, which matches the 34.x.x.x VMC VPN IP shown above. The remote Public IP is my public IP at home. Remote networks – 192.168.75.0/24 which contains my laptop, and 192.168.203.0/24 which contains my domain controllers. Then for local networks I added the Appliance subnet 10.46.192.0/18 and 10.47.159.0/24. They show up as their friendly names in the UI.

The VPN comes up.

Now I need to open an inbound firewall rule on the management gateway to allow my on-prem subnets to communicate with vCenter. I populate the Sources object with subnet 192.168.75.0/24 so my laptop can hit vCenter. I also set up a reverse rule (not shown) outbound from vCenter to that same group. This isn’t specifically necessary to get DNS zones to work since we can only do that for the compute gateway – it’s the compute VMs that need the DNS zone. But I wanted to reach vCenter over the VPN.

I create similar rules on the compute gateway to allow communication between my on-prem subnets and anything behind the compute gateway – best practice would be to lock down specific subnets and ports.

I try to ping vCenter 10.46.224.4 from my laptop on 192.168.75.0/24 and it fails. I run a traceroute and I see my first hop is my VPN connection into VMware corporate. I run ‘route print’ on my laptop and see the entire 10.0.0.0/8 is routed to the VPN connection.

This means I will either have to disconnect from the corporate VPN to hit 10.x IP addresses in VMC, or I have to route around the VPN with a static route.

At an elevated command prompt, I run these commands:

route add 10.46.192.0 mask 255.255.192.0 192.168.75.1 metric 1 -p
route add 10.47.159.0 mask 255.255.255.0 192.168.75.1 metric 1 -p

This inserts two routes into my laptop’s route table. The -p means persistent, so the route will persist across reboots.
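If the mask arithmetic is not obvious, Python's ipaddress module can confirm it. The sketch below checks that 255.255.192.0 is the /18 appliance subnet, that vCenter's address falls inside it, and that the more specific route is preferred over the corporate VPN's 10.0.0.0/8. The best_route helper is my own simplification of longest-prefix matching and ignores metrics.

```python
import ipaddress

appliance = ipaddress.ip_network("10.46.192.0/18")
jump_hosts = ipaddress.ip_network("10.47.159.0/24")
corporate = ipaddress.ip_network("10.0.0.0/8")
vcenter = ipaddress.ip_address("10.46.224.4")


def best_route(address, routes):
    """Pick the matching route with the longest prefix (most specific)."""
    matches = [route for route in routes if address in route]
    return max(matches, key=lambda route: route.prefixlen)


print(appliance.netmask)   # 255.255.192.0
print(jump_hosts.netmask)  # 255.255.255.0
print(best_route(vcenter, [corporate, appliance, jump_hosts]))  # 10.46.192.0/18
```

Because the /18 and /24 routes are more specific than the /8 pushed by the corporate VPN, traffic to those two subnets goes out through the home gateway while everything else in 10.x still goes to the VPN.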

Now when I run a route print I can see the routes for my VMC appliance subnet and jump host subnet.

I can now ping vCenter by private IP from my laptop. I can also ping servers in the jump host subnet.

Now to create a DNS zone – I point to one of my domain controllers on-prem – in production you would of course point to multiple domain controllers.

I flip back to DNS services and edit the Compute Gateway forwarder. The existing DNS forwarders point to our own internal lab domain, and we don’t want to break that communication. What we do want is to have queries destined for my homelab AD redirected to my homelab DNS servers. We add the zone to the FQDN Zones box and click save.

Now we run a test – you can use nslookup, but I downloaded the BIND tools so I can use dig on Windows.

First dig against my homelab domain controller

dig @192.168.203.10 vmctest88.ad.patrickkremer.com

Then against the VMC DNS server 10.46.192.12

dig @10.46.192.12 vmctest88.ad.patrickkremer.com

The correct record appears. You can see the TTL next to the DNS entry at 60 seconds – the VMC DNS server will cache the entry for the TTL that I have configured on-prem. If I dig again, you can see the TTL counting down toward 0.

I do another lookup after the remaining 21 seconds expire and you can see a fresh record was pulled with a new TTL of 60 seconds.

Let’s make a change. I update vmctest88 to point to 192.168.203.88 instead of .188, and I update the TTL to 1 hour.

On-prem results:

VMC results:

This will be cached for 1 hour in VMC.

I switch it back to .188 with a TTL of 60 seconds on-prem, which is reflected instantly.

But in VMC, the query still returns the wrong .88 IP, with the TTL timer counting down from 3600 seconds (1 hour).

My customer had the same caching problem, except their cached TTL was 3 days and we couldn’t wait for it to fix itself. We needed to clear the DNS resolver cache. In order to do that, we go to the API. A big thank you to my coworker Matty Courtney for helping me get this part working.

You could, of course, do this programmatically. But if consuming APIs in Python isn’t your thing, you can do it from the UI. Go to the Developer Center in the VMC console, then API explorer. Pick your Org and SDDC from the dropdowns.

Click on the NSX VMC Policy API

In the NSX VMC Policy API, find Policy, Networking IP, Management, DNS, Forwarder, then this POST operation on the tier-1 DNS forwarder

Fill out the parameter values:
tier-1-id: cgw
action: clear_cache
enforcement_point: /infra/sites/default/enforcement-points/vmc-enforcementpoint

Click Execute

We see Status: 200, OK – success on the clear cache operation. We do another dig against the VMC DNS server – even though we were still within the old 1 hour cache period, the cache has been cleared. The VMC DNS server pulls the latest record from my homelab, we see the correct .188 IP with a new TTL of 60 seconds.
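The whole sequence is easier to reason about with a toy model. This is not how the VMC DNS service is implemented; it is a simplified simulation of any TTL-honoring cache, with a fake clock so nothing actually has to wait. CachingForwarder is a made-up class, and the hostname and IPs are the ones from my test.

```python
class CachingForwarder:
    """Toy DNS forwarder that caches answers until their TTL expires."""

    def __init__(self, upstream):
        self.upstream = upstream  # name -> (ip, ttl_seconds)
        self.cache = {}           # name -> (ip, expiry_time)
        self.now = 0              # fake clock, in seconds

    def advance(self, seconds):
        self.now += seconds

    def lookup(self, name):
        """Return (ip, remaining_ttl), using the cache while it is fresh."""
        cached = self.cache.get(name)
        if cached and cached[1] > self.now:
            return cached[0], cached[1] - self.now
        ip, ttl = self.upstream[name]
        self.cache[name] = (ip, self.now + ttl)
        return ip, ttl

    def clear_cache(self):
        self.cache.clear()


name = "vmctest88.ad.patrickkremer.com"
onprem = {name: ("192.168.203.88", 3600)}
vmc = CachingForwarder(onprem)

first = vmc.lookup(name)                 # cached for an hour
onprem[name] = ("192.168.203.188", 60)   # record fixed on-prem
vmc.advance(120)
stale = vmc.lookup(name)                 # still the old answer
vmc.clear_cache()
fresh = vmc.lookup(name)                 # new answer, new TTL

print(first, stale, fresh)
```

The stale lookup is exactly what my customer saw: the on-prem fix is invisible until either the old TTL runs out or somebody clears the cache.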