The Problem
The PowerShell Module for NSX-T Policy API on VMware Cloud on AWS is an open source community module contributed by William Lam. The module makes NSX-T calls to VMC on AWS simple and is used by many of our VMC customers to simplify VMC management. Community modules are open source modules supported by the community, though many contributors are VMware employees.
The New-NSXTDistFirewall function allows you to create a distributed firewall rule in your SDDC. A customer recently filed a support ticket reporting intermittent failures creating DFW rules – the TSE (technical support engineer) on the case determined that the errors only happened when using certain groups. Further examination revealed that the customer had more than 1,000 groups, and the errors happened when using groups beyond that first 1,000. The TSE reached out to me to see if I could help. He also reached out to another support engineer who suggested the error might be related to API pagination.
API calls typically have limits on the results they return. Imagine if you had a database with a million rows – you wouldn’t want to dump a million rows worth of data out to JSON – instead, you deliver results in batches. APIs accomplish this with pagination.
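To make the idea concrete, here is a minimal sketch of cursor-style paging against an in-memory list. Everything in it is made up for illustration: the Get-FakePage function, the page size of 10, and the shape of the response are assumptions, not part of any VMware API.

```powershell
# Hypothetical data source: 25 items standing in for a large table.
$allItems = 1..25 | ForEach-Object { "item-$_" }

# Hypothetical paged endpoint: returns at most $PageSize items plus a cursor
# pointing at the next unread position. The last page carries no cursor.
function Get-FakePage {
    param(
        [string[]]$Items,
        [int]$Cursor = 0,
        [int]$PageSize = 10
    )
    $end  = [Math]::Min($Cursor + $PageSize, $Items.Count)
    $page = @{
        results      = @($Items[$Cursor..($end - 1)])
        result_count = $Items.Count
    }
    if ($end -lt $Items.Count) { $page.cursor = $end }
    return $page
}

# The client keeps asking for pages until no cursor comes back.
$collected = @()
$pageCount = 0
$cursor    = 0
do {
    $page       = Get-FakePage -Items $allItems -Cursor $cursor
    $collected += $page.results
    $pageCount++
    $cursor     = $page.cursor
} while ($null -ne $cursor)

"Fetched $($collected.Count) items in $pageCount pages"
```

A client that stops after the first call only ever sees the first batch, which is exactly the kind of bug this post is about.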
The Source Code
Armed with this knowledge (the problem seems to be related to more than 1,000 groups, and might be related to API pagination), I went to grab the module’s source code. I have written several posts on how to get your machine ready for code contributions; check out my VEBA post for more details.
The module can be found here in the PowerShell Gallery. A link to the Project Site then takes you to the GitHub repo hosting the code. I forked the repo:
Then got my clone URL:
Then cloned the fork:
git clone https://github.com/kremerpatrick/VMware.VMC.NSXT.git
I opened the code and searched for the New-NSXTDistFirewall function.
The first thing it does is build a list of destination and source groups, which it does by calling another function – Get-NSXTGroup.
If (-Not $global:nsxtProxyConnection) { Write-error "No NSX-T Proxy Connection found, please use Connect-NSXTProxy" } Else {
    $sectionId = (Get-NSXTDistFirewallSection -Name $Section)[0].Id
    $destinationGroups = @()
    foreach ($group in $DestinationGroup) {
        if($group -eq "ANY") {
            $destinationGroups = @("ANY")
        } else {
            $tmp = (Get-NSXTGroup -GatewayType CGW -Name $group).Path
            $destinationGroups+= $tmp
        }
    }
    $sourceGroups = @()
    foreach ($group in $SourceGroup) {
        if($group -eq "ANY") {
            $sourceGroups = @("ANY")
        } else {
            $tmp = (Get-NSXTGroup -GatewayType CGW -Name $group).Path
            $sourceGroups+= $tmp
        }
    }
I found the Get-NSXTGroup function. Near the top of it, I saw the API URL that the function calls.
$edgeFirewallGroupsURL = $global:nsxtProxyConnection.Server + "/policy/api/v1/infra/domains/$($GatewayType.toLower())/groups"
Given that the problem seems to be related to group size, I thought that my problem was likely related to this API call. Time to build a test environment.
The Test Environment
I already had an existing one-node test SDDC that I was about to delete, so I repurposed it for this.
What better tool to use to build the test environment than the tool that we’re trying to fix? I already had the module installed (Install-Module VMware.VMC.NSXT). I then ran these few lines of PowerCLI after creating my API token.
$RefreshToken = ''
Import-Module VMware.VMC.NSXT
Connect-VMC -RefreshToken $RefreshToken
Connect-NSXTProxy -RefreshToken $RefreshToken -OrgName VMC-SET-TEST -SDDCName pkremer-csp-test
for ( $i=1;$i -lt 1010;$i++) {
    $grpname = "Test-"+$i.ToString('0000')
    New-NSXTGroup -GatewayType "CGW" -Name $grpname -IPAddress "192.168.100.0/24"
}
The code creates Compute Gateway groups Test-0001 through Test-1009. They all contain the IP subnet 192.168.100.0/24. I had to add something to them and just picked a random subnet. What matters is that the total count of groups is greater than 1,000; the contents of each group don’t matter.
Here is the last page of the list of CGW groups as shown in the UI.
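As a quick check on the naming scheme, the sketch below runs the same loop bounds and format string without calling New-NSXTGroup, so it is safe to run anywhere. The $names variable exists only for this illustration.

```powershell
# Same loop bounds and format string as the script above, minus the API call.
$names = for ($i = 1; $i -lt 1010; $i++) {
    "Test-" + $i.ToString('0000')
}

$names.Count    # number of group names generated
$names[0]       # first name
$names[-1]      # last name
```

The zero-padding keeps alphabetical order identical to numeric order, which is handy when the groups are later listed sorted by name.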
The API
Time to check on the behavior of the API. I chose to use Developer Center as it’s a fast way to run an API call against your SDDC. I clicked on Developer Center, then API Explorer, selected my SDDC from the dropdown, clicked the down caret next to VMware Cloud on AWS, then clicked the NSX VMC Policy API.
The API URL from the code was ‘policy/api/v1/infra/domains/$($GatewayType.toLower())/groups’, so I searched for groups, then I found the API call that matches: /infra/domains/{domain id}/groups
I expanded the API call.
I was working with compute gateway groups, so I filled out domain-id with ‘cgw’, then clicked Execute.
I got a response and wanted to check it out in VS Code, so I clicked Download.
A single group looks like this in JSON. There are many groups in the file.
{
"expression": [
{
"ip_addresses": [
"192.168.100.0/24"
],
"resource_type": "IPAddressExpression",
"id": "5cdfa453-331c-4d26-846f-eafc34812379",
"path": "/infra/domains/cgw/groups/1ea45cb1-a396-4537-890c-4d2f561d9429/ip-address-expressions/5cdfa453-331c-4d26-846f-eafc34812379",
"relative_path": "5cdfa453-331c-4d26-846f-eafc34812379",
"parent_path": "/infra/domains/cgw/groups/1ea45cb1-a396-4537-890c-4d2f561d9429",
"marked_for_delete": false,
"overridden": false,
"_protection": "NOT_PROTECTED"
}
],
"extended_expression": [],
"reference": false,
"resource_type": "Group",
"id": "1ea45cb1-a396-4537-890c-4d2f561d9429",
"display_name": "Test-0001",
"path": "/infra/domains/cgw/groups/1ea45cb1-a396-4537-890c-4d2f561d9429",
"relative_path": "1ea45cb1-a396-4537-890c-4d2f561d9429",
"parent_path": "/infra/domains/cgw",
"unique_id": "1b7d5048-bc99-4d18-8149-9f2c7a2fdb1b",
"marked_for_delete": false,
"overridden": false,
"_create_time": 1623708730067,
"_create_user": "pkremer@vmware.com",
"_last_modified_time": 1623708730068,
"_last_modified_user": "pkremer@vmware.com",
"_system_owned": false,
"_protection": "NOT_PROTECTED",
"_revision": 0
}
I searched for “resource_type”: “Group” in the text editor and found 1,000 instances. Looks like I found the problem – the API is returning a maximum of 1,000 groups, but we have more than 1,000. This explains the behavior the customer reported – any time they use a group that isn’t in the first 1,000, it doesn’t show up.
I jumped to the bottom of the file and found this:
"result_count": 1010,
"sort_by": "display_name",
"sort_ascending": true,
"cursor": "00041000"
This told me that the API is indeed paginating the results. There are 1,010 total results, sorted ascending by display name, and I can get the next page by invoking the API with the cursor property set to 00041000.
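The same check can be done without a text search by parsing the response with ConvertFrom-Json. This is only a sketch: it uses a tiny inline sample instead of the real download, and it assumes the groups sit in a top-level results array, which the excerpts in this post do not show.

```powershell
# Tiny stand-in for the downloaded response; the real file held 1,000 groups.
# Assumption: groups are listed under a top-level "results" array.
$json = @'
{
  "results": [
    { "resource_type": "Group", "display_name": "Test-0001" },
    { "resource_type": "Group", "display_name": "Test-0002" },
    { "resource_type": "Group", "display_name": "Test-0003" }
  ],
  "result_count": 1010,
  "sort_by": "display_name",
  "sort_ascending": true,
  "cursor": "00041000"
}
'@

$response = $json | ConvertFrom-Json

# Groups actually present on this page versus the total the API reports.
$onThisPage = @($response.results | Where-Object { $_.resource_type -eq 'Group' }).Count
$missing    = $response.result_count - $onThisPage

"Groups on this page: $onThisPage of $($response.result_count) ($missing not returned)"
```

Against a real download, the here-string would be replaced with something like Get-Content -Raw on the saved file.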
I invoked the API again with the cursor.
Here are my last 10 groups.
The bottom does not give me a total count and does not give me a cursor, so I know I am on the last page.
"sort_by": "display_name",
"sort_ascending": true
Then I tested the Get-NSXTGroup function directly.
$RefreshToken = ''
Import-Module VMware.VMC.NSXT
Connect-VMC -RefreshToken $RefreshToken
Connect-NSXTProxy -RefreshToken $RefreshToken -OrgName VMC-SET-TEST -SDDCName pkremer-csp-test
Get-NSXTGroup -Name "TEST-0950" -GatewayType "CGW"
Get-NSXTGroup -Name "TEST-1005" -GatewayType "CGW"
Get-NSXTGroup -Name "TEST-0951" -GatewayType "CGW"
I didn’t get any results for group 1005.
Name : Test-0950
ID : 64902cf0-6a46-40ac-a0e6-9b193231702a
Type : USER_DEFINED
Members : {192.168.100.0/24}
Path : /infra/domains/cgw/groups/64902cf0-6a46-40ac-a0e6-9b193231702a
Name : Test-0951
ID : 31f2c3bf-f5de-4763-8d80-6f80c3740d5d
Type : USER_DEFINED
Members : {192.168.100.0/24}
Path : /infra/domains/cgw/groups/31f2c3bf-f5de-4763-8d80-6f80c3740d5d
At this point I was confident that I needed to implement pagination in the function to fix the problem.
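The rule I took away from the API test (a cursor means there are more pages, no cursor means I am done) can be written as a small helper. This is a sketch under assumptions: Get-NextPageUrl is a hypothetical name, and the base URL is a placeholder rather than a real endpoint.

```powershell
# Hypothetical helper: returns the URL for the next page, or $null on the last page.
function Get-NextPageUrl {
    param(
        [string]$BaseUrl,     # endpoint URL, with or without an existing query string
        [psobject]$Page       # one parsed API response
    )
    # ConvertFrom-Json only creates properties that exist in the JSON,
    # so a missing cursor shows up as a missing property.
    $cursorProp = $Page.PSObject.Properties['cursor']
    if ($null -eq $cursorProp -or [string]::IsNullOrEmpty($cursorProp.Value)) {
        return $null
    }
    # Append with '?' or '&' depending on whether a query string is already present.
    $separator = if ($BaseUrl.Contains('?')) { '&' } else { '?' }
    return "${BaseUrl}${separator}cursor=$($cursorProp.Value)"
}

$base      = 'https://nsx.example.invalid/groups'   # placeholder, not a real server
$firstPage = '{ "result_count": 1010, "cursor": "00041000" }' | ConvertFrom-Json
$lastPage  = '{ "sort_by": "display_name", "sort_ascending": true }' | ConvertFrom-Json

$nextUrl = Get-NextPageUrl -BaseUrl $base -Page $firstPage
$endUrl  = Get-NextPageUrl -BaseUrl $base -Page $lastPage

$nextUrl
$null -eq $endUrl
```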
The Fix
I first have to remove the existing module:
Remove-Module VMware.VMC.NSXT
My testing script now has to change. Instead of importing the module I installed from the Gallery, I directly import the PSM1 from the GitHub repo that I cloned. I remove the module at the end so any changes I make to the module get imported the next time I run my test script.
$RefreshToken = ''
Import-Module C:\git\PowerShell\VMware.VMC.NSXT\VMware.VMC.NSXT.psm1
Connect-VMC -RefreshToken $RefreshToken
Connect-NSXTProxy -RefreshToken $RefreshToken -OrgName VMC-SET-TEST -SDDCName pkremer-csp-test
Get-NSXTGroup -Name "TEST-0950" -GatewayType "CGW"
Get-NSXTGroup -Name "TEST-1005" -GatewayType "CGW"
Get-NSXTGroup -Name "TEST-0951" -GatewayType "CGW"
Remove-Module VMware.VMC.NSXT
Now it’s time to change the code to do pagination. Fortunately, there was an example of this in Get-NSXTSegment which I was able to adapt to Get-NSXTGroup. I won’t explain every line of code, but the API URL now has a query string parameter of page_size appended to it. This explicitly defines that I am expecting 1,000 records back for each API call.
$edgeFirewallGroupsURL = $global:nsxtProxyConnection.Server + "/policy/api/v1/infra/domains/$($GatewayType.toLower())/groups?page_size=1000"
Then I added a loop that follows the cursor, pulling up to 1,000 more groups per call until the code has seen the total it expects. That total is the result_count value from the initial call (1,010 groups here). As long as the running count in $seenGroups is less than $totalGroupsCount, the code requests the next page using the cursor value.
while ($seenGroups -lt $totalGroupsCount) {
    $groupsURL = $baseEdgeFirewallGroupsURL + "&cursor=$(($requests.Content | ConvertFrom-Json).cursor)"
    try {
        if($PSVersionTable.PSEdition -eq "Core") {
            $requests = Invoke-WebRequest -Uri $groupsURL -Method $method -Headers $global:nsxtProxyConnection.headers -SkipCertificateCheck
        } else {
            $requests = Invoke-WebRequest -Uri $groupsURL -Method $method -Headers $global:nsxtProxyConnection.headers
        }
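Because the excerpt above stops partway through the loop, here is a self-contained sketch of the same idea that runs without an SDDC. It is not the module's actual code: Invoke-FakeGroupsApi is a made-up stand-in for the Invoke-WebRequest call, and the 10 fake groups and page size of 4 are arbitrary.

```powershell
# Made-up stand-in for the groups endpoint: 10 groups, served 4 per page.
$fakeGroups = 1..10 | ForEach-Object { [pscustomobject]@{ display_name = 'Test-{0:0000}' -f $_ } }

function Invoke-FakeGroupsApi {
    param([string]$Cursor = '0', [int]$PageSize = 4)
    $start = [int]$Cursor
    $end   = [Math]::Min($start + $PageSize, $fakeGroups.Count)
    $page  = @{
        results      = @($fakeGroups[$start..($end - 1)])
        result_count = $fakeGroups.Count
    }
    # Only pages that have a successor carry a cursor.
    if ($end -lt $fakeGroups.Count) { $page.cursor = "$end" }
    return $page
}

# First call, then keep following the cursor until every group has been seen.
$response         = Invoke-FakeGroupsApi
$groups           = @($response.results)
$totalGroupsCount = $response.result_count
$seenGroups       = $groups.Count

while ($seenGroups -lt $totalGroupsCount) {
    $response    = Invoke-FakeGroupsApi -Cursor $response.cursor
    $groups     += $response.results
    $seenGroups  = $groups.Count
}

# A name lookup now sees groups from every page, not only the first one.
$match = $groups | Where-Object { $_.display_name -eq 'Test-0009' }
$match
```

A more defensive version would probably also stop when a page comes back without a cursor, so that a miscounted total cannot cause an endless loop.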
Once the code has been implemented, the test function returns 3 results – it does not skip over Test-1005.
Retrieving NSX-T Groups …
Name : Test-0950
ID : 64902cf0-6a46-40ac-a0e6-9b193231702a
Type : USER_DEFINED
Members : {192.168.100.0/24}
Path : /infra/domains/cgw/groups/64902cf0-6a46-40ac-a0e6-9b193231702a
Retrieving NSX-T Groups …
Name : Test-1005
ID : 50bcda38-fa4d-4a49-95cb-30ca31d167b0
Type : USER_DEFINED
Members : {192.168.100.0/24}
Path : /infra/domains/cgw/groups/50bcda38-fa4d-4a49-95cb-30ca31d167b0
Retrieving NSX-T Groups …
Name : Test-0951
ID : 31f2c3bf-f5de-4763-8d80-6f80c3740d5d
Type : USER_DEFINED
Members : {192.168.100.0/24}
Path : /infra/domains/cgw/groups/31f2c3bf-f5de-4763-8d80-6f80c3740d5d
After I was satisfied that all tests worked, I bumped the version in the .psd1 file from 1.0.9 to 1.0.10.
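One small thing worth keeping in mind with a bump like this: 1.0.10 is newer than 1.0.9 only when the values are compared as versions, not as plain strings. A quick sketch using the .NET [version] type (the variable names are just for illustration):

```powershell
$old = '1.0.9'
$new = '1.0.10'

# Compared as plain strings, '1.0.10' sorts before '1.0.9' ...
$stringSaysNewer  = $new -gt $old

# ... but compared as versions, 1.0.10 is correctly the newer release.
$versionSaysNewer = [version]$new -gt [version]$old

"String comparison says newer:  $stringSaysNewer"
"Version comparison says newer: $versionSaysNewer"
```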
The final step was filing a pull request. You can find more documentation on pull requests in this post. A pull request notifies the maintainer (William Lam, in this case) that new code has been contributed. He reviewed it, approved it, merged it, and pushed version 1.0.10 to the PowerShell Gallery.