This is part 5 of an 8-part post covering the process used to track down and correct a problem with semanage login record group matching. If you have not already read the previous parts, you may want to start at the beginning.

Isolating the cause

Having determined that the problem was tied to the number of users in the group, I started considering where a bug of that nature might have been introduced.

The 67/68 boundary did not line up with any of the usual C pitfalls (multiples of 32, unsigned int overflows, ...), so I doubted it was a fixed limit on the member count and suspected something closer to a buffer running out of space. Either way, I wanted to replicate the problem in a more isolated environment to eliminate variables (LDAP, sssd, ...) and provide a safe place for more invasive testing. On a clean standalone system I did:

# create 70 users to work with
$ for i in $(seq 01 70) ; do adduser user$i ;done

# and group to put them in
$ groupadd largegroup

# add all of the users to the new group (-G replaces the supplementary group list, which is fine for these fresh users)
$ for i in $(seq 01 70) ; do usermod -G largegroup user$i ;done

# set a password for one of the users (so we can test with login)
$ passwd user1

# count the members of the group (quick check: this counts commas, so add 1 to get the member count)
$ getent group largegroup | grep --only-matching ,  | wc -l
69

# setup basic login policy
$ semanage login -a -s staff_u -r s0-s0:c0.c1023 'adminuser'
$ semanage login -m -s user_u -r s0-s0:c0 __default__
$ semanage login -a -s user_u -r s0-s0:c1.c2 '%largegroup'
$ service sshd start
Starting sshd:                                             [  OK  ]

# connect as 'user1', who is a member of 'largegroup' and should get s0-s0:c1.c2
$ ssh -q -x user1@localhost 'id -a'
user1@localhost's password: 
uid=501(user1) gid=502(user1) groups=502(user1),572(largegroup) context=user_u:user_r:user_t:s0-s0:c0

# did not work (user1 still got the default s0-s0:c0); remove all users from 'largegroup'
$ for i in $(seq 01 70) ; do usermod -G user$i user$i ;done
$ getent group largegroup | grep --only-matching ,  | wc -l
0

# add only 10 users back in
$ for i in $(seq 01 10) ; do usermod -G largegroup user$i ;done
$ getent group largegroup | grep --only-matching ,  | wc -l
9

# try again, now with only 10 members in the group, works correctly
$ ssh -q -x user1@localhost 'id -a'
user1@localhost's password: 
uid=501(user1) gid=502(user1) groups=502(user1),572(largegroup) context=user_u:user_r:user_t:s0-s0:c1,c2

# put 65 members in 'largegroup'
$ for i in $(seq 01 65) ; do usermod -G largegroup user$i ;done
$ getent group largegroup | grep --only-matching ,  | wc -l
64

# still working
$ ssh -q -x user1@localhost 'id -a'
user1@localhost's password: 
uid=501(user1) gid=502(user1) groups=502(user1),572(largegroup) context=user_u:user_r:user_t:s0-s0:c1,c2

# add a 66th member
$ usermod -G largegroup user66
$ getent group largegroup | grep --only-matching ,  | wc -l
65

# still working
$ ssh -q -x user1@localhost 'id -a'
user1@localhost's password: 
uid=501(user1) gid=502(user1) groups=502(user1),572(largegroup) context=user_u:user_r:user_t:s0-s0:c1,c2

# add a 67th member
$ usermod -G largegroup user67
$ getent group largegroup | grep --only-matching ,  | wc -l
66

# breaks: despite being a member of 'largegroup', user1 no longer comes in as s0-s0:c1,c2
$ ssh -q -x user1@localhost 'id -a'
user1@localhost's password: 
uid=501(user1) gid=502(user1) groups=502(user1),572(largegroup) context=user_u:user_r:user_t:s0-s0:c0

# remove a user from 'largegroup', get it down to 66 members
$ usermod -G user67 user67
$ getent group largegroup | grep --only-matching ,  | wc -l
65

# working again....
$ ssh -q -x user1@localhost 'id -a'
user1@localhost's password: 
uid=501(user1) gid=502(user1) groups=502(user1),572(largegroup) context=user_u:user_r:user_t:s0-s0:c1,c2

As suspected, this broke at a different number of users (66/67 here, versus the 67/68 seen earlier), meaning it was not a fixed limit on the member count, but probably a buffer size somewhere. To confirm this, I ran through the steps two more times, once with really long usernames and once with really short ones. The long usernames broke at 34/35 and the short ones at 83/84. The threshold clearly depended on the length of the usernames in the group, so it was almost certainly a buffer space issue.
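
If the limit really is a buffer, the interesting number is the length of the whole group entry rather than the member count. The following is only a rough sketch of how both could be measured side by side. The helper names count_members and entry_length are made up for this illustration, and it assumes the usual "name:passwd:gid:member1,member2,..." line format printed by getent.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the original investigation).
# Both read one group entry from stdin, in the format printed by
# `getent group NAME`: name:passwd:gid:member1,member2,...

# Print the number of members listed in the entry (0 for an empty list).
count_members() {
    awk -F: '{ n = split($4, members, ","); print n }'
}

# Print the length of the entry in characters, excluding the newline.
entry_length() {
    awk '{ print length($0) }'
}

# Example with a made-up three-member entry.
sample='largegroup:x:572:user1,user2,user3'
printf '%s\n' "$sample" | count_members   # prints 3
printf '%s\n' "$sample" | entry_length    # prints 34
```

On the test system the same helpers could be fed from `getent group largegroup`. Recording the entry length at each working/broken boundary (long, medium and short usernames) would show whether the failures cluster around one length, which is what a buffer limit would predict.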

Next up, part 6: Searching for the hypothesized buffer which had been outgrown.