From the Kubelet Eviction Mechanism to Linux Kernel Memory Reclaim: A Complete Walkthrough
Overview
This document walks through the complete flow from the startup of the kubelet eviction manager (evictionManager) to the point where the Linux kernel's memory reclaim machinery is triggered. The process spans several cooperating layers, from user-space Kubernetes components down to kernel-space memory management.
1. Kubelet Startup - Initializing the evictionManager
1.1 Startup Entry Point
File: pkg/kubelet/kubelet.go#L1388-1389
// Start the evictionManager from kubelet's Run method
kl.evictionManager.Start(kl.StatsProvider, kl.GetActivePods, podCleanedUpFunc, evictionMonitoringPeriod)
1.2 Creating the evictionManager
In the NewMainKubelet function, the evictionManager is created as follows:
// Create the eviction manager
evictionManager, evictionAdmitHandler := eviction.NewManager(
    klet.resourceAnalyzer,
    evictionConfig,
    killPodNow(klet.podWorkers, klet.recorder),
    klet.mirrorPodClient.GetMirrorPodByPod,
    klet.imageManager,
    klet.containerGC,
    klet.recorder,
    nodeRef,
    klet.clock,
)
Key parameters:
resourceAnalyzer: provides system resource statistics
evictionConfig: eviction configuration, including threshold settings
killPodNow: function used to terminate a Pod
imageManager: image garbage collector
containerGC: container garbage collector
2. Eviction Manager Startup Flow
2.1 The Start Method
File: pkg/kubelet/eviction/eviction_manager.go#L177-206
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, podCleanedUpFunc PodCleanedUpFunc, monitoringInterval time.Duration) {
    // 1. Create the threshold handler
    thresholdHandler := func(message string) {
        klog.InfoS(message)
        m.synchronize(diskInfoProvider, podFunc)
    }
    // 2. If kernel memcg notification is enabled
    if m.config.KernelMemcgNotification {
        for _, threshold := range m.config.Thresholds {
            if threshold.Signal == evictionapi.SignalMemoryAvailable || threshold.Signal == evictionapi.SignalAllocatableMemoryAvailable {
                // Create a memory threshold notifier
                notifier, err := NewMemoryThresholdNotifier(threshold, m.config.PodCgroupRoot, &CgroupNotifierFactory{}, thresholdHandler)
                if err != nil {
                    klog.InfoS("Eviction manager: failed to create memory threshold notifier", "err", err)
                } else {
                    // Start the notifier
                    go notifier.Start()
                    m.thresholdNotifiers = append(m.thresholdNotifiers, notifier)
                }
            }
        }
    }
    // 3. Start the main monitoring loop
    go func() {
        for {
            if evictedPods := m.synchronize(diskInfoProvider, podFunc); evictedPods != nil {
                klog.InfoS("Eviction manager: pods evicted, waiting for pod to be cleaned up", "pods", format.Pods(evictedPods))
                m.waitForPodsCleanup(podCleanedUpFunc, evictedPods)
            } else {
                time.Sleep(monitoringInterval)
            }
        }
    }()
}
Key steps in the startup flow:
Create the threshold handler: calls the synchronize method whenever a memory threshold notification fires
Initialize memory notifiers: if KernelMemcgNotification is enabled, create a notifier for each memory-related threshold
Start the monitoring loop: periodically run synchronize to check resource usage
3. Memory Threshold Notifier Mechanism
3.1 Creating a Notifier with NewMemoryThresholdNotifier
File: pkg/kubelet/eviction/memory_threshold_notifier.go#L44-65
func NewMemoryThresholdNotifier(threshold evictionapi.Threshold, cgroupRoot string, factory NotifierFactory, handler func(string)) (ThresholdNotifier, error) {
    // 1. Look up cgroup subsystem information
    cgroups, err := cm.GetCgroupSubsystems()
    if err != nil {
        return nil, err
    }
    // 2. Find the memory cgroup mount point
    cgpath, found := cgroups.MountPoints["memory"]
    if !found || len(cgpath) == 0 {
        return nil, fmt.Errorf("memory cgroup mount point not found")
    }
    // 3. For allocatable thresholds, point at the allocatable cgroup
    if isAllocatableEvictionThreshold(threshold) {
        cgpath += cgroupRoot
    }
    return &memoryThresholdNotifier{
        threshold:  threshold,
        cgroupPath: cgpath,
        events:     make(chan struct{}),
        handler:    handler,
        factory:    factory,
    }, nil
}
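The mount-point lookup above is hidden behind cm.GetCgroupSubsystems(). As a rough standalone illustration of what that resolution amounts to on a cgroup v1 host, the sketch below scans /proc/mounts for the memory controller; findMemoryCgroupMount is a hypothetical helper written for this document, not kubelet code.

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// findMemoryCgroupMount scans /proc/mounts for a cgroup v1 mount whose
// options include the "memory" controller and returns its mount point.
func findMemoryCgroupMount() (string, error) {
    f, err := os.Open("/proc/mounts")
    if err != nil {
        return "", err
    }
    defer f.Close()

    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        // Format: <device> <mountpoint> <fstype> <options> <dump> <pass>
        fields := strings.Fields(scanner.Text())
        if len(fields) < 4 || fields[2] != "cgroup" {
            continue
        }
        for _, opt := range strings.Split(fields[3], ",") {
            if opt == "memory" {
                return fields[1], nil
            }
        }
    }
    if err := scanner.Err(); err != nil {
        return "", err
    }
    return "", fmt.Errorf("memory cgroup mount point not found")
}

func main() {
    path, err := findMemoryCgroupMount()
    if err != nil {
        fmt.Println("lookup failed:", err)
        return
    }
    fmt.Println("memory cgroup mounted at:", path) // typically /sys/fs/cgroup/memory
}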
3.2 The UpdateThreshold Method - Dynamic Threshold Calculation
File: pkg/kubelet/eviction/memory_threshold_notifier.go#L75-105
func (m *memoryThresholdNotifier) UpdateThreshold(summary *statsapi.Summary) error {
    // 1. Get the memory statistics
    memoryStats := summary.Node.Memory
    if isAllocatableEvictionThreshold(m.threshold) {
        allocatableContainer, err := getSysContainer(summary.Node.SystemContainers, statsapi.SystemContainerPods)
        if err != nil {
            return err
        }
        memoryStats = allocatableContainer.Memory
    }
    // 2. Compute the memory threshold.
    // Set the threshold to: capacity - eviction_hard + inactive_file,
    // so that we are notified exactly when working_set = capacity - eviction_hard.
    inactiveFile := resource.NewQuantity(int64(*memoryStats.UsageBytes-*memoryStats.WorkingSetBytes), resource.BinarySI)
    capacity := resource.NewQuantity(int64(*memoryStats.AvailableBytes+*memoryStats.WorkingSetBytes), resource.BinarySI)
    evictionThresholdQuantity := evictionapi.GetThresholdQuantity(m.threshold.Value, capacity)
    memcgThreshold := capacity.DeepCopy()
    memcgThreshold.Sub(*evictionThresholdQuantity)
    memcgThreshold.Add(*inactiveFile)
    // 3. Create a new cgroup notifier
    if m.notifier != nil {
        m.notifier.Stop()
    }
    newNotifier, err := m.factory.NewCgroupNotifier(m.cgroupPath, memoryUsageAttribute, memcgThreshold.Value())
    if err != nil {
        return err
    }
    m.notifier = newNotifier
    // 4. Start the notifier
    go m.notifier.Start(m.events)
    return nil
}
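To make the threshold formula concrete, here is a small worked example with made-up numbers (8 GiB of capacity, 500 MiB of inactive file pages, and a 100Mi memory.available hard threshold); only the arithmetic mirrors the code above.

package main

import "fmt"

const (
    Mi = int64(1) << 20
    Gi = int64(1) << 30
)

func main() {
    // Hypothetical node stats, not taken from a real summary.
    capacity := 8 * Gi       // AvailableBytes + WorkingSetBytes
    inactiveFile := 500 * Mi // UsageBytes - WorkingSetBytes
    evictionHard := 100 * Mi // memory.available: "100Mi"

    // memcg threshold = capacity - eviction_hard + inactive_file
    memcgThreshold := capacity - evictionHard + inactiveFile

    fmt.Printf("memory.usage_in_bytes threshold: %d bytes (%.2f GiB)\n",
        memcgThreshold, float64(memcgThreshold)/float64(Gi))
    // The kernel signals the eventfd when usage (working set + inactive file)
    // crosses this value, i.e. exactly when the working set reaches
    // capacity - eviction_hard and memory.available drops below 100Mi.
}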
4. The Linux Cgroup Notifier
4.1 Creating a Notifier with NewCgroupNotifier
File: pkg/kubelet/eviction/threshold_notifier_linux.go#L48-99
func NewCgroupNotifier(path, attribute string, threshold int64) (CgroupNotifier, error) {
    var watchfd, eventfd, epfd, controlfd int
    var err error
    // 1. Open the file being watched (memory.usage_in_bytes)
    watchfd, err = unix.Open(fmt.Sprintf("%s/%s", path, attribute), unix.O_RDONLY|unix.O_CLOEXEC, 0)
    if err != nil {
        return nil, err
    }
    defer unix.Close(watchfd)
    // 2. Open the control file (cgroup.event_control)
    controlfd, err = unix.Open(fmt.Sprintf("%s/cgroup.event_control", path), unix.O_WRONLY|unix.O_CLOEXEC, 0)
    if err != nil {
        return nil, err
    }
    defer unix.Close(controlfd)
    // 3. Create the eventfd
    eventfd, err = unix.Eventfd(0, unix.EFD_CLOEXEC)
    if err != nil {
        return nil, err
    }
    // 4. Create the epoll instance
    epfd, err = unix.EpollCreate1(unix.EPOLL_CLOEXEC)
    if err != nil {
        return nil, err
    }
    // 5. Register the cgroup event watch
    config := fmt.Sprintf("%d %d %d", eventfd, watchfd, threshold)
    _, err = unix.Write(controlfd, []byte(config))
    if err != nil {
        return nil, err
    }
    return &linuxCgroupNotifier{
        eventfd: eventfd,
        epfd:    epfd,
        stop:    make(chan struct{}),
    }, nil
}
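The same registration protocol can be exercised outside kubelet. The standalone sketch below registers a usage threshold on a cgroup v1 memory controller and blocks until the kernel signals it; the cgroup path and the 1 GiB threshold are assumptions for illustration, and it requires cgroup v1 plus sufficient privileges.

package main

import (
    "encoding/binary"
    "fmt"

    "golang.org/x/sys/unix"
)

func main() {
    cgroup := "/sys/fs/cgroup/memory/kubepods" // hypothetical path
    threshold := int64(1 << 30)                // 1 GiB

    watchfd, err := unix.Open(cgroup+"/memory.usage_in_bytes", unix.O_RDONLY|unix.O_CLOEXEC, 0)
    if err != nil {
        panic(err)
    }
    defer unix.Close(watchfd)

    controlfd, err := unix.Open(cgroup+"/cgroup.event_control", unix.O_WRONLY|unix.O_CLOEXEC, 0)
    if err != nil {
        panic(err)
    }
    defer unix.Close(controlfd)

    efd, err := unix.Eventfd(0, unix.EFD_CLOEXEC)
    if err != nil {
        panic(err)
    }
    defer unix.Close(efd)

    // "<eventfd> <fd of watched file> <threshold>" is the cgroup v1 protocol.
    if _, err := unix.Write(controlfd, []byte(fmt.Sprintf("%d %d %d", efd, watchfd, threshold))); err != nil {
        panic(err)
    }

    // Block until the kernel calls eventfd_signal() for this threshold.
    buf := make([]byte, 8)
    if _, err := unix.Read(efd, buf); err != nil {
        panic(err)
    }
    fmt.Println("threshold crossed, counter =", binary.LittleEndian.Uint64(buf))
}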
4.2 The Start Method - Event Listening Loop
File: pkg/kubelet/eviction/threshold_notifier_linux.go#L101-133
func (n *linuxCgroupNotifier) Start(eventCh chan<- struct{}) {
    // 1. Add the eventfd to the epoll instance
    err := unix.EpollCtl(n.epfd, unix.EPOLL_CTL_ADD, n.eventfd, &unix.EpollEvent{
        Fd:     int32(n.eventfd),
        Events: unix.EPOLLIN,
    })
    if err != nil {
        klog.InfoS("Eviction manager: error adding epoll eventfd", "err", err)
        return
    }
    // 2. Event listening loop
    for {
        select {
        case <-n.stop:
            return
        default:
        }
        // 3. Wait for an epoll event
        event, err := wait(n.epfd, n.eventfd, notifierRefreshInterval)
        if err != nil {
            klog.InfoS("Eviction manager: error while waiting for memcg events", "err", err)
            return
        } else if !event {
            continue // timeout, keep waiting
        }
        // 4. Consume the eventfd event
        buf := make([]byte, eventSize)
        _, err = unix.Read(n.eventfd, buf)
        if err != nil {
            klog.InfoS("Eviction manager: error reading memcg events", "err", err)
            return
        }
        // 5. Send a notification on the event channel
        eventCh <- struct{}{}
    }
}
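The wait helper used in step 3 is not reproduced above. A minimal sketch of what such a timeout-bounded epoll wait can look like is shown below; it is simplified relative to kubelet's real helper, and waitSketch is a name invented here.

package notifier

import (
    "time"

    "golang.org/x/sys/unix"
)

// waitSketch blocks on the epoll fd for up to timeout and reports whether the
// given eventfd became readable. A simplified stand-in for the wait helper
// referenced above.
func waitSketch(epfd, eventfd int, timeout time.Duration) (bool, error) {
    events := make([]unix.EpollEvent, 1)
    n, err := unix.EpollWait(epfd, events, int(timeout.Milliseconds()))
    if err != nil {
        if err == unix.EINTR {
            // Interrupted by a signal; treat it like a timeout and retry later.
            return false, nil
        }
        return false, err
    }
    for i := 0; i < n; i++ {
        if events[i].Fd == int32(eventfd) && events[i].Events&unix.EPOLLIN != 0 {
            return true, nil
        }
    }
    return false, nil // timed out
}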
5. The Eviction Control Loop - the synchronize Method
5.1 Main Execution Flow
File: pkg/kubelet/eviction/eviction_manager.go#L231-393
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
    // 1. Check the configured thresholds
    thresholds := m.config.Thresholds
    if len(thresholds) == 0 && !utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolation) {
        return nil
    }
    // 2. Gather system resource statistics
    activePods := podFunc()
    updateStats := true
    summary, err := m.summaryProvider.Get(updateStats)
    if err != nil {
        klog.ErrorS(err, "Eviction manager: failed to get summary stats")
        return nil
    }
    // 3. Refresh the threshold notifiers
    if m.clock.Since(m.thresholdsLastUpdated) > notifierRefreshInterval {
        m.thresholdsLastUpdated = m.clock.Now()
        for _, notifier := range m.thresholdNotifiers {
            if err := notifier.UpdateThreshold(summary); err != nil {
                klog.InfoS("Eviction manager: failed to update notifier", "notifier", notifier.Description(), "err", err)
            }
        }
    }
    // 4. Build signal observations and check which thresholds are met
    observations, statsFunc := makeSignalObservations(summary)
    thresholds = thresholdsMet(thresholds, observations, false)
    // 5. Check thresholds that were met before but are not yet resolved
    if len(m.thresholdsMet) > 0 {
        thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
        thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
    }
    // 6. Track when each threshold was first observed
    now := m.clock.Now()
    thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)
    // 7. Update node conditions
    nodeConditions := nodeConditions(thresholds)
    nodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)
    nodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)
    // 8. Determine which thresholds require eviction (grace period satisfied)
    thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)
    // 9. Update internal state
    m.Lock()
    m.nodeConditions = nodeConditions
    m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
    m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
    m.thresholdsMet = thresholds
    thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)
    m.lastObservations = observations
    m.Unlock()
    // 10. Local storage eviction check
    if utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolation) {
        if evictedPods := m.localStorageEviction(activePods, statsFunc); len(evictedPods) > 0 {
            return evictedPods
        }
    }
    // 11. If no threshold has been met, return
    if len(thresholds) == 0 {
        klog.V(3).InfoS("Eviction manager: no resources are starved")
        return nil
    }
    // 12. Sort thresholds by eviction priority
    sort.Sort(byEvictionPriority(thresholds))
    thresholdToReclaim, resourceToReclaim, foundAny := getReclaimableThreshold(thresholds)
    if !foundAny {
        return nil
    }
    // 13. Record an eviction event
    m.recorder.Eventf(m.nodeRef, v1.EventTypeWarning, "EvictionThresholdMet", "Attempting to reclaim %s", resourceToReclaim)
    // 14. Try node-level resource reclaim first
    if m.reclaimNodeLevelResources(thresholdToReclaim.Signal, resourceToReclaim) {
        klog.InfoS("Eviction manager: able to reduce resource pressure without evicting pods.", "resourceName", resourceToReclaim)
        return nil
    }
    // 15. Pods must be evicted
    klog.InfoS("Eviction manager: must evict pod(s) to reclaim", "resourceName", resourceToReclaim)
    // 16. Look up the ranking function for this signal
    rank, ok := m.signalToRankFunc[thresholdToReclaim.Signal]
    if !ok {
        klog.ErrorS(nil, "Eviction manager: no ranking function for signal", "threshold", thresholdToReclaim.Signal)
        return nil
    }
    if len(activePods) == 0 {
        klog.ErrorS(nil, "Eviction manager: eviction thresholds have been met, but no pods are active to evict")
        return nil
    }
    // 17. Rank the running pods for eviction
    rank(activePods, statsFunc)
    // 18. Record metrics
    for _, t := range thresholds {
        timeObserved := observations[t.Signal].time
        if !timeObserved.IsZero() {
            metrics.EvictionStatsAge.WithLabelValues(string(t.Signal)).Observe(metrics.SinceInSeconds(timeObserved.Time))
        }
    }
    // 19. Kill at most one pod per eviction interval
    for i := range activePods {
        pod := activePods[i]
        gracePeriodOverride := int64(0)
        if !isHardEvictionThreshold(thresholdToReclaim) {
            gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
        }
        message, annotations := evictionMessage(resourceToReclaim, pod, statsFunc)
        if m.evictPod(pod, gracePeriodOverride, message, annotations) {
            metrics.Evictions.WithLabelValues(string(thresholdToReclaim.Signal)).Inc()
            return []*v1.Pod{pod}
        }
    }
    klog.InfoS("Eviction manager: unable to evict any pods from the node")
    return nil
}
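Step 17 defers to a per-signal ranking function from signalToRankFunc; for memory pressure the real functions order pods roughly by whether usage exceeds requests, by pod priority, and by how far usage exceeds requests. The sketch below illustrates that idea with a simplified podMemStat type that is not part of kubelet.

package main

import (
    "fmt"
    "sort"
)

// podMemStat is a simplified stand-in for what the real rank functions read
// from the stats provider.
type podMemStat struct {
    name        string
    priority    int32
    workingSet  int64 // bytes currently in use
    memRequests int64 // bytes requested
}

// rankForMemory orders pods so that the best eviction candidates come first:
// lower priority first, then larger usage above requests first.
func rankForMemory(pods []podMemStat) {
    sort.SliceStable(pods, func(i, j int) bool {
        if pods[i].priority != pods[j].priority {
            return pods[i].priority < pods[j].priority
        }
        exceedI := pods[i].workingSet - pods[i].memRequests
        exceedJ := pods[j].workingSet - pods[j].memRequests
        return exceedI > exceedJ
    })
}

func main() {
    pods := []podMemStat{
        {"batch-job", 0, 900 << 20, 200 << 20},
        {"web-frontend", 1000, 600 << 20, 512 << 20},
        {"cache", 0, 300 << 20, 400 << 20},
    }
    rankForMemory(pods)
    for _, p := range pods {
        fmt.Println(p.name)
    }
    // batch-job is ranked first: lowest priority and furthest above its request.
}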
6. Executing the Pod Eviction
6.1 The evictPod Method
File: pkg/kubelet/eviction/eviction_manager.go#L555-578
func (m *managerImpl) evictPod(pod *v1.Pod, gracePeriodOverride int64, evictMsg string, annotations map[string]string) bool {
    // 1. Never evict critical pods
    if kubelettypes.IsCriticalPod(pod) {
        klog.ErrorS(nil, "Eviction manager: cannot evict a critical pod", "pod", klog.KObj(pod))
        return false
    }
    // 2. Build the terminal pod status
    status := v1.PodStatus{
        Phase:   v1.PodFailed,
        Message: evictMsg,
        Reason:  Reason,
    }
    // 3. Record the eviction event
    m.recorder.AnnotatedEventf(pod, annotations, v1.EventTypeWarning, Reason, evictMsg)
    // 4. Call killPodFunc to terminate the pod
    err := m.killPodFunc(pod, status, &gracePeriodOverride)
    if err != nil {
        klog.ErrorS(err, "Eviction manager: pod failed to evict", "pod", klog.KObj(pod))
    } else {
        klog.InfoS("Eviction manager: pod is evicted successfully", "pod", klog.KObj(pod))
    }
    return true
}
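killPodFunc is injected when the manager is constructed (killPodNow in NewMainKubelet, section 1.2). For experimentation, a stub matching the call shape used above might look like the following; fakeKillPod is purely illustrative, while kubelet's real implementation goes through the pod workers.

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

// killPodFunc matches how evictPod invokes the injected function: the pod,
// the terminal status to record, and an optional grace period override.
type killPodFunc func(pod *v1.Pod, status v1.PodStatus, gracePeriodOverride *int64) error

// fakeKillPod is a stand-in used only for illustration; it just logs the call.
func fakeKillPod(pod *v1.Pod, status v1.PodStatus, gracePeriodOverride *int64) error {
    grace := int64(0)
    if gracePeriodOverride != nil {
        grace = *gracePeriodOverride
    }
    fmt.Printf("would kill pod %s/%s (reason=%s, grace=%ds)\n",
        pod.Namespace, pod.Name, status.Reason, grace)
    return nil
}

func main() {
    var kill killPodFunc = fakeKillPod
    pod := &v1.Pod{}
    pod.Namespace = "default"
    pod.Name = "memory-hog"
    grace := int64(0)
    _ = kill(pod, v1.PodStatus{Phase: v1.PodFailed, Reason: "Evicted"}, &grace)
}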
7. Connecting to the Linux Kernel Memory Reclaim Mechanism
7.1 The cgroup.event_control Mechanism
When kubelet registers a memory threshold watch through cgroup.event_control, the Linux kernel will:
Track memory usage: the kernel continuously tracks the memory.usage_in_bytes file of the specified cgroup
Check thresholds: when memory usage crosses the configured threshold, the kernel signals the eventfd
Deliver the event: kubelet receives the eventfd event through its epoll loop
7.2 Kernel Memory Reclaim Trigger Paths
When memory usage exceeds its limits, the Linux kernel performs the following operations:
7.2.1 Memory Allocation Path
Kernel source: mm/page_alloc.c
// Main memory allocation call chain
struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                                    int preferred_nid, nodemask_t *nodemask)
{
    struct page *page;
    // Fast-path allocation
    page = get_page_from_freelist(gfp_mask, order, alloc_flags, &ac);
    if (likely(page))
        return page;
    // Slow-path allocation
    return __alloc_pages_slowpath(gfp_mask, order, &ac);
}

// Slow-path handling
static struct page *__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                                           struct alloc_context *ac)
{
    // 1. Try direct reclaim
    if (gfp_mask & __GFP_DIRECT_RECLAIM) {
        page = __alloc_pages_direct_reclaim(gfp_mask, order, ac);
        if (page)
            return page;
    }
    // 2. Try memory compaction
    if (can_direct_reclaim) {
        page = __alloc_pages_direct_compact(gfp_mask, order, ac);
        if (page)
            return page;
    }
    // 3. Possibly invoke the OOM killer
    if (gfp_mask & __GFP_NOFAIL) {
        page = __alloc_pages_may_oom(gfp_mask, order, ac);
        if (page)
            return page;
    }
    return NULL;
}
Call flow:
__alloc_pages_nodemask() // mm/page_alloc.c:4893
└── __alloc_pages_slowpath() // mm/page_alloc.c:4420
├── __alloc_pages_direct_reclaim() // mm/page_alloc.c:3789
│ └── __perform_reclaim() // mm/page_alloc.c:3756
│ └── try_to_free_pages() // mm/vmscan.c:3234
└── __alloc_pages_direct_compact() // mm/page_alloc.c:3889
7.2.2 Memory Reclaim Mechanism
Kernel source: mm/vmscan.c
// Main memory reclaim entry point
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                gfp_t gfp_mask, nodemask_t *nodemask)
{
    unsigned long nr_reclaimed;
    struct scan_control sc = {
        .nr_to_reclaim = SWAP_CLUSTER_MAX,
        .gfp_mask = gfp_mask,
        .order = order,
        .nodemask = nodemask,
        .priority = DEF_PRIORITY,
        .may_writepage = !laptop_mode,
        .may_unmap = 1,
        .may_swap = 1,
    };
    nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
    return nr_reclaimed;
}

// Core function that performs the reclaim
static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
                                          struct scan_control *sc)
{
    int initial_priority = sc->priority;
    unsigned long total_scanned = 0;
    unsigned long writeback_threshold;
retry:
    delayacct_freepages_start();
    if (global_reclaim(sc))
        shrink_zones(zonelist, sc); // shrink the memory zones
    total_scanned += sc->nr_scanned;
    if (sc->nr_reclaimed >= sc->nr_to_reclaim)
        return sc->nr_reclaimed;
    // If not enough was reclaimed, lower the priority and retry
    if (--sc->priority >= 0)
        goto retry;
    // Last resort: invoke the OOM killer
    if (sc->order && sc->priority == 0) {
        out_of_memory(&oc);
    }
    delayacct_freepages_end();
    return sc->nr_reclaimed;
}
Reclaim call flow:
try_to_free_pages() // mm/vmscan.c:3234
└── do_try_to_free_pages() // mm/vmscan.c:3140
├── shrink_zones() // mm/vmscan.c:2985
│ └── shrink_zone() // mm/vmscan.c:2756
│ ├── shrink_lruvec() // mm/vmscan.c:2456
│ │ ├── shrink_list() // mm/vmscan.c:2089
│ │ └── shrink_active_list() // mm/vmscan.c:1943
│ └── shrink_slab() // mm/vmscan.c:567
└── out_of_memory() // mm/oom_kill.c:1071
└── select_bad_process() // mm/oom_kill.c:379
└── oom_kill_process() // mm/oom_kill.c:874
└── oom_kill_task() // mm/oom_kill.c:834
7.2.3 cgroup Memory Reclaim
Kernel source: mm/memcontrol.c
// Try to charge memory to a cgroup
static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
                                 gfp_t gfp_mask,
                                 unsigned int nr_pages,
                                 struct mem_cgroup **memcgp,
                                 bool may_swap)
{
    unsigned int batch = max(CHARGE_BATCH, nr_pages);
    int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
    struct mem_cgroup *mem_over_limit;
    struct res_counter *fail_res;
    unsigned long nr_reclaimed;
    unsigned long flags = 0;
    int ret;
retry:
    ret = res_counter_charge(&memcg->res, batch, &fail_res);
    if (likely(!ret)) {
        if (!do_swap_account)
            return 0;
        ret = res_counter_charge(&memcg->memsw, batch, &fail_res);
        if (likely(!ret))
            return 0;
        res_counter_uncharge(&memcg->res, batch);
        mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
        flags |= MEM_CGROUP_RECLAIM_NOSWAP;
    } else
        mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
    if (batch > nr_pages) {
        batch = nr_pages;
        goto retry;
    }
    // Try cgroup-level memory reclaim
    nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
    if (nr_reclaimed && nr_retries--)
        goto retry;
    // If reclaim failed, the cgroup OOM killer may be invoked
    if (!nr_reclaimed) {
        if (oom_check_kill(mem_over_limit)) {
            mem_cgroup_out_of_memory(mem_over_limit, gfp_mask, get_order(batch));
        }
    }
    return -ENOMEM;
}

// cgroup-level memory reclaim
static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
                                        gfp_t gfp_mask,
                                        unsigned long flags)
{
    unsigned long total = 0;
    bool noswap = false;
    int loop;
    if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
        noswap = true;
    if (!(flags & MEM_CGROUP_RECLAIM_SHRINK) && memcg->memsw_is_minimum)
        noswap = true;
    for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
        total += try_to_free_mem_cgroup_pages(memcg, gfp_mask, noswap);
        if (mem_cgroup_margin(memcg))
            break;
        if (!(gfp_mask & __GFP_WAIT))
            break;
    }
    return total;
}
cgroup memory reclaim call flow:
mem_cgroup_try_charge() // mm/memcontrol.c:2567
└── mem_cgroup_do_charge() // mm/memcontrol.c:2456
├── mem_cgroup_reclaim() // mm/memcontrol.c:1234
│ └── try_to_free_mem_cgroup_pages() // mm/memcontrol.c:1189
│ └── do_try_to_free_pages() // mm/vmscan.c:3140
└── mem_cgroup_out_of_memory() // mm/memcontrol.c:1567
└── mem_cgroup_oom_kill() // mm/memcontrol.c:1523
7.3 The eventfd Notification Mechanism
Kernel source: mm/memcontrol.c
// Memory threshold checking and notification
static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
{
    struct mem_cgroup_threshold_ary *t;
    unsigned long usage;
    int i;
    rcu_read_lock();
    if (!swap)
        t = rcu_dereference(memcg->thresholds.primary);
    else
        t = rcu_dereference(memcg->memsw_thresholds.primary);
    if (!t)
        goto unlock;
    usage = mem_cgroup_usage(memcg, swap);
    // Check which thresholds the current usage has crossed,
    // starting from the last threshold known to be below the usage
    i = t->current_threshold;
    // Walk downwards and signal thresholds the usage has fallen below
    for (; i >= 0 && unlikely(t->entries[i].threshold > usage); i--)
        eventfd_signal(t->entries[i].eventfd, 1); // signal the eventfd
    i++;
    // Walk upwards and signal thresholds the usage has risen past
    for (; i < t->size && unlikely(t->entries[i].threshold <= usage); i++)
        eventfd_signal(t->entries[i].eventfd, 1);
    // Remember the new current threshold index
    t->current_threshold = i - 1;
unlock:
    rcu_read_unlock();
}

// Per-cgroup threshold check
static void mem_cgroup_threshold(struct mem_cgroup *memcg)
{
    while (memcg) {
        __mem_cgroup_threshold(memcg, false); // check memory thresholds
        if (do_swap_account)
            __mem_cgroup_threshold(memcg, true); // check swap thresholds
        memcg = parent_mem_cgroup(memcg);
    }
}

// Called when memory usage changes
static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
{
    unsigned long excess;
    struct mem_cgroup_per_zone *mz;
    struct mem_cgroup_tree_per_zone *mctz;
    int nid = page_to_nid(page);
    int zid = page_zonenum(page);
    mctz = soft_limit_tree_from_page(page);
    // Update the soft limit tree
    for (; memcg; memcg = parent_mem_cgroup(memcg)) {
        mz = mem_cgroup_page_zoneinfo(memcg, page);
        excess = soft_limit_excess(memcg);
        if (excess || mz->on_tree) {
            spin_lock(&mctz->lock);
            if (excess)
                __mem_cgroup_insert_exceeded(mz, mctz, excess);
            else
                __mem_cgroup_remove_exceeded(mz, mctz);
            spin_unlock(&mctz->lock);
        }
    }
}
eventfd notification call flow:
mem_cgroup_charge_common() // mm/memcontrol.c:2789
 └── mem_cgroup_threshold() // mm/memcontrol.c:4234
     └── __mem_cgroup_threshold() // mm/memcontrol.c:4189
         ├── mem_cgroup_usage_in_excess() // mm/memcontrol.c:1089
         └── eventfd_signal() // fs/eventfd.c:68
             └── wake_up_locked_poll() // kernel/sched/wait.c:231
                 └── wake_up_poll() // include/linux/wait.h:185
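As a user-space model of the walk that __mem_cgroup_threshold performs over its sorted threshold array, the Go sketch below signals every registered threshold that the usage has just crossed in either direction; it is a simplified illustration, not kernel code.

package main

import "fmt"

// thresholdEntry models one registered (threshold, eventfd) pair; the
// "signal" here is just a callback instead of an eventfd_signal().
type thresholdEntry struct {
    threshold uint64
    signal    func()
}

// thresholdArray mirrors the sorted array plus the current index the kernel keeps.
type thresholdArray struct {
    entries []thresholdEntry // sorted ascending by threshold
    current int              // index of the last threshold at or below usage
}

// check replays the two walks from __mem_cgroup_threshold for a new usage value:
// first downward from the current index, then upward from current+1.
func (t *thresholdArray) check(usage uint64) {
    i := t.current
    // Walk down: signal thresholds the usage has fallen below.
    for ; i >= 0 && t.entries[i].threshold > usage; i-- {
        t.entries[i].signal()
    }
    i++
    // Walk up: signal thresholds the usage has risen past.
    for ; i < len(t.entries) && t.entries[i].threshold <= usage; i++ {
        t.entries[i].signal()
    }
    t.current = i - 1
}

func main() {
    ta := &thresholdArray{current: -1}
    for _, th := range []uint64{1 << 20, 10 << 20, 100 << 20} {
        th := th
        ta.entries = append(ta.entries, thresholdEntry{th, func() { fmt.Println("crossed", th) }})
    }
    ta.check(50 << 20) // crosses 1Mi and 10Mi on the way up
    ta.check(2 << 20)  // crosses 10Mi on the way back down
}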
7.3.1 Memory Pressure Level Handling
Kernel source: mm/memcontrol.c
// Memory cgroup event targets
enum mem_cgroup_events_target {
    MEM_CGROUP_TARGET_THRESH,
    MEM_CGROUP_TARGET_SOFTLIMIT,
    MEM_CGROUP_TARGET_NUMAINFO,
    MEM_CGROUP_NTARGETS,
};

// Rate-limit memory pressure event handling
static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
                                       enum mem_cgroup_events_target target)
{
    unsigned long val, next;
    val = __this_cpu_read(memcg->stat->nr_page_events);
    next = __this_cpu_read(memcg->stat->targets[target]);
    if ((long)next - (long)val < 0) {
        switch (target) {
        case MEM_CGROUP_TARGET_THRESH:
            next = val + THRESHOLDS_EVENTS_TARGET;
            break;
        case MEM_CGROUP_TARGET_SOFTLIMIT:
            next = val + SOFTLIMIT_EVENTS_TARGET;
            break;
        case MEM_CGROUP_TARGET_NUMAINFO:
            next = val + NUMAINFO_EVENTS_TARGET;
            break;
        default:
            break;
        }
        __this_cpu_write(memcg->stat->targets[target], next);
        return true;
    }
    return false;
}
7.3.2 Forced Page Reclaim
Kernel source: mm/memcontrol.c
// Force-empty a cgroup's memory
static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
{
    int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
    /* we call try-to-free pages to make this cgroup empty */
    lru_add_drain_all();
    drain_all_stock_sync(memcg);
    /* try to free all pages in this cgroup */
    while (nr_retries && page_counter_read(&memcg->memory)) {
        int progress;
        if (signal_pending(current))
            return -EINTR;
        progress = try_to_free_mem_cgroup_pages(memcg, 1,
                                                GFP_KERNEL, true);
        if (!progress) {
            nr_retries--;
            /* maybe some writeback is necessary */
            congestion_wait(BLK_RW_ASYNC, HZ / 10);
        }
    }
    return 0;
}

// cgroupfs interface that triggers the forced reclaim
static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of,
                                            char *buf, size_t nbytes,
                                            loff_t off)
{
    struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
    if (mem_cgroup_is_root(memcg))
        return -EINVAL;
    return mem_cgroup_force_empty(memcg) ?: nbytes;
}
8. Linux Kernel Memory Management in Detail
8.1 How Memory Reclaim Gets Started
In parallel with notifying user space through eventfd, the Linux kernel responds to memory pressure with several layers of memory management mechanisms:
8.1.1 Background Reclaim (kswapd)
Kernel source: mm/vmscan.c
// Main loop of the kswapd daemon
static int kswapd(void *p)
{
    struct pglist_data *pgdat = (struct pglist_data *)p;
    struct task_struct *tsk = current;
    tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
    set_freezable();
    for ( ; ; ) {
        bool ret;
        alloc_order = reclaim_order = 0;
        classzone_idx = pgdat->nr_zones - 1;
        ret = try_to_freeze();
        if (kthread_should_stop())
            break;
        // Check whether reclaim is needed
        if (!ret) {
            trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx, alloc_order);
            reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
            if (reclaim_order < alloc_order)
                goto kswapd_try_sleep;
        }
kswapd_try_sleep:
        // Sleep when there is no memory pressure
        if (!kswapd_shrink_node(pgdat, &sc))
            schedule();
    }
    tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
    return 0;
}

// Core function that balances a node's zones
static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
{
    int i;
    int end_zone = 0;
    unsigned long nr_soft_reclaimed;
    unsigned long nr_soft_scanned;
    struct scan_control sc = {
        .gfp_mask = GFP_KERNEL,
        .order = order,
        .priority = DEF_PRIORITY,
        .may_writepage = !laptop_mode,
        .may_unmap = 1,
        .may_swap = 1,
    };
    do {
        unsigned long nr_attempted = 0;
        bool raise_priority = true;
        bool pgdat_needs_compaction = (order > 0);
        sc.nr_reclaimed = 0;
        // Find the highest zone that is below its watermark
        for (i = classzone_idx; i >= 0; i--) {
            struct zone *zone = pgdat->node_zones + i;
            if (!populated_zone(zone))
                continue;
            // Check the watermarks
            if (!zone_balanced(zone, order, 0, classzone_idx)) {
                end_zone = i;
                break;
            }
        }
        if (i < 0)
            goto out;
        // Reclaim from each zone up to end_zone
        for (i = 0; i <= end_zone; i++) {
            struct zone *zone = pgdat->node_zones + i;
            if (!populated_zone(zone))
                continue;
            sc.nr_scanned = 0;
            nr_soft_scanned = 0;
            nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
                                    sc.order, sc.gfp_mask,
                                    &nr_soft_scanned);
            sc.nr_reclaimed += nr_soft_reclaimed;
            // Shrink this zone
            shrink_zone(zone, &sc, zone_idx(zone) == classzone_idx);
        }
        // Lower the priority to reclaim more aggressively
        if (sc.priority < DEF_PRIORITY - 2)
            sc.may_writepage = 1;
    } while (--sc.priority >= 0);
out:
    return sc.order;
}
8.1.2 Direct Reclaim
Kernel source: mm/page_alloc.c
// Direct reclaim during allocation
static struct page *__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
        unsigned int alloc_flags, const struct alloc_context *ac,
        unsigned long *did_some_progress)
{
    struct page *page = NULL;
    unsigned long pflags;
    bool drained = false;
    psi_memstall_enter(&pflags);
    *did_some_progress = __perform_reclaim(gfp_mask, order, ac);
    if (unlikely(!(*did_some_progress)))
        goto out;
retry:
    page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
    // If allocation still fails, drain the per-CPU page lists and retry once
    if (!page && !drained) {
        unreserve_highatomic_pageblock(ac, false);
        drain_all_pages(NULL);
        drained = true;
        goto retry;
    }
out:
    psi_memstall_leave(&pflags);
    return page;
}

// Core function that performs the reclaim
static unsigned long __perform_reclaim(gfp_t gfp_mask, unsigned int order,
                                       const struct alloc_context *ac)
{
    struct reclaim_state reclaim_state;
    unsigned long progress;
    unsigned int noreclaim_flag;
    cond_resched();
    // Set up the reclaim state
    noreclaim_flag = memalloc_noreclaim_save();
    reclaim_state.reclaimed_slab = 0;
    current->reclaim_state = &reclaim_state;
    // Perform page reclaim
    progress = try_to_free_pages(ac->zonelist, order, gfp_mask, ac->nodemask);
    current->reclaim_state = NULL;
    memalloc_noreclaim_restore(noreclaim_flag);
    cond_resched();
    return progress;
}
8.2 cgroup-Level Memory Pressure Handling
8.2.1 Memory Pressure Detection
Kernel source: mm/memcontrol.c
// Memory cgroup event targets (same definitions as in section 7.3.1)
enum mem_cgroup_events_target {
    MEM_CGROUP_TARGET_THRESH,
    MEM_CGROUP_TARGET_SOFTLIMIT,
    MEM_CGROUP_TARGET_NUMAINFO,
    MEM_CGROUP_NTARGETS,
};

// Rate-limit memory pressure event handling
static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
                                       enum mem_cgroup_events_target target)
{
    unsigned long val, next;
    val = __this_cpu_read(memcg->stat->nr_page_events);
    next = __this_cpu_read(memcg->stat->targets[target]);
    if ((long)next - (long)val < 0) {
        switch (target) {
        case MEM_CGROUP_TARGET_THRESH:
            next = val + THRESHOLDS_EVENTS_TARGET;
            break;
        case MEM_CGROUP_TARGET_SOFTLIMIT:
            next = val + SOFTLIMIT_EVENTS_TARGET;
            break;
        case MEM_CGROUP_TARGET_NUMAINFO:
            next = val + NUMAINFO_EVENTS_TARGET;
            break;
        default:
            break;
        }
        __this_cpu_write(memcg->stat->targets[target], next);
        return true;
    }
    return false;
}

// Update per-cgroup statistics when pages are charged
static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
                                         struct page *page,
                                         bool compound, int nr_pages)
{
    // Update the counters
    __this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS], nr_pages);
    if (PageSwapBacked(page))
        __this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE], nr_pages);
    // Check whether threshold events need to fire
    __this_cpu_add(memcg->stat->nr_page_events, nr_pages);
    if (mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_THRESH)) {
        mem_cgroup_threshold(memcg);
        mem_cgroup_oom_check(memcg);
    }
}
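The rate limiter above only re-runs the threshold checks after a batch of page events has accumulated. A small user-space model of that bookkeeping is shown below; the target of 128 events is an arbitrary stand-in for THRESHOLDS_EVENTS_TARGET.

package main

import "fmt"

const thresholdsEventsTarget = 128 // arbitrary stand-in for THRESHOLDS_EVENTS_TARGET

// eventRatelimit models mem_cgroup_event_ratelimit: it tracks the number of
// page events and only returns true once the counter passes the next target.
type eventRatelimit struct {
    nrPageEvents uint64
    nextTarget   uint64
}

func (e *eventRatelimit) addPages(n uint64) bool {
    e.nrPageEvents += n
    if e.nrPageEvents >= e.nextTarget {
        e.nextTarget = e.nrPageEvents + thresholdsEventsTarget
        return true // time to re-run the threshold checks
    }
    return false
}

func main() {
    rl := &eventRatelimit{}
    fired := 0
    for i := 0; i < 1000; i++ {
        if rl.addPages(1) { // charge one page at a time
            fired++
        }
    }
    fmt.Println("threshold checks triggered:", fired) // fires roughly once every 128 pages
}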
8.2.2 Forced Page Reclaim
Kernel source: mm/memcontrol.c
// Force-empty a cgroup's memory (see also section 7.3.2)
static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
{
    int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
    /* call try-to-free pages to empty this cgroup */
    lru_add_drain_all();
    drain_all_stock_sync(memcg);
    /* try to free all pages in this cgroup */
    while (nr_retries && page_counter_read(&memcg->memory)) {
        int progress;
        if (signal_pending(current))
            return -EINTR;
        progress = try_to_free_mem_cgroup_pages(memcg, 1,
                                                GFP_KERNEL, true);
        if (!progress) {
            nr_retries--;
            /* some writeback may be needed */
            congestion_wait(BLK_RW_ASYNC, HZ / 10);
        }
    }
    return 0;
}

// cgroupfs interface that triggers the forced reclaim
static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of,
                                            char *buf, size_t nbytes,
                                            loff_t off)
{
    struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
    if (mem_cgroup_is_root(memcg))
        return -EINVAL;
    return mem_cgroup_force_empty(memcg) ?: nbytes;
}

// cgroup-level page reclaim
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
                                           unsigned long nr_pages,
                                           gfp_t gfp_mask,
                                           bool may_swap)
{
    struct zonelist *zonelist;
    unsigned long nr_reclaimed;
    unsigned long pflags;
    int nid;
    struct scan_control sc = {
        .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
        .gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
                    (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
        .reclaim_idx = MAX_NR_ZONES - 1,
        .target_mem_cgroup = memcg,
        .priority = DEF_PRIORITY,
        .may_writepage = !laptop_mode,
        .may_unmap = 1,
        .may_swap = may_swap,
    };
    psi_memstall_enter(&pflags);
    // Pick a node to reclaim from
    nid = mem_cgroup_select_victim_node(memcg);
    zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK];
    trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask);
    noreclaim_flag = memalloc_noreclaim_save();
    nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
    memalloc_noreclaim_restore(noreclaim_flag);
    trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
    psi_memstall_leave(&pflags);
    return nr_reclaimed;
}
8.3 The OOM Killer
8.3.1 OOM Detection and Triggering
Kernel source: mm/oom_kill.c
// Main entry point of the OOM killer
bool out_of_memory(struct oom_control *oc)
{
    unsigned long freed = 0;
    enum oom_constraint constraint = CONSTRAINT_NONE;
    if (oom_killer_disabled)
        return false;
    if (!is_memcg_oom(oc)) {
        blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
        if (freed > 0)
            /* Got some memory back in the last second. */
            return true;
    }
    // Check allocation constraints
    constraint = constrained_alloc(oc);
    if (constraint != CONSTRAINT_MEMORY_POLICY)
        oc->nodemask = NULL;
    check_panic_on_oom(oc, constraint);
    if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
        current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
        current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
        get_task_struct(current);
        oom_kill_process(oc, current, 0, oc->totalpages,
                         "Out of memory (oom_kill_allocating_task)");
        return true;
    }
    // Pick a process to kill
    select_bad_process(oc);
    /* Found nothing?!?! Either we hang forever, or we panic. */
    if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
        dump_header(oc, NULL);
        panic("Out of memory and no killable processes...\n");
    }
    if (oc->chosen && oc->chosen != (void *)-1UL) {
        oom_kill_process(oc, oc->chosen, oc->chosen_points, oc->totalpages,
                         "Out of memory");
        /*
         * Give the killed process a good chance to exit before trying
         * to allocate memory again.
         */
        schedule_timeout_killable(1);
    }
    return !!oc->chosen;
}

// Pick the process to kill
static void select_bad_process(struct oom_control *oc)
{
    if (is_memcg_oom(oc))
        mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
    else {
        struct task_struct *p;
        rcu_read_lock();
        for_each_process(p) {
            if (oom_evaluate_task(p, oc))
                break;
        }
        rcu_read_unlock();
    }
    oc->chosen_points = oc->chosen_points * 1000 / oc->totalpages;
}

// Evaluate whether a task is a suitable victim
static int oom_evaluate_task(struct task_struct *task, void *arg)
{
    struct oom_control *oc = arg;
    unsigned long points;
    if (oom_unkillable_task(task, NULL, oc->nodemask))
        goto next;
    // Compute the OOM badness score
    points = oom_badness(task, NULL, oc->nodemask, oc->totalpages);
    if (!points || points < oc->chosen_points)
        goto next;
    if (oc->chosen)
        put_task_struct(oc->chosen);
    get_task_struct(task);
    oc->chosen = task;
    oc->chosen_points = points;
next:
    return 0;
}

// Compute a task's OOM badness score
unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
                          const nodemask_t *nodemask, unsigned long totalpages)
{
    long points;
    long adj;
    if (oom_unkillable_task(p, memcg, nodemask))
        return 0;
    p = find_lock_task_mm(p);
    if (!p)
        return 0;
    adj = (long)p->signal->oom_score_adj;
    if (adj == OOM_SCORE_ADJ_MIN ||
        test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
        in_vfork(p)) {
        task_unlock(p);
        return 0;
    }
    // Base score: resident pages, swap entries, and page table pages
    points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
             mm_pgtables_bytes(p->mm) / PAGE_SIZE;
    task_unlock(p);
    // Apply the user-space adjustment (oom_score_adj)
    adj *= totalpages / 1000;
    points += adj;
    // Keep the score in a valid range
    return points > 0 ? points : 1;
}
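The formula above boils down to points = rss + swap_entries + page_table_pages + oom_score_adj * totalpages / 1000, measured in pages. A quick worked example with made-up numbers:

package main

import "fmt"

// oomBadness mirrors the arithmetic of oom_badness(): all inputs are in
// 4 KiB pages except adj, which is the task's oom_score_adj (-1000..1000).
func oomBadness(rss, swapEnts, pgtablePages, totalPages int64, adj int64) int64 {
    points := rss + swapEnts + pgtablePages
    points += adj * totalPages / 1000
    if points <= 0 {
        return 1 // the kernel never returns less than 1 for a killable task
    }
    return points
}

func main() {
    totalPages := int64((8 << 30) / 4096) // 8 GiB node => 2,097,152 pages

    // Hypothetical workload: 1 GiB RSS, 256 MiB swapped, 8 MiB of page tables.
    base := oomBadness((1<<30)/4096, (256<<20)/4096, (8<<20)/4096, totalPages, 0)
    fmt.Println("oom_score_adj=0:   ", base)

    // A strongly negative oom_score_adj (similar to what kubelet sets for
    // Guaranteed pods) pushes the score down to the floor value of 1.
    protected := oomBadness((1<<30)/4096, (256<<20)/4096, (8<<20)/4096, totalPages, -500)
    fmt.Println("oom_score_adj=-500:", protected)
}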
8.3.2 Killing the Victim Process
Kernel source: mm/oom_kill.c
// Core function that performs the OOM kill
static void oom_kill_process(struct oom_control *oc, struct task_struct *p,
                             unsigned int points, unsigned long totalpages,
                             const char *message)
{
    struct task_struct *victim = p;
    struct task_struct *child;
    struct task_struct *t;
    struct mm_struct *mm;
    unsigned int victim_points = 0;
    static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
                                  DEFAULT_RATELIMIT_BURST);
    bool can_oom_reap = true;
    // If the task is already exiting, just let it free its memory
    task_lock(p);
    if (task_will_free_mem(p)) {
        mark_oom_victim(p);
        wake_oom_reaper(p);
        task_unlock(p);
        put_task_struct(p);
        return;
    }
    task_unlock(p);
    if (__ratelimit(&oom_rs))
        dump_header(oc, p);
    pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
           message, task_pid_nr(p), p->comm, points);
    // Prefer to sacrifice the worst-scoring child instead of the parent
    read_lock(&tasklist_lock);
    for_each_thread(p, t) {
        list_for_each_entry(child, &t->children, sibling) {
            unsigned int child_points;
            if (process_shares_mm(child, p->mm))
                continue;
            child_points = oom_badness(child, oc->memcg, oc->nodemask,
                                       oc->totalpages);
            if (child_points > victim_points) {
                put_task_struct(victim);
                victim = child;
                victim_points = child_points;
                get_task_struct(victim);
            }
        }
    }
    read_unlock(&tasklist_lock);
    // Lock the victim's mm
    p = find_lock_task_mm(victim);
    if (!p) {
        put_task_struct(victim);
        return;
    } else if (victim != p) {
        get_task_struct(p);
        put_task_struct(victim);
        victim = p;
    }
    // Pin the memory descriptor
    mm = victim->mm;
    atomic_inc(&mm->mm_count);
    // No need to kill a task that is already releasing its memory
    if (task_will_free_mem(victim)) {
        mark_oom_victim(victim);
        wake_oom_reaper(victim);
        task_unlock(victim);
        mmput(mm);
        put_task_struct(victim);
        return;
    }
    if (victim->flags & PF_KTHREAD) {
        task_unlock(victim);
        mmput(mm);
        put_task_struct(victim);
        return;
    }
    if (is_global_init(victim)) {
        can_oom_reap = false;
        set_bit(MMF_OOM_SKIP, &mm->flags);
        pr_info("oom killer %d (%s) has mm pinned by %d\n",
                task_pid_nr(victim), victim->comm,
                atomic_read(&mm->mm_count));
    }
    // Mark the task as an OOM victim
    mark_oom_victim(victim);
    pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
           task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
           K(get_mm_counter(victim->mm, MM_ANONPAGES)),
           K(get_mm_counter(victim->mm, MM_FILEPAGES)),
           K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
    task_unlock(victim);
    // Send SIGKILL to every other process sharing the victim's mm
    rcu_read_lock();
    for_each_process(p) {
        if (task_pid_nr(p) == task_pid_nr(victim))
            continue;
        if (!process_shares_mm(p, mm))
            continue;
        if (same_thread_group(p, victim))
            continue;
        if (is_global_init(p)) {
            can_oom_reap = false;
            set_bit(MMF_OOM_SKIP, &mm->flags);
            continue;
        }
        do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
    }
    rcu_read_unlock();
    // Send SIGKILL to the victim itself
    do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
    // Wake the OOM reaper to tear down the victim's address space
    if (can_oom_reap)
        wake_oom_reaper(victim);
    mmput(mm);
    put_task_struct(victim);
}
8.4 Memory Allocation Policy Adjustments
8.4.1 How GFP Flags Affect Reclaim Behavior
Kernel source: include/linux/gfp.h
// GFP flag definitions
#define ___GFP_DMA 0x01u
#define ___GFP_HIGHMEM 0x02u
#define ___GFP_DMA32 0x04u
#define ___GFP_MOVABLE 0x08u
#define ___GFP_RECLAIMABLE 0x10u
#define ___GFP_HIGH 0x20u
#define ___GFP_IO 0x40u
#define ___GFP_FS 0x80u
#define ___GFP_COLD 0x100u
#define ___GFP_NOWARN 0x200u
#define ___GFP_RETRY_MAYFAIL 0x400u
#define ___GFP_NOFAIL 0x800u
#define ___GFP_NORETRY 0x1000u
#define ___GFP_MEMALLOC 0x2000u
#define ___GFP_COMP 0x4000u
#define ___GFP_ZERO 0x8000u
#define ___GFP_NOMEMALLOC 0x10000u
#define ___GFP_HARDWALL 0x20000u
#define ___GFP_THISNODE 0x40000u
#define ___GFP_ATOMIC 0x80000u
#define ___GFP_ACCOUNT 0x100000u
#define ___GFP_DIRECT_RECLAIM 0x200000u
#define ___GFP_WRITE 0x400000u
#define ___GFP_KSWAPD_RECLAIM 0x800000u
// Common GFP combinations
#define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
#define GFP_NOIO (__GFP_RECLAIM)
#define GFP_NOFS (__GFP_RECLAIM | __GFP_IO)
#define GFP_TEMPORARY (__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
__GFP_RECLAIMABLE)
#define GFP_USER (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_DMA __GFP_DMA
#define GFP_DMA32 __GFP_DMA32
#define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)
#define GFP_HIGHUSER_MOVABLE (GFP_USER | __GFP_HIGHMEM | __GFP_MOVABLE)
#define GFP_TRANSHUGE_LIGHT ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
__GFP_NOMEMALLOC | __GFP_NOWARN) & \
~__GFP_RECLAIM)
#define GFP_TRANSHUGE (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
8.4.2 Handling Allocation Failures
Kernel source: mm/page_alloc.c
// Handling of memory allocation failure
static inline bool should_suppress_show_mem(void)
{
    bool ret = false;
#if NODES_SHIFT > 8
    ret = in_interrupt();
#endif
    return ret;
}

static void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
{
    struct va_format vaf;
    va_list args;
    static DEFINE_RATELIMIT_STATE(nopage_rs, 10 * HZ, 1);
    if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
        debug_guardpage_minorder() > 0)
        return;
    pr_warn("%s: ", current->comm);
    va_start(args, fmt);
    vaf.fmt = fmt;
    vaf.va = &args;
    pr_warn("%pV", &vaf);
    va_end(args);
    pr_cont(", mode:%#x(%pGg), nodemask=", gfp_mask, &gfp_mask);
    if (nodemask)
        pr_cont("%*pbl\n", nodemask_pr_args(nodemask));
    else
        pr_cont("(null)\n");
    cpuset_print_current_mems_allowed();
    if (!should_suppress_show_mem())
        show_mem(SHOW_MEM_FILTER_NODES, nodemask);
}

// Decide whether the allocation should retry reclaim
static inline bool should_reclaim_retry(gfp_t gfp_mask, unsigned order,
                                        struct alloc_context *ac,
                                        int alloc_flags,
                                        bool did_some_progress,
                                        int *no_progress_loops)
{
    struct zone *zone;
    struct zoneref *z;
    // If retries are not allowed, bail out immediately
    if (!(gfp_mask & __GFP_RETRY_MAYFAIL))
        return false;
    // Do not retry high-order allocations
    if (order > PAGE_ALLOC_COSTLY_ORDER)
        return false;
    // Reset the counter if progress was made, otherwise count the failure
    if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
        *no_progress_loops = 0;
    else
        (*no_progress_loops)++;
    // Check whether any zone still has reclaimable memory
    for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
                                    ac->nodemask) {
        unsigned long available;
        unsigned long reclaimable;
        unsigned long min_wmark = min_wmark_pages(zone);
        available = reclaimable = zone_reclaimable_pages(zone);
        available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
        // Keep retrying if enough memory could still be reclaimed
        if (available > min_wmark)
            return true;
    }
    // Stop retrying after too many attempts without progress
    if (*no_progress_loops > MAX_RECLAIM_RETRIES)
        return false;
    return true;
}
9. Complete Flow Diagram
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────────┐
│   kubelet.go    │───▶│ eviction_manager │───▶│  memory_threshold_  │
│    startup      │    │     .Start()     │    │      notifier       │
└─────────────────┘    └──────────────────┘    └─────────────────────┘
                                │                          │
                                ▼                          ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────────┐
│ monitoring loop │◀───│  synchronize()   │◀───│ threshold_notifier_ │
│   (periodic)    │    │  (control loop)  │    │      linux.go       │
└─────────────────┘    └──────────────────┘    └─────────────────────┘
                                │                          │
                                ▼                          ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────────┐
│ pod ranking and │◀───│ threshold checks │◀───│    cgroup.event_    │
│ victim selection│    │   and reclaim    │    │       control       │
└─────────────────┘    └──────────────────┘    └─────────────────────┘
                                │                          │
                                ▼                          ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────────┐
│   evictPod()    │◀───│   killPodFunc    │◀───│    Linux kernel     │
│ pod termination │    │  (kill the pod)  │    │   memory reclaim    │
└─────────────────┘    └──────────────────┘    └─────────────────────┘
                                                           │
                                                           ▼
                                               ┌─────────────────────┐
                                               │ memory reclaim:     │
                                               │ 1. kswapd reclaim   │
                                               │ 2. direct reclaim   │
                                               │ 3. cgroup reclaim   │
                                               │ 4. OOM killer       │
                                               └─────────────────────┘
10. Key Configuration Parameters
10.1 kubelet Configuration
# kubelet configuration file
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
evictionSoft:
  memory.available: "200Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
evictionMaxPodGracePeriod: 30
evictionMonitoringPeriod: "10s"
kernelMemcgNotification: true # enable kernel memory cgroup notifications
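With the settings above, memory eviction becomes hard once memory.available drops below 100Mi, and soft (with a 1m30s grace period) below 200Mi. A small sketch of that decision, with an invented current value:

package main

import (
    "fmt"
    "time"
)

// Simplified model of how the hard and soft memory thresholds above are
// evaluated; all values are illustrative.
func main() {
    const Mi = int64(1) << 20

    hard := 100 * Mi              // evictionHard memory.available
    soft := 200 * Mi              // evictionSoft memory.available
    softGrace := 90 * time.Second // evictionSoftGracePeriod "1m30s"

    available := 150 * Mi // hypothetical current memory.available on the node

    switch {
    case available < hard:
        fmt.Println("hard threshold met: evict immediately with grace period 0")
    case available < soft:
        fmt.Printf("soft threshold met: evict only if it stays below %dMi for %s\n",
            soft/Mi, softGrace)
    default:
        fmt.Println("no memory pressure")
    }
}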
10.2 cgroup Configuration
# Show the memory cgroup mount point
cat /proc/mounts | grep memory
# Show a Pod's memory usage
cat /sys/fs/cgroup/memory/kubepods/pod<pod-uid>/memory.usage_in_bytes
# Show the memory limit
cat /sys/fs/cgroup/memory/kubepods/pod<pod-uid>/memory.limit_in_bytes
# Show memory statistics
cat /sys/fs/cgroup/memory/kubepods/pod<pod-uid>/memory.stat
11. Monitoring and Troubleshooting
11.1 Key Logs
# kubelet eviction-related logs
journalctl -u kubelet | grep -i eviction
# memory pressure related logs
journalctl -u kubelet | grep -i "memory pressure"
# OOM-related logs
dmesg | grep -i "killed process"
journalctl -k | grep -i oom
11.2 Monitoring Metrics
# kubelet metrics
curl -s http://localhost:10255/metrics | grep eviction
# node memory usage
free -h
cat /proc/meminfo
# cgroup memory usage
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/memory.stat
11.3 Troubleshooting Steps
Check node memory usage:
free -h
cat /proc/meminfo
Check the kubelet configuration:
cat /var/lib/kubelet/config.yaml | grep -A 10 eviction
Check Pod memory usage:
kubectl top pods --all-namespaces
kubectl describe node <node-name>
Check eviction events:
kubectl get events --field-selector reason=Evicted
Check cgroup settings:
find /sys/fs/cgroup/memory -name "memory.usage_in_bytes" -exec cat {} \;
12. Performance Tuning Recommendations
12.1 Threshold Configuration
Set sensible hard thresholds: avoid values so low that they cause frequent evictions
Configure soft thresholds: provide a buffer period for graceful handling
Tune the monitoring interval: balance responsiveness against system overhead
12.2 Memory Management
Enable kernel memory cgroup notifications: improves responsiveness
Set appropriate Pod resource limits: contains the effect of memory leaks
Use memory affinity: optimizes NUMA node memory placement
12.3 System-Level Tuning
Tune vm.swappiness: controls how aggressively swap is used
Configure memory overcommit: adjusts the memory allocation policy
Enable memory compression: improves effective memory utilization
13. Summary
The complete flow from kubelet's evictionManager to the Linux kernel's memory reclaim mechanism consists of the following key stages:
Initialization: kubelet creates and starts the evictionManager at startup
Monitoring: resource usage is watched through memory threshold notifiers and periodic synchronization
Detection: Linux cgroup event notifications detect memory threshold crossings in real time
Decision: the synchronize method checks thresholds and decides whether to evict
Execution: the evictPod method terminates the selected Pod
Reclaim: the Linux kernel performs memory reclaim and OOM handling
This mechanism ensures that a Kubernetes cluster can respond promptly when memory becomes scarce, freeing memory by evicting Pods and preventing the whole node from becoming unusable due to memory exhaustion. Understanding the complete flow is valuable for tuning cluster performance, troubleshooting memory-related problems, and designing highly available containerized applications.
14. Appendix
Kubernetes issues
References