Endless calls to xas_split_alloc() due to corrupted xarray entry

Messages in this thread

  • First message in thread
  • Matthew Wilcox
    • Gavin Shan
      • David Hildenbrand
        • Matthew Wilcox
          • Linus Torvalds

Date: Wed, 19 Jun 2024 11:45:22 +0200
Subject: Re: Endless calls to xas_split_alloc() due to corrupted xarray entry
From: David Hildenbrand <>
On 30.09.23 04:12, Gavin Shan wrote:
> Hi Matthew,
>
> On 9/30/23 05:17, Matthew Wilcox wrote:
>> On Mon, Sep 25, 2023 at 05:04:16PM +0700, Bagas Sanjaya wrote:
>>> On Fri, Sep 22, 2023 at 11:56:43AM +0800, Zhenyu Zhang wrote:
>>>> Hi all,
>>>>
>>>> we don't know how the xarray entry was corrupted. Maybe it's a known
>>>> issue to community.
>>>> Lets see.
>>>>
>>>> Contents
>>>> --------
>>>> 1. Problem Statement
>>>> 2. The call trace
>>>> 3. The captured data by bpftrace
>>>>
>>>>
>>>> 1. Problem Statement
>>>> --------------------
>>>> With a 4k guest and a 64k host on aarch64 (Ampere's Altra Max CPU), we hit the following call trace:
>>>> Steps:
>>>> 1) Set up hugepages on the host.
>>>> # echo 60 > /proc/sys/vm/nr_hugepages
>>>> 2) Mount this hugepage to /mnt/kvm_hugepage.
>>>> # mount -t hugetlbfs -o pagesize=524288K none /mnt/kvm_hugepage
>>>
>>> What block device/disk image do you use to format the filesystem?
>>
>> It's hugetlbfs, Bagas.
>>
>
> The hugetlbfs pages are reserved, but never used. In this way, the available
> system memory is reduced, so it has the same effect as the "mem=xxx" boot
> parameter.
>
>>>> 3) HugePages didn't leak when using non-existent mem-path.
>>>> # mkdir -p /mnt/tmp
>>>> 4) Boot guest.
>>>> # /usr/libexec/qemu-kvm \
>>>> ...
>>>> -m 30720 \
>>>> -object '{"size": 32212254720, "mem-path": "/mnt/tmp", "qom-type":
>>>> "memory-backend-file"}' \
>>>> -smp 4,maxcpus=4,cores=2,threads=1,clusters=1,sockets=2 \
>>>> -blockdev '{"node-name": "file_image1", "driver": "file",
>>>> "auto-read-only": true, "discard": "unmap", "aio": "threads",
>>>> "filename": "/home/kvm_autotest_root/images/back_up_4k.qcow2",
>>>> "cache": {"direct": true, "no-flush": false}}' \
>>>> -blockdev '{"node-name": "drive_image1", "driver": "qcow2",
>>>> "read-only": false, "cache": {"direct": true, "no-flush": false},
>>>> "file": "file_image1"}' \
>>>> -device '{"driver": "scsi-hd", "id": "image1", "drive":
>>>> "drive_image1", "write-cache": "on"}' \
>>>>
>>>> 5) Wait about 1 minute ------> hit Call trace
>>>>
>>>> 2. The call trace
>>>> --------------------
>>>> [ 14.982751] block dm-0: the capability attribute has been deprecated.
>>>> [ 15.690043] PEFILE: Unsigned PE binary
>>>>
>>>>
>>>> [ 90.135676] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>>>> [ 90.136629] rcu: 3-...0: (3 ticks this GP)
>>>> idle=e6ec/1/0x4000000000000000 softirq=6847/6849 fqs=232
>>>> [ 90.137293] rcu: (detected by 2, t=6012 jiffies, g=2085, q=2539 ncpus=4)
>>>> [ 90.137796] Task dump for CPU 3:
>>>> [ 90.138037] task:PK-Backend state:R running task stack:0
>>>> pid:2287 ppid:1 flags:0x00000202
>>>> [ 90.138757] Call trace:
>>>> [ 90.138940] __switch_to+0xc8/0x110
>>>> [ 90.139203] 0xb54a54f8c5fb0700
>>>>
>>>> [ 270.190849] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>>>> [ 270.191722] rcu: 3-...0: (3 ticks this GP)
>>>> idle=e6ec/1/0x4000000000000000 softirq=6847/6849 fqs=1020
>>>> [ 270.192405] rcu: (detected by 1, t=24018 jiffies, g=2085, q=3104 ncpus=4)
>>>> [ 270.192876] Task dump for CPU 3:
>>>> [ 270.193099] task:PK-Backend state:R running task stack:0
>>>> pid:2287 ppid:1 flags:0x00000202
>>>> [ 270.193774] Call trace:
>>>> [ 270.193946] __switch_to+0xc8/0x110
>>>> [ 270.194336] 0xb54a54f8c5fb0700
>>>>
>>>> [ 1228.068406] ------------[ cut here ]------------
>>>> [ 1228.073011] WARNING: CPU: 2 PID: 4496 at lib/xarray.c:1010
>>>> xas_split_alloc+0xf8/0x128
>>>> [ 1228.080828] Modules linked in: binfmt_misc vhost_net vhost
>>>> vhost_iotlb tap xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT
>>>> nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack
>>>> nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc
>>>> qrtr rfkill sunrpc vfat fat acpi_ipmi ipmi_ssif arm_spe_pmu
>>>> ipmi_devintf arm_cmn arm_dmc620_pmu ipmi_msghandler cppc_cpufreq
>>>> arm_dsu_pmu xfs libcrc32c ast drm_shmem_helper drm_kms_helper drm
>>>> crct10dif_ce ghash_ce igb nvme sha2_ce nvme_core sha256_arm64 sha1_ce
>>>> i2c_designware_platform sbsa_gwdt nvme_common i2c_algo_bit
>>>> i2c_designware_core xgene_hwmon dm_mirror dm_region_hash dm_log dm_mod
>>>> fuse
>>>> [ 1228.137630] CPU: 2 PID: 4496 Comm: qemu-kvm Kdump: loaded Tainted:
>>>> G W 6.6.0-rc2-zhenyzha+ #5
>>>> [ 1228.147529] Hardware name: GIGABYTE R152-P31-00/MP32-AR1-00, BIOS
>>>> F31h (SCP: 2.10.20220810) 07/27/2022
>>>> [ 1228.156820] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>> [ 1228.163767] pc : xas_split_alloc+0xf8/0x128
>>>> [ 1228.167938] lr : __filemap_add_folio+0x33c/0x4e0
>>>> [ 1228.172543] sp : ffff80008dd4f1c0
>>>> [ 1228.175844] x29: ffff80008dd4f1c0 x28: ffffd15825388c40 x27: 0000000000000001
>>>> [ 1228.182967] x26: 0000000000000001 x25: ffffffffffffc005 x24: 0000000000000000
>>>> [ 1228.190089] x23: ffff80008dd4f270 x22: ffffffc202b00000 x21: 0000000000000000
>>>> [ 1228.197211] x20: ffffffc2007f9600 x19: 000000000000000d x18: 0000000000000014
>>>> [ 1228.204334] x17: 00000000b21b8a3f x16: 0000000013a8aa94 x15: ffffd15824625944
>>>> [ 1228.211456] x14: ffffffffffffffff x13: 0000000000000030 x12: 0101010101010101
>>>> [ 1228.218578] x11: 7f7f7f7f7f7f7f7f x10: 000000000000000a x9 : ffffd158252dd3fc
>>>> [ 1228.225701] x8 : ffff80008dd4f1c0 x7 : ffff07ffa0945468 x6 : ffff80008dd4f1c0
>>>> [ 1228.232823] x5 : 0000000000000018 x4 : 0000000000000000 x3 : 0000000000012c40
>>>> [ 1228.239945] x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
>>>> [ 1228.247067] Call trace:
>>>> [ 1228.249500] xas_split_alloc+0xf8/0x128
>>>> [ 1228.253324] __filemap_add_folio+0x33c/0x4e0
>>>> [ 1228.257582] filemap_add_folio+0x48/0xd0
>>>> [ 1228.261493] page_cache_ra_order+0x214/0x310
>>>> [ 1228.265750] ondemand_readahead+0x1a8/0x320
>>>> [ 1228.269921] page_cache_async_ra+0x64/0xa8
>>>> [ 1228.274005] filemap_fault+0x238/0xaa8
>>>> [ 1228.277742] __xfs_filemap_fault+0x60/0x3c0 [xfs]
>>>> [ 1228.282491] xfs_filemap_fault+0x54/0x68 [xfs]
>>
>> This is interesting. This path has nothing to do with the hugetlbfs
>> filesystem you've created up above. And, just to be clear, this is
>> on the host, not in the guest, right?
>>
>
> Correct, the backtrace is seen on the host. The XFS file is used as backing
> memory for the guest. QEMU maps the entire file as PRIVATE and the VMA is
> advised to use huge pages by madvise(MADV_HUGEPAGE). When the guest is
> started, QEMU calls madvise(MADV_POPULATE_WRITE) to populate the VMA. Since
> the VMA is private, copy-on-write page faults happen during the call to
> madvise(MADV_POPULATE_WRITE). In the page fault handler, readahead requests
> are processed.
>
> The backtrace, originating from WARN_ON(), is triggered when an attempt to
> allocate a huge page fails in the middle of readahead. In this specific
> case, we fall back to order-0 and attempt to modify the xarray accordingly.
> Unfortunately, this particular scenario isn't supported by
> xas_split_alloc().
>
>
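(For context, the pattern described above boils down to roughly the following
userspace sketch; the path, size and lack of error handling are illustrative,
not taken from the actual reproducer, and MADV_POPULATE_WRITE needs
Linux >= 5.14 headers.)

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 32UL << 30;                   /* illustrative size */
        int fd = open("/mnt/tmp/backing", O_RDWR); /* illustrative path */
        char *p;

        if (fd < 0)
                return 1;

        /* Private file mapping, like QEMU's non-shared memory-backend-file. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        /* Sets VM_HUGEPAGE on the VMA. */
        madvise(p, len, MADV_HUGEPAGE);

        /*
         * Prefault the whole range. Each write fault is a CoW fault on the
         * private file mapping and triggers readahead on the XFS inode,
         * which is where page_cache_ra_order()/xas_split_alloc() come in.
         */
        madvise(p, len, MADV_POPULATE_WRITE);

        close(fd);
        return 0;
}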
>>>> [ 1228.377124] ------------[ cut here ]------------
>>>> [ 1228.381728] WARNING: CPU: 2 PID: 4496 at lib/xarray.c:1010
>>>> xas_split_alloc+0xf8/0x128
>>>> [ 1228.389546] Modules linked in: binfmt_misc vhost_net vhost
>>>> vhost_iotlb tap xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT
>>>> nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack
>>>> nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc
>>>> qrtr rfkill sunrpc vfat fat acpi_ipmi ipmi_ssif arm_spe_pmu
>>>> ipmi_devintf arm_cmn arm_dmc620_pmu ipmi_msghandler cppc_cpufreq
>>>> arm_dsu_pmu xfs libcrc32c ast drm_shmem_helper drm_kms_helper drm
>>>> crct10dif_ce ghash_ce igb nvme sha2_ce nvme_core sha256_arm64 sha1_ce
>>>> i2c_designware_platform sbsa_gwdt nvme_common i2c_algo_bit
>>>> i2c_designware_core xgene_hwmon dm_mirror dm_region_hash dm_log dm_mod
>>>> fuse
>>>> [ 1228.446348] CPU: 2 PID: 4496 Comm: qemu-kvm Kdump: loaded Tainted:
>>>> G W 6.6.0-rc2-zhenyzha+ #5
>>>> [ 1228.456248] Hardware name: GIGABYTE R152-P31-00/MP32-AR1-00, BIOS
>>>> F31h (SCP: 2.10.20220810) 07/27/2022
>>>> [ 1228.465538] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>> [ 1228.472486] pc : xas_split_alloc+0xf8/0x128
>>>> [ 1228.476656] lr : __filemap_add_folio+0x33c/0x4e0
>>>> [ 1228.481261] sp : ffff80008dd4f1c0
>>>> [ 1228.484563] x29: ffff80008dd4f1c0 x28: ffffd15825388c40 x27: 0000000000000001
>>>> [ 1228.491685] x26: 0000000000000001 x25: ffffffffffffc005 x24: 0000000000000000
>>>> [ 1228.498807] x23: ffff80008dd4f270 x22: ffffffc202b00000 x21: 0000000000000000
>>>> [ 1228.505930] x20: ffffffc2007f9600 x19: 000000000000000d x18: 0000000000000014
>>>> [ 1228.513052] x17: 00000000b21b8a3f x16: 0000000013a8aa94 x15: ffffd15824625944
>>>> [ 1228.520174] x14: ffffffffffffffff x13: 0000000000000030 x12: 0101010101010101
>>>> [ 1228.527297] x11: 7f7f7f7f7f7f7f7f x10: 000000000000000a x9 : ffffd158252dd3fc
>>>> [ 1228.534419] x8 : ffff80008dd4f1c0 x7 : ffff07ffa0945468 x6 : ffff80008dd4f1c0
>>>> [ 1228.541542] x5 : 0000000000000018 x4 : 0000000000000000 x3 : 0000000000012c40
>>>> [ 1228.548664] x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
>>>> [ 1228.555786] Call trace:
>>>> [ 1228.558220] xas_split_alloc+0xf8/0x128
>>>> [ 1228.562043] __filemap_add_folio+0x33c/0x4e0
>>>> [ 1228.566300] filemap_add_folio+0x48/0xd0
>>>> [ 1228.570211] page_cache_ra_order+0x214/0x310
>>>> [ 1228.574469] ondemand_readahead+0x1a8/0x320
>>>> [ 1228.578639] page_cache_async_ra+0x64/0xa8
>>>> [ 1228.582724] filemap_fault+0x238/0xaa8
>>>> [ 1228.586460] __xfs_filemap_fault+0x60/0x3c0 [xfs]
>>>> [ 1228.591210] xfs_filemap_fault+0x54/0x68 [xfs]
>>>>
>>>>
>>>>
>>>> 3. The captured data by bpftrace
>>>> (The following is the analysis of the captured data by gshan@redhat.com)
>>>> --------------------
>>>> pid: 4475 task: qemu-kvm
>>>> file: /mnt/tmp/qemu_back_mem.mem-machine_mem.OdGYet (deleted)
>>>>
>>>> -------------------- inode --------------------
>>>> i_flags: 0x0
>>>> i_ino: 67333199
>>>> i_size: 32212254720
>>>>
>>>> ----------------- address_space ----------------
>>>> flags: 040
>>>> invalidate_lock
>>>> count: 256
>>>> owner: 0xffff07fff6e759c1
>>>> pid: 4496 task: qemu-kvm
>>>> wait_list.next: 0xffff07ffa20422e0
>>>> wait_list.prev: 0xffff07ffa20422e0
>>>>
>>>> -------------------- xarray --------------------
>>>> entry[0]: 0xffff080f7eda0002
>>>> shift: 18
>>>> offset: 0
>>>> count: 2
>>>> nr_values: 0
>>>> parent: 0x0
>>>> slots[00]: 0xffff07ffa094546a
>>>> slots[01]: 0xffff07ffa1b09b22
>>>>
>>>> entry[1]: 0xffff07ffa094546a
>>>> shift: 12
>>>> offset: 0
>>>> count: 20
>>>> nr_values: 0
>>>> parent: 0xffff080f7eda0000
>>>> slots[00]: 0xffffffc202880000
>>>> slots[01]: 0x2
>>>>
>>>> entry[2]: 0xffffffc202880000
>>>> shift: 104
>>>> offset: 128
>>>> count: 0
>>>> nr_values: 0
>>>> parent: 0xffffffc20304c888
>>>> slots[00]: 0xffff08009a960000
>>>> slots[01]: 0x2001ffffffff
>>>>
>>>> It seems the last xarray entry ("entry[2]") has been corrupted. "shift"
>>>> becomes 104 and "offset" becomes 128, which isn't reasonable.
>>
>> Um, no. Whatever tool you're using doesn't understand how XArrays work.
>> Fortunately, I wrote xa_dump() which does. entry[2] does not have bit
>> 1 set, so it is an entry, not a node. You're dereferencing a pointer to
>> a folio as if it's a pointer to a node, so no wonder it looks corrupted
>> to you. From this, we know that the folio is at least order-6, and it's
>> probably order-9 (because I bet this VMA has the VM_HUGEPAGE flag set,
>> and we're doing PMD-sized faults).
>>
>
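(Side note: the node-vs-entry distinction is encoded in the low two bits of
the pointer. A minimal standalone sketch of the check, mirroring
xa_is_internal()/xa_is_node() from include/linux/xarray.h:)

#include <stdbool.h>

/* Internal entries have the low two bits == 10b, i.e. bit 1 set, bit 0 clear. */
static bool entry_is_internal(const void *entry)
{
        return ((unsigned long)entry & 3) == 2;
}

/* Small internal values are reserved (sibling/retry markers); real nodes sit above 4096. */
static bool entry_is_node(const void *entry)
{
        return entry_is_internal(entry) && (unsigned long)entry > 4096;
}

/*
 * entry[0] = 0xffff080f7eda0002: bit 1 set   -> a node, safe to walk.
 * entry[2] = 0xffffffc202880000: bit 1 clear -> a folio pointer, so dumping
 * it as if it were a node prints garbage ("shift: 104", "offset: 128").
 */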
> Indeed, entry[2] is an entry instead of a node; it's a folio pointer that
> was being dereferenced as a node. bpftrace was used to dump the xarray.
> You're correct that the VMA has the VM_HUGEPAGE flag, set by
> madvise(MADV_HUGEPAGE). The order returned by xas_get_order() is 13, which
> is passed to xas_split_alloc().
>
> /*
>  * xas->xa_shift = 0
>  * XA_CHUNK_SHIFT = 6
>  * order = 13 (512MB huge page size vs 64KB base page size)
>  */
> void xas_split_alloc(struct xa_state *xas, void *entry, unsigned int order,
>                 gfp_t gfp)
> {
>         unsigned int sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1;
>         unsigned int mask = xas->xa_sibs;
>
>         /* XXX: no support for splitting really large entries yet */
>         if (WARN_ON(xas->xa_shift + 2 * XA_CHUNK_SHIFT < order))
>                 goto nomem;
>         :
> }

Resurrecting this, because I just became aware of it.

I recall talking to Willy at some point about the problem of order-13 not
being fully supported by the pagecache right now (IIRC primarily splitting,
which should not happen for hugetlb, which is why it is not a problem
there). And I think we discussed just blocking that for now.

So we are trying to split an order-13 entry, because we ended up
allocating+mapping an order-13 folio previously.

That's where things went wrong, given the current limitations, maybe?

#define MAX_PAGECACHE_ORDER HPAGE_PMD_ORDER

Which would translate to MAX_PAGECACHE_ORDER=13 on aarch64 with 64k.
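
Spelling that out under the usual arm64/64k assumptions:

        PAGE_SHIFT      = 16 (64 KiB base pages)
        PMD_SHIFT       = 29 (each table level covers 64 KiB / 8 = 2^13 entries)
        HPAGE_PMD_ORDER = PMD_SHIFT - PAGE_SHIFT = 13

so a PMD-sized folio is 2^13 * 64 KiB = 512 MiB.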

Staring at xas_split_alloc:

WARN_ON(xas->xa_shift + 2 * XA_CHUNK_SHIFT < order)

I suspect we don't really support THP on systems with CONFIG_BASE_SMALL.
So we can assume XA_CHUNK_SHIFT == 6.

I guess that the maximum order we support for splitting is 12? I got confused
trying to figure that out. ;)
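
Plugging in the numbers (assuming xa_shift = 0 for a root-level split and
XA_CHUNK_SHIFT = 6), my reading is:

        WARN_ON(xas->xa_shift + 2 * XA_CHUNK_SHIFT < order)
                 0            + 12                 < order

so the WARN fires for any order > 12, i.e. order-12 (two full xarray levels)
is the largest entry xas_split_alloc() will split, and the order-13 folios we
get from HPAGE_PMD_ORDER on arm64 with 64k are exactly one step over that
limit.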

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e37e16ebff7a..354cd4b7320f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -352,9 +352,12 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
* limit the maximum allocation order to PMD size. I'm not aware of any
* assumptions about maximum order if THP are disabled, but 8 seems like
* a good order (that's 1MB if you're using 4kB pages)
+ *
+ * xas_split_alloc() does not support order-13 yet, so disable that for now,
+ * which implies no 512MB THP on arm64 with 64k.
*/
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define MAX_PAGECACHE_ORDER HPAGE_PMD_ORDER
+#define MAX_PAGECACHE_ORDER min(HPAGE_PMD_ORDER,12)
#else
#define MAX_PAGECACHE_ORDER 8
#endif

I think this does not apply to hugetlb because we never end up splitting
entries. But could this also apply to shmem + PMD THP?

--
Cheers,

David / dhildenb
