Why People Don’t Change Until They Feel It

When organizations talk about change, they often start with strategy decks, new structures, revised cultures, and updated systems. Leaders gather data, analyze trends, and craft well-reasoned transformation plans. But despite all that effort, real change rarely sticks. Why? Because it misses the mark on what matters most—people.

At the heart of every successful transformation is a shift in human behavior. And human behavior doesn’t change because of logic alone. It changes when people feel something deep enough to move them. The truth is, even in the most analytical, data-obsessed environments—even among the sharpest MBAs—it’s emotion that sparks action, not spreadsheets.

The most powerful change doesn’t begin with analysis. It begins with a moment of clarity. A moment when someone sees the problem in a new light, feels its urgency, and suddenly knows: “Something has to change.” That’s the spark that ignites real momentum. You don’t convince someone to marry by citing tax benefits or lower rent—you move them with love, with conviction, with purpose. The same goes for change in organizations.

And that’s why crisis, though painful, is often the greatest catalyst. When fear, anxiety, or even a sense of despair takes hold, people become willing to let go of the habits that once served them but now hold them back. As a Harvard Business School professor once put it: “In the absence of a pressing crisis, people will keep doing what they’ve always done.” That’s human nature. We cling to comfort until discomfort becomes impossible to ignore.

So, if you're a leader driving transformation, don’t just ask for new processes or present another report. Paint the picture of what’s at stake. Create urgency—not by fear alone, but by showing what could be lost and what could be gained. Connect people emotionally to the mission. Make them feel the cost of standing still and the possibility that lies ahead.

Because change is never just about plans, systems, or strategy. It’s about belief. It’s about hope. It’s about lighting a fire inside people that says: “We can do better. We must do better. And it starts with me.”

Real transformation doesn’t come from the head. It comes from the heart. And when you speak to the heart, people move.

為什麼人們非得有所感受才會改變

當組織談論變革時,往往從策略簡報、新的架構、調整後的文化與升級的系統開始。領導者蒐集數據、分析趨勢,並設計出合理的轉型計畫。但儘管投入了這麼多心力,真正的改變卻很少能持續下去。為什麼?因為他們忽略了最關鍵的元素──人。

每一場成功的轉型背後,都有一個共同的核心:人類行為的改變。而人類的行為,並不僅僅因為邏輯推理而改變。它會在內心被某種深刻的感受觸動時,才真正產生轉變。即使是在最重視數據分析、最講求理性推論的環境中──即便是那些最頂尖的 MBA 畢業生──真正驅動行動的從來都不是試算表,而是情感。

最有力量的改變,並非始於分析,而是源自一個清晰的瞬間。一個讓人重新看見問題本質的時刻,一個讓人感受到迫切性的時刻,一個讓人心中油然而生「非改不可」的時刻。那正是點燃真正行動動力的火花。你不會因為節稅或房租更便宜而說服一個人結婚;你會因為愛、信念與人生的意義打動對方。在組織裡也是一樣的道理。

這也是為什麼,儘管令人痛苦,危機往往是最強大的催化劑。當恐懼、焦慮,甚至是絕望籠罩時,人們才會願意放下過去那些曾經帶來成功,但如今卻成為阻礙的習慣。正如哈佛商學院一位教授所言:「在沒有迫切危機的情況下,人們會繼續按照他們習慣的方式做事。」這就是人性。我們總是緊抓著熟悉與舒適,直到不舒服的程度讓我們再也無法忽視。

因此,如果你是領導變革的人,不要只是要求新的流程或再交出一份報告。請描繪出局勢的真相,創造出一種緊迫感──不只是訴諸恐懼,更要展現不改變會失去什麼,以及改變後能獲得什麼。讓人們在情感上與使命連結。讓他們真正「感受」到停滯的代價,以及前方機會的可貴。

因為變革,從來都不只是計畫、系統或策略的問題。它關乎信念、關乎希望,更關乎在每個人心中點燃那把火──那把說著「我們可以做得更好,我們必須做得更好,而這一切,從我開始」的火。

真正的轉型,從來不是從頭腦開始,而是從心開始。當你觸動人心時,行動自然隨之而來。

Change Yourself to Become a Better Decision-Maker

In April 2025, the global financial markets were rocked by a sharp and sudden downturn. The S&P 500 plunged nearly 5%—its worst single-day drop in years—while the Dow Jones tumbled over 1,600 points. The cause? A wave of new tariffs abruptly announced by Donald Trump’s administration, including a blanket 10% tax on all imports and steeper penalties on goods from key trading partners like China, Japan, and the European Union. Within hours, China retaliated with a 34% tariff on U.S. goods, igniting fears of a global trade war and deepening economic uncertainty.

It was a stark reminder of what happens when decisions are made not from facts, data, and consultation—but from bias, ideology, and political theater. The administration ignored expert warnings, economic forecasts, and long-term repercussions. This wasn’t leadership—it was ego-driven showmanship masquerading as strength.

Reflecting on my MBA journey, one lesson stands out more than any other: good decisions must be grounded in facts and a scientific approach. Theories and models are important, but more crucial is a mindset that respects evidence, seeks truth, and invites challenge. History backs this up—repeatedly.

Take 19th-century medicine, for example. Surgeons once dismissed the idea of handwashing, even when clear data showed it drastically reduced infections. The resistance wasn’t because they lacked intelligence—it was because they lacked humility. A small act could have saved millions of lives, but change was delayed by arrogance. The episode shows that even the best and brightest can fail when they ignore data and cling to outdated beliefs.

This brings us to the heart of the issue: if you decide now to design a decision-making architecture for your team or company, there is one more thing you must change—yourself. More specifically, your self-image and how you present yourself as a decision-maker.

A decision-maker isn’t just someone who picks a direction. They inspire others to follow it. Decision-making and leadership are inseparable—and leadership, in the end, is in the eye of the beholder. There is no such thing as a leader without followers. To be seen as a leader, you need others to believe in your leadership.

We’ve been conditioned to associate leadership with decisiveness, boldness, and unwavering confidence. As psychologist Gary Klein noted, John Wayne became the archetype of leadership in popular culture—the stoic cowboy who sizes up the situation and declares, “Here’s what we’re going to do.” No hesitation, no discussion, just action.

But this image is more dangerous than it is inspiring.

Cowboy-style leaders pride themselves on intuition and experience, avoid self-doubt, and see input from others as weakness. They suppress dissent, foster groupthink, and make decisions in echo chambers. And the results? Chaotic policy. Unnecessary risk. Fragile systems built on sand.

We’ve seen this play out—not just in politics, but in boardrooms and startups, too. The best decision-making techniques run against this grain. Effective leaders today encourage diverse viewpoints, embrace uncertainty, and are willing to change their minds. That doesn’t make them weak—it makes them wise.

Of course, balancing openness during deliberation with full commitment during execution is no easy feat. As Eric Schmidt once said, the formula is “diverse perspectives and a deadline.” Amazon follows a similar philosophy: “Disagree and commit.” You challenge ideas while the decision is being shaped, and once the choice is made, you commit with everything you’ve got.

That requires a shift in how we see leadership. We need to let go of outdated myths and embrace a new model—one where leadership is defined not by bravado, but by process and purpose. Look at successful CEOs who value collaboration and rigorous decision architecture. They don't pretend to have all the answers. Instead, they build teams, systems, and cultures that produce the best answers together. They own the decision, yes—but they know the power lies in the process.

Jim Collins called this the Level 5 Leader—someone who combines deep humility with fierce resolve. They are driven, passionate, and relentless in pursuit of results, but also deeply grounded and self-aware. These leaders are rare—but they exist. And they don’t look anything like John Wayne or Donald Trump.

Instead, they look a little more like Odysseus. When faced with the Sirens in Homer’s Odyssey, Odysseus didn’t trust his instincts. He knew he would be tempted. So he instructed his crew to plug their ears and tie him to the mast. He handed control to his team, trusting the system he’d built over his own momentary impulses. He survived—not because he was the strongest, but because he was the most thoughtful.

That’s leadership. That’s decision-making done right.

We need fewer cowboys and more architects—leaders who design sound processes, seek diverse input, and ground decisions in truth. Leaders who act with courage and humility. Leaders who aren’t afraid to say, “I don’t know yet—but together, we’ll find the best path forward.”

So if you want to change your team, change your company, and change your outcomes—start by changing yourself.

Make decisions with clarity. Lead with humility. Inspire with integrity. Because that’s how we make better choices. And that’s how we lead—forward.

改變自己,成為更優秀的決策者

2025 年 4 月,全球金融市場遭遇了劇烈且突如其來的下跌。標普 500 指數重挫近 5%,創下多年來單日最大跌幅,道瓊工業指數也大跌超過 1,600 點。原因來自川普政府突如其來地宣布一系列新關稅政策,包括對所有進口商品徵收 10% 的統一稅率,並對來自中國、日本與歐盟等主要貿易夥伴的商品加重懲罰性關稅。數小時內,中國迅速反擊,對美國商品徵收 34% 的報復性關稅,引發全球貿易戰的擔憂,進一步加劇經濟不確定性。

這次事件無疑是一記當頭棒喝,提醒我們:當決策並非出於事實、數據與專業諮詢,而是根植於偏見、意識形態與政治戲碼時,將會帶來多麼嚴重的後果。這層層錯誤的決策無視專家警告、經濟預測與長期衝擊,與其說是領導,不如說是被自我驅動的表演,假裝成力量的象徵。

回顧我在攻讀 MBA 的過程中,有一個觀念讓我特別深刻:良好的決策,必須建立在事實與科學方法之上。理論與模型固然重要,但更關鍵的是一種尊重證據、追求真理、樂於接受挑戰的思維模式。歷史不斷證明這一點。

舉例來說,在 19 世紀的醫學界,許多外科醫生曾經拒絕洗手這個簡單的舉動,儘管已有明確數據顯示洗手可以大幅減少感染與死亡。當時他們的抗拒不是因為缺乏知識,而是因為缺乏謙遜。一個微小的行為本可拯救數百萬條生命,卻因為傲慢而延誤多年。這正說明了,即使是最聰明的人,若無視數據、緊抓過時觀念,也會做出錯誤的決定。

這也回到今天的主題:如果你現在下定決心為你的團隊或公司設計一套決策架構,那麼還有一件更重要的事需要改變——那就是你自己。更準確地說,是你對自己的認知,以及你呈現給他人作為決策者的形象。

一位決策者,不只是那個拍板定案的人。他必須能夠激勵他人跟隨行動。決策與領導密不可分——而領導力的判斷,終究掌握在他人眼中。沒有追隨者,就不存在所謂的領導者。要讓他人相信你是領導者,你首先必須被他們視為領導者。

然而,我們長期以來被灌輸的領導印象,卻與有效決策背道而馳。我們習慣將領導力與果斷、大膽、絕對自信畫上等號。心理學家 Gary Klein 曾指出,好萊塢塑造的約翰·韋恩(John Wayne)式人物,成了大眾對領導者的典型印象——那個沉著冷靜的牛仔,一語定方向,眾人隨之而行。沒有猶豫,沒有討論,只有行動。

然而,這種形象比激勵人心更具危險性。

這類「牛仔型」領導者仰賴直覺與經驗,避免表現出自我懷疑,也不樂於接受他人意見。他們壓抑異議聲音,助長集體盲思,在同溫層中做出決策。其結果是什麼?政策混亂、風險升高、體系脆弱。

這樣的情況,不只出現在政壇,也發生在董事會與新創公司中。而真正優秀的決策方式,往往正與這種典型背道而馳。現代領導者應鼓勵多元觀點、擁抱不確定性,並願意在需要時改變主意。這不是軟弱,而是智慧的表現。

當然,在開放討論與果斷執行之間取得平衡,從來不容易。正如 Eric Schmidt 所說:「多元觀點,加上一個期限。」Amazon 的領導原則之一是:「提出異議,但一旦決定就全力以赴。」該辯論時就充分辯論,但一旦做出決策,每個人都應全力支持、專注執行。

這樣的領導力,需要我們徹底改變對領導的認知。我們必須拋棄那種「領導就是英雄」的神話,擁抱一種全新的典範——一種建立在流程、價值觀與集體智慧之上的領導模式。看看那些真正成功的 CEO,他們重視協作,重視嚴謹的決策架構。他們不假裝自己擁有所有答案,而是建立起團隊、系統與文化,讓最好的答案能夠自然浮現。他們為決策負責,卻深知力量來自於整體的過程。

Jim Collins 稱這樣的領袖為「第五級領導者」——在堅定執行力與謙遜人格之間取得完美平衡。他們堅毅、有熱情、全力以赴,但同時也腳踏實地、極富自覺。這樣的領袖雖然稀少,但他們真實存在。而他們的樣貌,與約翰·韋恩或唐納·川普完全不同。

如果一定要找個榜樣,那不如學學荷馬筆下的奧德修斯(Odysseus)。當他面對迷惑人心的賽蓮(Sirens)時,他不相信自己的意志能夠抵擋誘惑。他選擇信任流程,命令水手封住耳朵,並把自己綁在桅杆上。他將決策權交給團隊,相信自己設計的系統比一時衝動來得可靠。他得以生還,不是因為他最強,而是因為他最有遠見。

這才是真正的領導力,這才是正確的決策方式。

這個時代需要的不再是牛仔,而是建築師——那些設計穩健流程、蒐集多元觀點、以真相為根基的領導者。他們勇敢又謙遜,他們不怕承認「我現在還不知道」,因為他們相信:「我們一起,一定能找到最好的方向。」

所以,如果你希望改變你的團隊、改變你的公司、改變你的未來——就先從改變自己開始。

用清晰做決策,用謙遜帶領團隊,用誠信鼓舞人心。 因為,這才是做出更好選擇的起點。 而這,就是領導——向前的力量。

The Power of Writing

There was a time in my life when everything felt like it was falling apart. I was heartbroken, overwhelmed, and lost. When my girlfriend broke up with me, it wasn’t just the end of a relationship—it was the collapse of everything I had built emotionally around that connection. Friends tried to help, but none truly understood the storm I was facing inside. I felt like I was drowning in silence.

Then, by what felt like fate, I came across the work of Jordan Peterson. His words cut through the noise. He spoke of the importance of taking responsibility for your suffering, confronting chaos with courage, and most practically—writing. He didn’t describe journaling as a fluffy, feel-good habit. He framed it as a disciplined act of self-confrontation, a way to explore truth and rebuild meaning in life.

So, I picked up a pen. At first, I simply poured out my thoughts—raw, unfiltered, emotional. I wrote about the breakup, my insecurities, my regrets, and my fears. And something unexpected happened: the more I wrote, the lighter I felt. The pages became a mirror reflecting not just pain, but strength I didn’t know I had. I wasn't just coping—I was healing.

It turns out, change doesn’t always come from monumental, sweeping actions. We often think we need to overhaul our lives in one grand gesture to move forward. But the truth? Real transformation begins with tiny tweaks. A journal entry. A five-minute walk. A conscious breath before reacting. These small shifts, when repeated with intention, create powerful momentum. And when these tweaks are aligned with our values, they can lead to lasting, life-altering change.

Think of a gymnast—graceful, powerful, balanced. What makes her capable of performing such impossible routines? Her core. When she wobbles, it’s her core strength that brings her back. Life is no different. When we face challenges, it's our mental and emotional core—our mindset, habits, and self-awareness—that keeps us steady. But to build that strength, we must step out of our comfort zones and attempt the hard things. That’s where growth lives.

Sara Blakely, the founder of Spanx and a self-made billionaire, shared a beautiful story: every night, her father would ask her, “How did you fail today?” Not because he wanted her to feel ashamed, but because he wanted her to see failure as a sign of courage—proof she was trying, risking, growing. That mindset is a gift. What if we all saw our failures not as flaws, but as badges of effort? What if we praised ourselves for showing up, for trying, for daring?

So often, what holds us back isn’t the world—it’s the story we tell ourselves. “I’ll freeze at that party.” “I’m not good enough for that job.” “They’re all more successful than me.” These are just stories. They feel real, but they’re not truth. They’re fear in disguise. And the longer we believe them, the further they pull us from who we really are.

In emergencies—fires, plane crashes—many people tragically die because they stick to familiar routes, trying to escape the way they entered. They can’t adapt. They can’t see a new path. And isn’t that how we sometimes respond to emotional crises too? Clinging to old beliefs, old patterns, old versions of ourselves—even when they no longer serve us.

But there’s a way out. It starts with awareness. With reflection. With writing.

James Pennebaker, a leading researcher on expressive writing, discovered that when people write about their deepest emotions, their mental and physical health improves dramatically. Lower anxiety. Better immunity. Fewer doctor visits. More meaningful relationships. Why? Because writing helps us make sense of what feels senseless. It gives shape to the chaos. It turns pain into perspective.

I didn’t know it at the time, but when I sat down to write after my breakup, I was doing something powerful. I was reclaiming my voice. I was rewriting the narrative. And over time, one page at a time, I began to rise.

We all carry stories—some are heavy, others unfinished. But the pen is in your hand. You get to choose what comes next. So don’t wait for life to fix itself. Start small. Start honest. Start with a single page.

Write. Reflect. Grow. Heal. And most importantly—keep going.

書寫的力量

曾經,我的人生彷彿一片瓦解。那時的我,心碎、崩潰、迷失。當女朋友和我分手時,那不僅是一段關係的結束,更像是我情感世界整個支撐架構的倒塌。朋友們試著安慰我,但沒有人真正明白我內心正經歷的風暴。我感覺自己像是溺水者,被沉默吞沒。

就像命運安排的一樣,我接觸到了喬登·彼得森(Jordan Peterson)的著作。他的話語穿透了我內心的混亂。他談論人們需要為自己的痛苦負責、勇敢面對混沌,最實際的一點——去書寫。他從不把寫日記描述成一種柔軟、感覺良好的習慣,而是一種自我對話的紀律行動,是探索真相與重建人生意義的方式。

於是,我拿起筆。一開始,我只是傾瀉內心的想法——真實、未經修飾、充滿情緒。我寫下這段分手經歷、我的不安、我的遺憾與恐懼。然後,一件出乎意料的事情發生了:我寫得越多,心情就越輕盈。這些頁面成了我的一面鏡子,不只映照出我的傷痛,也讓我看見了內在的力量——是我從未意識到的力量。我不只是努力撐住,我正在療癒。

事實證明,改變不一定得來自巨大的劇變。我們常以為,想要前進,得一次性徹底改造整個人生。但真相是,真正的轉變來自於微小的調整。一篇日記、一段五分鐘的散步、一個在反應前的深呼吸。當這些小小的舉動成為習慣,並與我們的價值觀一致時,它們就能累積出驚人的力量,並帶來深遠的改變。

想像一位體操選手——優雅、有力、穩定。她之所以能完成近乎不可能的動作,是因為她擁有強大的核心力量。當她失衡時,正是這個核心讓她重新穩住。人生亦然。當我們面對挑戰,真正支撐我們的,是我們的心智與情緒核心——我們的思維方式、習慣,以及自我覺察。而要建立這份穩定,我們必須走出舒適圈,挑戰困難。成長,就藏在那裡。

Spanx 創辦人、自力更生的億萬富翁莎拉·布蕾克利(Sara Blakely)曾分享一段動人的故事:每天晚餐時,她的父親都會問她:「你今天是怎麼失敗的?」他這麼問,不是為了讓她羞愧,而是為了讓她明白,失敗是一種勇氣的象徵——證明她有在嘗試、有在冒險、有在成長。這樣的思維模式,是一份珍貴的禮物。如果我們都能這樣看待失敗呢?不是缺陷,而是努力的勳章。我們能不能學著為自己的嘗試喝采,為自己的勇氣鼓掌?

其實,真正讓我們止步不前的,往往不是外在的世界,而是我們腦中那個不斷自我懷疑的聲音:「我在派對上一定會冷場」、「我根本不夠格拿到那份工作」、「他們的人生都比我精彩多了」……這些,都是故事。它們聽起來真實,卻不是真理。它們只是偽裝成邏輯的恐懼。當我們越相信這些故事,就越遠離真正的自己。

在緊急情況下——火災、墜機——很多人不幸喪命,並不是因為沒有出口,而是因為他們太過依賴原路逃生。他們無法靈活應變,看不見其他選項。我們在人生的情緒危機中,何嘗不是如此?我們固守著舊有的信念、模式、甚至是舊版本的自己,即便這些早已不再適用。

但,總有一條路能通往出口。起點是覺察,是反思,是書寫。

表達性書寫研究先驅詹姆斯·潘尼貝克(James Pennebaker)發現,當人們寫下內心最深層的情緒時,他們的心理與身體健康都會明顯改善。焦慮減少,免疫力提升,看醫生的次數減少,人際關係也變得更加深刻。為什麼?因為書寫讓我們能夠理解那些看似毫無意義的混亂,它為痛苦賦予結構,讓混亂化為清晰。

我當時並不知道,分手後坐下來寫字的那一刻,我做的是一件多麼強大的事情。我重新找回了自己的聲音,開始重寫屬於自己的故事。時間一頁一頁地流過,我漸漸走出了陰霾。

我們每個人都背負著不同的故事——有些沉重,有些未完。但筆,就在你手中。接下來的章節,由你來決定。所以,別再等待命運安排。從一個小行動開始,從一份誠實開始,從一頁紙開始。

書寫。反思。成長。療癒。最重要的是——繼續前行。

Mixture of Experts in Large Language Models

The rapid evolution of large language models (LLMs) has brought unprecedented capabilities to artificial intelligence, but it has also introduced significant challenges in computational cost, scalability, and efficiency. The Mixture of Experts (MoE) architecture has emerged as a groundbreaking solution to these challenges, enabling LLMs to scale efficiently while maintaining high performance. This blog post explores the concept, workings, benefits, and challenges of MoE in LLMs.

What is Mixture of Experts (MoE)?

The Mixture of Experts approach divides a neural network into specialized sub-networks called "experts," each trained to handle specific subsets of input data or tasks. A gating network dynamically routes inputs to the most relevant experts based on the problem at hand. Unlike traditional dense models where all parameters are activated for every input, MoE selectively activates only a subset of experts, optimizing computational efficiency.

This architecture is inspired by ensemble methods in machine learning but introduces dynamic routing mechanisms that allow the model to specialize in different domains or tasks. For example, one expert might excel at syntax processing while another focuses on semantic understanding.

How Does MoE Work?

MoE operates through two main phases: training and inference.

Training Phase
  1. Expert Training: Each expert specializes in a distinct subset of data or task, refining its capabilities to address specific challenges.
  2. Gating Network Training: The gating network learns to route inputs to the most suitable experts by optimizing a probability distribution over all experts.
  3. Joint Optimization: Both experts and the gating network are trained collaboratively using a combined loss function to ensure harmony between task assignment and overall performance.
Inference Phase
  1. Input Routing: The gating network evaluates incoming data and assigns it to relevant experts.
  2. Selective Activation: Only the most pertinent experts are activated for each input, minimizing resource usage.
  3. Output Combination: Outputs from activated experts are merged into a unified result using techniques like weighted averaging.
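
To make the routing concrete, here is a minimal sketch of an MoE forward pass in Python. It mirrors the inference phase described above: a gating network scores all experts, only the top-k are activated per token, and their outputs are merged with the renormalized gate weights. The function names, tensor shapes, and the single tanh layer used as each "expert" are illustrative assumptions, not the architecture of any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, gate_w, expert_ws, top_k=2):
    """Toy MoE forward pass: route each token to its top-k experts and
    merge their outputs using the (renormalized) gating probabilities."""
    gate_probs = softmax(x @ gate_w)                    # (tokens, n_experts)
    top = np.argsort(-gate_probs, axis=-1)[:, :top_k]   # top-k expert ids per token
    out = np.zeros((x.shape[0], expert_ws[0].shape[1]))
    for t in range(x.shape[0]):
        weights = gate_probs[t, top[t]]
        weights = weights / weights.sum()               # renormalize over selected experts
        for e, w in zip(top[t], weights):
            out[t] += w * np.tanh(x[t] @ expert_ws[e])  # each "expert" is a tiny feed-forward net
    return out

# Tiny demo: 4 tokens of dimension 8 routed across 4 experts, 2 active per token.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
gate_w = rng.normal(size=(8, 4))
expert_ws = [rng.normal(size=(8, 8)) for _ in range(4)]
print(moe_layer(x, gate_w, expert_ws).shape)            # (4, 8)
```

Because only two of the four experts run for any given token, the compute per token stays roughly constant even as more experts are added, which is the efficiency argument behind MoE scaling.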

Advantages of MoE in LLMs

MoE offers several key benefits that make it particularly effective for large-scale AI applications:

  • Efficiency: By activating only relevant experts for each task, MoE reduces unnecessary computation and accelerates inference.
  • Scalability: MoE allows models to scale to trillions of parameters without proportional increases in computational costs.
  • Specialization: Experts focus on specific tasks or domains, improving accuracy and adaptability across diverse applications like multilingual translation and text summarization.
  • Flexibility: New experts can be added or existing ones modified without disrupting the overall model architecture.
  • Fault Tolerance: The modular nature ensures that issues with one expert do not compromise the entire system's functionality.

Challenges in Implementing MoE

Despite its advantages, MoE comes with significant challenges:

  1. Training Complexity: Coordinating the gating network with multiple experts requires sophisticated optimization techniques. Hyperparameter tuning is more demanding due to the increased complexity of the architecture.

  2. Inference Overhead: Routing inputs through the gating network adds computational steps. Activating multiple experts simultaneously can strain memory and parallelism capabilities.

  3. Infrastructure Requirements: Sparse models demand substantial memory during execution as all experts need to be stored. Deployment on edge devices or resource-constrained environments requires additional engineering efforts.

  4. Load Balancing: Ensuring uniform workload distribution among experts is critical for optimal performance but challenging to achieve.
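
The load-balancing problem is usually attacked with an auxiliary loss added during training. The sketch below shows one widely cited formulation, in the spirit of the Switch Transformer auxiliary loss; the function name and shapes are illustrative, not the exact loss of any specific model. The router is penalized when the fraction of tokens an expert actually receives and the average probability assigned to it drift away from uniform.

```python
import numpy as np

def load_balancing_loss(gate_probs, chosen_expert, n_experts):
    """Auxiliary loss ~ n_experts * sum_i f_i * P_i, where f_i is the fraction
    of tokens routed to expert i and P_i is the mean gate probability for it.
    The loss is minimized when both quantities are uniform (1 / n_experts)."""
    f = np.array([(chosen_expert == i).mean() for i in range(n_experts)])  # realized load
    P = gate_probs.mean(axis=0)                                            # average router belief
    return n_experts * float((f * P).sum())

# Perfectly balanced routing gives a loss of 1.0; imbalance pushes it higher.
probs = np.full((8, 4), 0.25)                              # uniform router over 4 experts
print(load_balancing_loss(probs, np.arange(8) % 4, 4))     # 1.0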

Applications of MoE in LLMs

MoE is transforming various fields by enabling efficient handling of complex tasks:

Natural Language Processing (NLP)
  • Multilingual Models: Experts specialize in language-specific tasks, enabling efficient translation across dozens of languages (e.g., Microsoft Z-code).
  • Text Summarization & Question Answering: Task-specific routing enhances accuracy by leveraging domain-specialized experts.
Computer Vision
  • Vision Transformers (ViTs): Google’s V-MoEs dynamically route image patches to specialized experts for improved recognition accuracy and speed.

State-of-the-Art Models Using MoE

Several cutting-edge LLMs employ MoE architectures:

  • OpenAI’s GPT-4 reportedly integrates MoE techniques for enhanced scalability and efficiency.
  • Mistral AI’s Mixtral 8x7B model leverages MoE for faster inference and reduced computational costs.
  • Google’s Gemini 1.5 and IBM’s Granite 3.0 showcase innovative applications of MoE in multi-modal AI systems.

Future Directions

The Mixture of Experts architecture is poised for further innovation:

  • Enhanced routing algorithms for better load balancing and inference efficiency.
  • Integration with multi-modal systems combining text, images, and other data types.
  • Democratization through open-source implementations like DeepSeek R1, making advanced AI accessible to a broader audience.

Conclusion

Mixture of Experts represents a paradigm shift in how large language models are designed and deployed. By combining specialization with scalability, it addresses key limitations of traditional dense architectures while unlocking new possibilities for AI applications across domains. As research continues to refine this approach, MoE is set to play a pivotal role in shaping the future of artificial intelligence.

專家混合技術在大型語言模型中的應用

大型語言模型(LLMs)的快速發展為人工智慧帶來了前所未有的能力,但也引入了計算成本、可擴展性和效率方面的重大挑戰。專家混合技術(Mixture of Experts,MoE)架構作為解決這些挑戰的突破性方案,使LLMs能夠在保持高性能的同時有效地擴展。本篇文章將探討MoE的概念、運作方式、優勢及其面臨的挑戰。

什麼是專家混合技術(MoE)?

專家混合技術將神經網絡分成多個專業化的子網絡,稱為「專家」,每個專家都被訓練來處理特定的輸入數據或任務子集。一個門控網絡(Gating Network)根據當前問題動態地將輸入路由到最相關的專家。與傳統密集模型中所有參數對每個輸入都被激活不同,MoE僅選擇性地激活部分專家,從而優化計算效率。

這種架構受機器學習中的集成方法啟發,但引入了動態路由機制,使模型能夠在不同領域或任務中實現專業化。例如,一位專家可能擅長語法處理,而另一位則側重於語義理解。

MoE如何運作?

MoE主要通過訓練和推理兩個階段來運作。

訓練階段
  1. 專家訓練:每個專家專注於特定數據或任務子集,提升其解決特定挑戰的能力。
  2. 門控網絡訓練:門控網絡通過優化所有專家的概率分佈來學習如何將輸入路由到最合適的專家。
  3. 聯合優化:專家和門控網絡使用結合損失函數共同訓練,以確保任務分配與整體性能之間的協調。
推理階段
  1. 輸入路由:門控網絡評估輸入數據並分配給相關的專家。
  2. 選擇性激活:針對每個輸入僅激活最相關的專家,從而最大限度地減少資源使用。
  3. 輸出合併:通過加權平均等技術將激活的專家的輸出合併為統一結果。

MoE在LLMs中的優勢

MoE提供了多項關鍵優勢,使其在大規模AI應用中尤其有效:

  • 效率:僅激活每項任務相關的專家,減少不必要的計算並加快推理速度。
  • 可擴展性:MoE使模型能夠擴展至兆億級參數,而不會導致計算成本成比例增加。
  • 專業化:專家聚焦於特定任務或領域,提升準確性和適應性,例如多語言翻譯和文本摘要。
  • 靈活性:可以添加新的專家或修改現有專家,而不會破壞整體模型架構。
  • 容錯性:模塊化設計確保某一位專家的問題不會影響整個系統功能。

實施MoE面臨的挑戰

儘管具有顯著優勢,MoE仍面臨一些挑戰:

  1. 訓練複雜性:協調門控網絡與多個專家需要複雜的優化技術。超參數調整更加困難,因為架構變得更為複雜。

  2. 推理開銷:通過門控網絡路由輸入增加了計算步驟。同時激活多個專家可能對記憶體和並行能力造成壓力。

  3. 基礎設施需求:稀疏模型在執行期間需要大量記憶體存儲所有專家。在邊緣設備或資源受限環境中部署需要額外工程努力。

  4. 負載均衡:確保所有專家的工作負載均勻分佈對於最佳性能至關重要,但實現起來具有挑戰性。

MoE在LLMs中的應用

MoE正在改變各個領域,能夠有效處理複雜任務:

自然語言處理(NLP)
  • 多語言模型:專家擅長於特定語言任務,使跨多種語言翻譯更加高效(例如Microsoft Z-code)。
  • 文本摘要與問答:基於任務的路由通過利用領域專業化的專家提高準確性。
電腦視覺
  • 視覺Transformer(ViTs):Google的V-MoEs動態路由圖像塊至專業化的專家,以提升識別準確性和速度。

使用MoE的尖端模型

一些最前沿的大型語言模型採用了MoE架構:

  • OpenAI 的 GPT-4 據報導整合了MoE技術以提升可擴展性和效率。
  • Mistral AI 的 Mixtral 8x7B 模型利用MoE實現更快推理和降低計算成本。
  • Google 的 Gemini 1.5 和 IBM 的 Granite 3.0 展示了MoE在多模態AI系統中的創新應用。

未來方向

專家混合技術有望進一步創新:

  • 改進路由算法以實現更好的負載均衡和推理效率。
  • 與多模態系統結合,包括文本、圖像及其他數據類型。
  • 通過開源實現(如DeepSeek R1)推動民主化,使先進AI更廣泛地可用。

結論

專家混合技術代表了大型語言模型設計和部署方式的一次範式轉變。通過結合專業化與可擴展性,它解決了傳統密集架構的主要限制,同時為各領域AI應用開啟了新的可能性。隨著研究不斷完善這一方法,MoE有望在塑造人工智慧未來方面發揮重要作用。

Understanding Self-Attention in Large Language Models (LLMs)

Self-attention is a cornerstone of modern machine learning, particularly in the architecture of large language models (LLMs) like GPT, BERT, and other Transformer-based systems. Its ability to dynamically weigh the importance of different elements in an input sequence has revolutionized natural language processing (NLP) and other domains like computer vision and recommender systems. However, as LLMs scale to handle increasingly long sequences, newer innovations like sparse attention and ring attention have emerged to address computational challenges. This blog post explores the mechanics of self-attention, its benefits, and how sparse and ring attention are pushing the boundaries of efficiency and scalability.

What is Self-Attention?

Self-attention is a mechanism that enables models to focus on relevant parts of an input sequence while processing it. Unlike traditional methods such as recurrent neural networks (RNNs), which handle sequences step-by-step, self-attention allows the model to analyze all elements of the sequence simultaneously. This parallelization makes it highly efficient and scalable for large datasets.

The process begins by transforming each token in the input sequence into three vectors: Query (Q), Key (K), and Value (V). These vectors are computed using learned weight matrices applied to token embeddings. The mechanism then calculates attention scores by taking the dot product between the Query and Key vectors, followed by a softmax operation to normalize these scores into probabilities. Finally, these probabilities are used to compute a weighted sum of Value vectors, producing context-aware representations of each token.

How Self-Attention Works

Here’s a step-by-step breakdown:

  1. Token Embeddings: Each word or token in the input sequence is converted into a numerical vector using an embedding layer.
  2. Query, Key, Value Vectors: For each token, three vectors are generated: the Query represents the current focus or "question" about the token; the Key acts as a reference point for comparison; and the Value contains the actual information content of the token.
  3. Attention Scores: The dot product between Query and Key vectors determines how relevant one token is to another.
  4. Softmax Normalization: Attention scores are normalized so they sum to 1, ensuring consistent weighting.
  5. Weighted Sum: Value vectors are multiplied by their respective attention weights and summed to produce enriched representations.

To address potential instability caused by large dot product values during training, the scores are scaled by dividing them by the square root of the Key vector's dimensionality—a method known as scaled dot-product attention.
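
Put together, the five steps above (plus the scaling factor) fit in a few lines of code. The sketch below is a single-head, single-sequence version in plain NumPy; the weight matrices and dimensions are made-up stand-ins for what a real model learns.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # step 3: attention scores, scaled
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # step 4: softmax normalization
    return weights @ V                                 # step 5: weighted sum of Values

# Toy example: 5 tokens with 16-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                           # step 1: token embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)   # step 2: Q, K, V
print(out.shape)                                       # (5, 16) context-aware representations
```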

Why Self-Attention Matters

Self-attention offers several advantages that make it indispensable in LLMs:

  • Capturing Long-Range Dependencies: It excels at identifying relationships between distant elements in a sequence, overcoming limitations of RNNs that struggle with long-term dependencies.
  • Contextual Understanding: By attending to different parts of an input sequence, self-attention enables models to grasp nuanced meanings and relationships within text.
  • Parallelization: Unlike sequential models like RNNs, self-attention processes all tokens simultaneously, significantly boosting computational efficiency.
  • Adaptability Across Domains: While initially developed for NLP tasks like machine translation and sentiment analysis, self-attention has also proven effective in computer vision (e.g., image recognition) and recommender systems.

Challenges with Scaling Self-Attention

While self-attention is powerful, its quadratic computational complexity relative to sequence length poses challenges for handling long sequences. For example:

  • Processing a sequence of 10,000 tokens requires computing a 10,000 x 10,000 attention matrix.
  • This results in high memory usage and slower computations.
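
A quick back-of-the-envelope calculation makes that cost concrete (assuming 32-bit floats and a single attention head):

```python
seq_len = 10_000
bytes_per_score = 4                                # one float32 per attention score
matrix_bytes = seq_len * seq_len * bytes_per_score
print(matrix_bytes / 1e6, "MB")                    # 400.0 MB for one head in one layer
```

Multiply that by the number of heads and layers, and the memory pressure becomes obvious.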

To address these issues, researchers have developed more efficient mechanisms like sparse attention and ring attention.

Sparse Attention: Reducing Computational Complexity

Sparse attention mitigates the inefficiencies of traditional self-attention by reducing the number of attention computations without sacrificing performance.

Key Features of Sparse Attention
  1. Fixed Sparsity Patterns: Instead of attending to all tokens, sparse attention restricts focus to a subset—such as neighboring tokens in a sliding window or specific distant tokens for long-range dependencies.
  2. Learned Sparsity: During training, the model learns which token interactions are most important, effectively pruning less significant connections.
  3. Block Sparsity: Groups of tokens are processed together in blocks, reducing the size of the attention matrix while retaining contextual understanding.
  4. Hierarchical Structures: Some implementations use hierarchical or dilated patterns to capture both local and global dependencies efficiently.
Advantages
  • Lower Memory Requirements: By limiting the number of token interactions, sparse attention reduces memory usage significantly.
  • Improved Scalability: Sparse patterns allow models to handle longer sequences with reduced computational overhead.
  • Task-Specific Optimization: Sparse patterns can be tailored to specific tasks where certain dependencies are more critical than others.
Example Use Case

In machine translation, sparse attention can focus on relevant parts of a sentence (e.g., verbs and subjects), ignoring less critical words like articles or conjunctions. This targeted approach maintains translation quality while reducing computational costs.
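
As a concrete illustration of a fixed sparsity pattern, here is a sliding-window variant in NumPy. Each token attends only to its nearest neighbors, so the number of scored pairs grows linearly with sequence length rather than quadratically; the window size and shapes are illustrative assumptions.

```python
import numpy as np

def sliding_window_attention(Q, K, V, window=2):
    """Local sparse attention: token i attends only to tokens i-window .. i+window."""
    n, d = Q.shape
    out = np.zeros_like(V)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ V[lo:hi]                # weighted sum over the local window only
    return out
```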

Ring Attention: Near-Infinite Context Handling

Ring attention is a cutting-edge mechanism designed for ultra-long sequences. It distributes computation across multiple devices arranged in a ring-like topology, enabling efficient processing of sequences that traditional attention mechanisms cannot handle.

How Ring Attention Works
  1. Blockwise Computation: The input sequence is divided into smaller blocks. Each block undergoes self-attention and feedforward operations independently.
  2. Ring Topology: Devices (e.g., GPUs) are arranged in a circular structure. Each device processes its assigned block while passing key-value pairs to the next device in the ring.
  3. Overlapping Communication and Computation: While one device computes attention for its block, it simultaneously sends processed data to the next device and receives new data from its predecessor.
  4. Incremental Attention: Attention values are computed incrementally as data moves through the ring, avoiding the need to materialize the entire attention matrix.
Advantages
  • Memory Efficiency: By distributing computation across devices and avoiding full matrix storage, ring attention drastically reduces memory requirements.
  • Scalability: The mechanism scales linearly with the number of devices, enabling near-infinite context sizes.
  • Efficient Parallelism: Overlapping communication with computation minimizes delays and maximizes hardware utilization.
Example Use Case

Consider processing an entire book or legal document where context from distant sections is crucial for understanding. Ring attention enables LLMs to maintain coherence across millions of tokens without running into memory constraints.
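
The following sketch simulates the ring on a single machine to show the core idea: query blocks stay put, key/value blocks rotate one step per iteration, and each "device" folds the incoming block into a running (online) softmax so the full attention matrix never exists anywhere. Block counts, shapes, and the serial inner loop are simplifications for illustration, not a faithful multi-device implementation.

```python
import numpy as np

def ring_attention_sim(q_blocks, k_blocks, v_blocks):
    """Single-process simulation of ring attention with incremental softmax."""
    n_dev, d = len(q_blocks), q_blocks[0].shape[-1]
    acc   = [np.zeros_like(q) for q in q_blocks]                # running weighted values
    denom = [np.zeros(q.shape[0]) for q in q_blocks]            # running softmax denominators
    m     = [np.full(q.shape[0], -np.inf) for q in q_blocks]    # running max scores
    kv = list(zip(k_blocks, v_blocks))
    for _ in range(n_dev):                      # one rotation step per device in the ring
        for dev in range(n_dev):                # "parallel" devices, simulated serially here
            k, v = kv[dev]
            s = q_blocks[dev] @ k.T / np.sqrt(d)
            m_new = np.maximum(m[dev], s.max(axis=-1))
            scale = np.exp(m[dev] - m_new)                      # rescale previous statistics
            p = np.exp(s - m_new[:, None])
            acc[dev]   = acc[dev] * scale[:, None] + p @ v
            denom[dev] = denom[dev] * scale + p.sum(axis=-1)
            m[dev] = m_new
        kv = kv[1:] + kv[:1]                    # pass key/value blocks to the next device
    return np.concatenate([a / dnm[:, None] for a, dnm in zip(acc, denom)])

# Four "devices", each holding a block of 8 tokens with 16-dimensional heads.
rng = np.random.default_rng(0)
blocks = [rng.normal(size=(8, 16)) for _ in range(4)]
out = ring_attention_sim(blocks, blocks, blocks)    # self-attention over all 32 tokens
print(out.shape)                                    # (32, 16)
```

Because each device only ever holds one key/value block at a time, per-device memory stays constant as the total sequence grows; adding devices extends the usable context window.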

Comparison Table

Feature | Traditional Self-Attention | Sparse Attention | Ring Attention
Computational Complexity | Quadratic | Linear or Sub-quadratic | Distributed Linear
Focus Area | All tokens | Selective focus on subsets | Entire sequence via distributed devices
Scalability | Limited | Moderately long sequences | Near-infinite sequences
Memory Efficiency | High memory usage | Reduced memory via sparsity | Distributed memory across devices
Best Use Case | Short-to-medium sequences | Medium-to-long sequences | Ultra-long contexts

Conclusion

Self-attention has transformed how machines process language and other sequential data by enabling dynamic focus on relevant information within an input sequence. Sparse attention builds on this foundation by optimizing computations for moderately long sequences through selective focus on key interactions. Meanwhile, ring attention pushes boundaries further by enabling efficient processing of ultra-long contexts using distributed computation across devices.

As LLMs continue to evolve with increasing context windows and applications across diverse domains—from summarizing books to analyzing legal documents—these innovations will play an essential role in shaping their future capabilities. Whether you're working on NLP tasks with dense local dependencies or tackling projects requiring vast context windows, understanding these mechanisms will help you leverage modern AI technologies effectively.

大型語言模型(LLM)中的自注意力機制

自注意力(Self-Attention)是現代機器學習的核心技術,尤其是在像 GPT、BERT 和其他基於 Transformer 的大型語言模型(LLM)架構中。它能夠動態地衡量輸入序列中不同元素的重要性,徹底改變了自然語言處理(NLP)以及計算機視覺和推薦系統等領域。然而,隨著 LLM 的擴展以處理越來越長的序列,稀疏注意力(Sparse Attention)與環狀注意力(Ring Attention)等創新技術應運而生,以解決計算挑戰。本文將探討自注意力的工作原理、優勢,以及稀疏和環狀注意力如何突破效率和可擴展性的界限。

什麼是自注意力?

自注意力是一種機制,使模型在處理輸入序列時能夠專注於相關部分。與傳統方法如循環神經網絡(RNN)逐步處理序列不同,自注意力允許模型同時分析序列中的所有元素。這種並行化使其對於大數據集非常高效且可擴展。

該過程首先將輸入序列中的每個標記轉換為三個向量:查詢(Query, Q)、鍵(Key, K)與值(Value, V)。這些向量是通過對標記嵌入應用學習的權重矩陣計算得出的。然後,自注意力通過查詢和鍵向量的點積計算注意力分數,並通過 softmax 操作將這些分數歸一化為概率。最後,這些概率用於計算值向量的加權總和,生成每個標記的上下文感知表示。

自注意力如何運作

以下是詳細步驟:

  1. 標記嵌入:輸入序列中的每個單詞或標記使用嵌入層轉換為數值向量。
  2. 查詢、鍵和值向量:對於每個標記,生成三個向量:查詢(Query)表示當前對標記的“問題”或關注;鍵(Key)充當比較的參考點;值(Value)包含標記的實際信息內容。
  3. 注意力分數:查詢和鍵向量之間的點積決定了一個標記與另一個標記的相關性。
  4. Softmax 歸一化:注意力分數被歸一化,使其總和為 1,確保權重一致。
  5. 加權總和:值向量乘以各自的注意力權重並相加,生成增強表示。

為了解決訓練期間由於點積值過大導致的不穩定性,分數通過除以鍵向量維度平方根進行縮放,即所謂的縮放點積注意力。

自注意力的重要性

自注意力提供了多項優勢,使其在 LLM 中不可或缺:

  • 捕捉長距依賴性:它在識別序列中遠距元素之間的關係方面表現出色,克服了 RNN 在長期依賴性上的限制。
  • 上下文理解:通過關注輸入序列中的不同部分,自注意力使模型能夠掌握文本中的細微含義和關係。
  • 並行化處理:與 RNN 等順序模型不同,自注意力同時處理所有標記,大幅提高計算效率。
  • 跨領域適應性:雖然最初是為 NLP 任務(如機器翻譯和情感分析)開發,但自注意力在計算機視覺(如圖像識別)和推薦系統中也表現出色。

擴展自注意力的挑戰

儘管自注意力功能強大,但其相對於序列長度的二次計算複雜度在處理長序列時會帶來挑戰。例如:

  • 處理 10,000 個標記的序列需要計算一個 10,000 x 10,000 的注意力矩陣。
  • 這導致高內存使用率和較慢的計算速度。

為了解決這些問題,研究人員開發了更高效的機制,如稀疏注意力和環狀注意力。

稀疏注意力:降低計算複雜度

稀疏注意力通過減少計算次數來緩解傳統自注意力的低效問題,同時保持性能。

稀疏注意力的主要特徵
  1. 固定稀疏模式:稀疏注意力僅關注子集,例如滑動窗口中的鄰近標記或遠距依賴的重要標記,而非所有標記。
  2. 學習稀疏性:在訓練期間,模型會學習哪些標記交互最重要,有效地修剪不太重要的連接。
  3. 塊狀稀疏性:一組標記被分組並一起處理,減少了矩陣大小,同時保留上下文理解。
  4. 層次結構:一些實現使用層次或膨脹模式來高效捕捉局部和全局依賴性。
優勢
  • 降低內存需求:通過限制標記交互次數,稀疏注意力顯著降低內存使用率。
  • 提高可擴展性:稀疏模式使模型能夠以較低計算成本處理更長的序列。
  • 任務特定優化:稀疏模式可以針對特定任務進行定制,例如翻譯或摘要,其中某些依賴性更為重要。
示例應用

在機器翻譯中,稀疏注意力可以專注於句子的相關部分(例如動詞和主語),忽略不太重要的詞語,如冠詞或連詞。這種針對性方法在保持翻譯質量的同時降低了計算成本。

環狀注意力:近乎無限上下文處理

環狀注意力是一種尖端機制,用於超長序列。它將計算分佈到多個設備上,這些設備排列成類似環狀拓撲結構,使得傳統機制無法處理的超長序列能夠高效運行。

環狀注意力如何運作
  1. 塊狀計算:輸入序列被分割成較小塊,每塊獨立進行自注意力和前饋操作。
  2. 環狀拓撲結構:設備(如 GPU)排列成圓形結構,每個設備處理其分配的塊,同時將鍵值對傳遞給下一設備。
  3. 通信與計算重疊進行:當一個設備為其塊計算注意力時,它同時向下一設備發送已處理數據並接收前一設備的新數據。
  4. 增量式注意力計算:隨著數據在環中移動,逐步計算出注意值,避免需要實現完整矩陣。
優勢
  • 內存效率高:通過分佈式計算並避免完整矩陣存儲,環狀注意力顯著降低內存需求。
  • 可擴展性強:該機制隨設備數量線性擴展,使得上下文大小幾乎無限。
  • 高效並行化處理:通信與計算重疊最大限度地減少延遲並提高硬件利用率。
示例應用

考慮處理整本書或法律文件,其中需要從遠距部分獲取上下文才能理解。環狀注意力使 LLM 能夠在不受內存限制影響的情況下保持數百萬個標記的一致性。

比較表

特徵 | 傳統自注意力 | 稀疏注意力 | 環狀注意力
計算複雜度 | 二次複雜度 | 線性或次二次複雜度 | 分佈式線性
關注範圍 | 所有標記 | 子集選擇 | 通過分佈式設備處理整個序列
可擴展性 | 有限 | 中等長度序列 | 幾乎無限長度序列
內存效率 | 高內存使用 | 通過稀疏降低內存 | 分佈式內存
最佳應用場景 | 短至中等長度序列 | 中等至長序列 | 超長上下文

結論

自注意力通過使模型能夠動態專注於輸入序列中的相關信息,徹底改變了機器如何處理語言及其他順序數據。稀疏注意力在此基礎上進一步發展,通過選擇關鍵交互來優化中等長度序列的計算。而環狀注意力則更進一步,利用分佈式設備高效處理超長上下文。

隨著 LLM 不斷發展以應對越來越大的上下文窗口及跨領域應用——從書籍摘要到法律文件分析——這些創新技術將在塑造其未來能力方面發揮至關重要作用。不論您是在研究具有密集局部依賴性的 NLP 任務還是需要廣泛上下文窗口的大型項目,理解這些機制都將幫助您有效利用現代 AI 技術。