Four Lessons from a Year of Building Machine Learning Tools

InfoQ, 2021-08-15 12:14:07

We want to share some of our most surprising lessons about how to build machine learning tools, what will be needed in the future, and why domain experts will play an important role in the future of AI.

Over the past year, Humanloop (https://humanloop.com/demo) has been building a new kind of tool for training and deploying natural language processing models. We have helped teams of lawyers, customer service agents, marketers and software developers quickly train AI models that understand language and put them to use immediately. We started out using active learning (https://humanloop.com/blog/why-you-should-be-using-active-learning/) to reduce the need for annotated data, but quickly discovered that much more was needed.

What we really needed was a new set of tools and workflows, designed from first principles to handle the challenges of working with AI. Here is some of what we learned.

1. Subject-matter experts can have as much impact as data scientists

In the early 2010s, demand for deep learning expertise was so high that Geoff Hinton was able to auction himself to Google for $44 million (https://www.wired.com/story/secret-auction-race-ai-supremacy-google-microsoft-baidu/). That is no longer the case.

Many of the hard problems of that time are now commoditized. By importing a library you can use state-of-the-art models, and most research breakthroughs are incorporated quickly. Even with a PhD in deep learning, I am still surprised by how well standard models perform out of the box across a wide range of use cases.

Building machine learning services is still hard, but the biggest challenge is getting the right data.

Perhaps surprisingly, technical machine learning expertise has become less useful than domain expertise.

For example, we worked with a team that wanted to know the outcomes of more than 80,000 historical legal judgments. Processing these documents by hand was completely infeasible; it would have cost hundreds of thousands of dollars in lawyer time. A data scientist alone could not solve this problem. What we really needed was a lawyer.

In data science, the traditional workflow treats data annotation as the first step of model training. We have learned that putting data annotation and data curation at the center of the workflow actually gets you to results faster. With subject-matter experts in the lead, it is easier to collaborate with data scientists, and we have seen it produce higher-quality data and higher-quality models.

A team of two lawyers annotated the data on the Humanloop platform while a model was trained automatically in parallel using active learning (https://humanloop.com/blog/why-you-should-be-using-active-learning/). In just a few hours, the lawyers had trained a model that provided outcomes for all 80,000 judgments, without any involvement from a data scientist.

It is not just lawyers. We have seen teams of doctors annotating data to train medical chatbots, financial analysts labeling data for named-entity recognition, and scientists annotating data for large-scale paper search.

2. The first iteration is always on the label taxonomy

Training a machine learning model usually starts with labeling a dataset. When we first built the Humanloop platform, we thought choosing a label taxonomy was something you did once at the start of a project and were then done with.

Most teams underestimate how hard it is to define a good label taxonomy without first exploring the data.

We quickly realized that once teams start annotating data, they discover that their initial guess at the categories they wanted was wrong. The data often contains categories they had never considered, or categories so rare that they are better merged into a larger one. Teams are surprised to find that even for simple categories it is often hard to agree on what they mean.

After a project starts, there is almost always a discussion among data scientists, project managers and annotators about how to update the label taxonomy.

Putting data curation at the center of the machine learning workflow lets the different stakeholders reach agreement quickly. To make this easier, we added the ability for project managers to edit their label taxonomy during annotation. Humanloop's models and active learning system automatically follow any changes to the labels, and teams can flag, comment on and discuss example data points.

3. The return on investment of fast feedback is high

An unexpected benefit of the active learning platform we built is that it lets teams prototype projects quickly and de-risk them. On the Humanloop platform, a model is trained in real time as the team annotates, and statistics on model performance are provided.

Many machine learning projects fail. According to Algorithmia's data (https://info.algorithmia.com/hubfs/2020/Reports/2021-Trends-in-ML/Algorithmia_2021_enterprise_ML_trends.pdf?hsLang=en-us), as many as 80% of projects never make it to production. This usually happens because the goals are unclear, the input data is too poor in quality to predict the output, or models get stuck waiting to go into production. Senior managers become reluctant to commit resources to projects with high uncertainty, and many good opportunities are missed.

Although we had not planned for it, we realized that teams were using Humanloop's fast early feedback to assess the feasibility of their projects. They could upload a small dataset and label a few examples to get a sense of how well their project would work. As a result, some projects that would probably have failed did not go ahead, while others quickly received more resources because the team knew they would succeed. This early exploration was often done by product managers with no machine learning background at all.

4. Machine learning tools should be data-centric but model-backed

Most current tools for training and deploying machine learning (MLOps) were built on the model of traditional software tooling. They focus on code rather than data, and each targets a narrow slice of the machine learning development pipeline. There are MLOps tools for monitoring, feature stores, model versioning, dataset versioning, model training, evaluation stores and more. Almost none of them make it easy to view and understand the data the system is learning from.

Andrew Ng (https://youtu.be/06-AZXmwHjo?t=1702) and Andrej Karpathy (https://karpathy.medium.com/software-2-0-a64152b37c35), among others, have recently been calling for data-centric machine learning tools. We fully agree that machine learning requires teams to pay far more attention to their datasets, but we have found that the best versions of these tools need to be tightly integrated with the model.

On the Humanloop platform, most of the benefits we see come from the interplay between data and model:

1. In the exploration phase: the model surfaces rare classes and gives feedback on how hard each class is to learn.
2. In the training phase: the model finds the most valuable data to label, so that a high-performing model is reached with fewer labels.
3. In the review phase: the model makes it much easier to find incorrect annotations. The Humanloop platform shows examples where the model's prediction disagrees with the domain-expert annotator and the model has high confidence. Finding and correcting mislabeled data points is often the most effective way to improve model performance.

Combining the data and model building processes pays off at every stage of machine learning development. When the model learns from annotations as they arrive, deployment is no longer a "waterfall" moment. The model is continuously learning and can easily be kept up to date.

A year in, we believe we have made significant progress in building new tools that make machine learning much easier, starting with natural language processing. Experts from many industries are now contributing to the training of AI models, and we are excited to see the new applications being built on Humanloop.

About the author:

Humanloop is a startup working on machine learning and artificial intelligence. Its product, also called Humanloop, is an AI data-labeling tool for training and deploying natural language processing models. It provides an API for users' models and lets users better visualize and understand their data, thereby extending the reach of the customer's own staff.

Original link:

https://humanloop.com/blog/4-lessons-from-a-year-building-tools-for-machine-learning
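As a rough illustration of the training-phase and review-phase ideas in lesson 4, the sketch below shows one common way such steps can be done in plain Python. It is not Humanloop's implementation, which the article does not describe. Least-confidence sampling stands in for "finding the most valuable data to label", and a simple confidence threshold stands in for "disagrees with the annotator with high confidence". The function names, the 0.9 threshold, the "won"/"lost" labels and the toy probabilities are all assumptions made up for this example.

```python
from typing import Dict, List, Tuple

# Predicted probability per label for a single example (hypothetical format).
Probs = Dict[str, float]


def top_prediction(probs: Probs) -> Tuple[str, float]:
    """Return the most likely label and its probability."""
    label = max(probs, key=probs.get)
    return label, probs[label]


def least_confident(unlabeled: List[Probs], k: int) -> List[int]:
    """Training phase: indices of the k examples the model is least sure about."""
    order = sorted(
        range(len(unlabeled)),
        key=lambda i: top_prediction(unlabeled[i])[1],
    )
    return order[:k]


def confident_disagreements(
    predictions: List[Probs],
    human_labels: List[str],
    threshold: float = 0.9,  # assumed cut-off, not taken from the article
) -> List[int]:
    """Review phase: indices where a confident model contradicts the annotator."""
    flagged = []
    for i, (probs, human) in enumerate(zip(predictions, human_labels)):
        label, confidence = top_prediction(probs)
        if label != human and confidence >= threshold:
            flagged.append(i)
    return flagged


# Toy model outputs for four legal judgments (made-up numbers).
pool = [
    {"won": 0.98, "lost": 0.02},
    {"won": 0.55, "lost": 0.45},
    {"won": 0.10, "lost": 0.90},
    {"won": 0.51, "lost": 0.49},
]

# Which two examples should an annotator label next?
to_label = least_confident(pool, k=2)
print(to_label)  # [3, 1]

# Which existing annotations deserve a second look?
annotator_labels = ["lost", "won", "lost", "lost"]
to_review = confident_disagreements(pool, annotator_labels)
print(to_review)  # [0]
```

Example 3 also disagrees with its annotator, but the model is only 51% confident there, so it is queued for labeling instead of being flagged as a likely annotation error.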