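Guanghua Forum: Forum of Distinguished Public Figures and Entrepreneurs, No. 6629
Topic: Distributed Hard Screening for Massive Data
Speaker: Prof. Chen Xu (徐晨), Xi'an Jiaotong University
Host: Prof. Jinyuan Chang (常晉源), Southwestern University of Finance and Economics
Time: November 1, 15:00-16:00
Venue: Conference Room 1003, Guanghua Building, Guanghua Campus, Southwestern University of Finance and Economics
Organizers: Joint Laboratory of Data Science and Business Intelligence; School of Statistics; Office of Scientific Research
About the speaker: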
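Prof. Chen Xu received his doctorate from the Department of Statistics at the University of British Columbia, Canada, where he studied under Jiahua Chen (陳嘉驊), an internationally known statistician and Fellow of the Royal Society of Canada. After graduating he was a postdoctoral researcher at the Pennsylvania State University in the United States. He is currently a Distinguished Professor at Xi'an Jiaotong University and a tenured associate professor at the University of Ottawa, Canada. His long-term research concerns the foundational theory and methods of statistical machine learning for big data, with systematic and original contributions in feature screening and dimension reduction for big data, resampling theory and methods, and distributed statistical analysis. He has published more than 40 research papers in leading international journals, including the statistics journal Journal of the American Statistical Association, the machine learning journals Journal of Machine Learning Research and IEEE Transactions on Pattern Analysis & Machine Intelligence, and the multidisciplinary journal National Science Review. He has led a Canadian natural sciences Discovery Grant and a project under the National Key R&D Program of China, and has participated in a Major Program of the National Natural Science Foundation of China and a major research task of Peng Cheng Laboratory. His honors include the Best Student Paper Award of the Statistical Society of Canada (2010), the Distinguished Postdoctoral Supervisor Award of the Canadian Statistical Sciences Institute (2021), and first place in the inaugural Guangdong-Hong Kong-Macao Greater Bay Area International Algorithm Case Competition (2022). He currently serves as an associate editor of JASA and EJS, and has served as an editorial board member or guest editor for CJS, Neurocomputing, Survey Sampling, and other international journals.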
Abstract:
Feature screening is a powerful tool for modeling high-dimensional data. It aims to reduce dimensionality by removing most irrelevant features before a more elaborate analysis. When a dataset is massive in both sample size N and dimensionality p, classic screening methods become inefficient or even infeasible because of the high computational burden. In this paper, we propose a distributed screening method for the large-N-large-p setup. The new method is built upon an ADMM updating procedure for L0-constrained consensus regression, in which the data are processed in m manageable segments by multiple local computers. In the procedure, the local computers improve the screening results iteratively by communicating with each other via a global computer. The joint effects between features are also accounted for naturally in the screening process. The method thus provides a computationally viable and reliable route for screening features with big data. Under mild conditions, we show that the proposed updating procedure is convergent and leads to accurate screening even when m = o(N). Moreover, with a proper starting value, the procedure enjoys the sure screening property within a finite number of iterations. The promising performance of the method is supported by extensive numerical studies.
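To make the updating procedure in the abstract more concrete, here is a minimal sketch of a generic consensus ADMM loop with a hard-thresholding (L0) step, written with NumPy. It is not the speaker's implementation. The function names, the least-squares loss scaled by the local sample size, the penalty parameter `rho`, the zero starting value, and the fixed iteration count are all assumptions made for illustration. Each "local computer" is simulated as one data segment inside a single process.

```python
import numpy as np


def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and set the rest to zero."""
    out = np.zeros_like(v)
    keep = np.argpartition(np.abs(v), -k)[-k:]
    out[keep] = v[keep]
    return out


def distributed_hard_screening(X_parts, y_parts, k, rho=1.0, n_iter=50):
    """Illustrative consensus ADMM with an L0 constraint (hypothetical helper).

    X_parts, y_parts: lists holding the m data segments.
    k: number of features to retain after screening.
    Returns the indices of the retained features.
    """
    m = len(X_parts)
    p = X_parts[0].shape[1]
    z = np.zeros(p)        # global (consensus) coefficients
    u = np.zeros((m, p))   # scaled dual variables, one row per segment
    b = np.zeros((m, p))   # local coefficients, one row per segment

    # Each segment precomputes what it needs for its ridge-type local update.
    local = []
    for X, y in zip(X_parts, y_parts):
        n = X.shape[0]
        A_inv = np.linalg.inv(X.T @ X / n + rho * np.eye(p))
        local.append((A_inv, X.T @ y / n))

    for _ in range(n_iter):
        # Local step: each segment solves a regularized least-squares problem.
        for i, (A_inv, c) in enumerate(local):
            b[i] = A_inv @ (c + rho * (z - u[i]))
        # Global step: average the local messages, then keep the top k entries.
        z = hard_threshold((b + u).mean(axis=0), k)
        # Dual step: each segment records its disagreement with the consensus.
        u += b - z
    return np.flatnonzero(z)


def demo(seed=0):
    """Simulated example: 5 relevant features out of 50, data split in 4 segments."""
    rng = np.random.default_rng(seed)
    N, p, m = 2000, 50, 4
    truth = [0, 1, 2, 3, 4]
    beta = np.zeros(p)
    beta[truth] = [3.0, -2.5, 2.0, -3.0, 2.5]
    X = rng.standard_normal((N, p))
    y = X @ beta + rng.standard_normal(N)
    selected = distributed_hard_screening(
        np.array_split(X, m), np.array_split(y, m), k=10
    )
    return selected, truth
```

In this sketch only p-dimensional vectors pass between the segments and the global step, never the raw data. Choosing `k` larger than the expected number of relevant features reflects the screening goal of retaining every relevant feature while discarding most irrelevant ones.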