Fifty Years of Data Science

Vahidi-Asl, Mohammad Qasem (وحیدی اصل, محمد‌قاسم)

doi:10.30504/mct.2023.1398.1982

Article type: Translation

Author

Mohammad Qasem Vahidi-Asl (محمد‌قاسم وحیدی اصل)

Faculty of Mathematical Sciences, Shahid Beheshti University

Abstract

More than 50 years ago, John Tukey called for a reformation of academic statistics. In "The Future of Data Analysis," he pointed to the existence of a science that was not yet recognized at the time, whose subject of interest was learning from data, or "data analysis." A recent and growing phenomenon is the emergence of "data science" programs at major universities. This article reviews some ingredients of the current "data science moment," including recent commentary on data science in the popular media and on whether, and if so how, data science really differs from statistics. The current conception of data science is a superset of the fields of statistics and machine learning, with some technology added for "scaling up" to "big data." This choice of superset is motivated by commercial rather than intellectual developments. Choosing in this way is likely to miss the important intellectual events of the next 50 years. Because all of science itself will soon become data that can be mined, the imminent revolution in data science is not limited to "scaling up"; it lies instead in the emergence of scientific studies of data analysis across all of science. I present a vision of data science based on the activities of people who are "learning from data," and I describe an academic field dedicated to improving that activity in an evidence-based manner. This new field is a better academic enlargement of statistics and machine learning than today's data science initiatives, while still being able to accommodate the same short-term goals.

Keywords

Data science

Machine learning

Statistics

Data analysis

Predictive modeling

Subjects

Donoho, D., 50 Years of Data Science, J. Comput. Graph. Statist., 26 (2017), no. 4, 745-766.
