Mathematical Culture and Thought

50 Years of Data Science

Document Type: Translation

Author
Shahid Beheshti University, Iran
Abstract
More than 50 years ago, J. Tukey called for a reformation of academic statistics. In “The future of data analysis,” he pointed to the existence of an as-yet unrecognized science, whose subject of interest was learning from data, or “data analysis.” A recent and growing phenomenon has been the emergence of “data science” programs at major universities. This article reviews some ingredients of the current “data science moment,” including recent commentary about data science in the popular media, and about how/whether data science is really different from statistics. The now-contemplated field of data science amounts to a superset of the fields of statistics and machine learning, which adds some technology for “scaling up” to “big data.” Because all of science itself will soon become data that can be mined, the imminent revolution in data science is not about mere “scaling up,” but instead the emergence of scientific studies of data analysis science-wide. I present a vision of data science based on the activities of people who are “learning from data,” and I describe an academic field dedicated to improving that activity in an evidence-based manner. This new field is a better academic enlargement of statistics and machine learning than today’s data science initiatives, while being able to accommodate the same short-term goals.

Donoho, D., 50 Years of Data Science, J. Comput. Graph. Statist., 26 (2017), no. 4, 745-766.

  • Receive Date 14 June 2023
  • Accept Date 27 June 2023
  • Publish Date 22 July 2024