Big Data Software Engineering

Major Venues

  1. International Workshop on Big Data Software Engineering (IEEE/ACM)
  2. International Conference on Very Large Data Bases (ACM)
  3. International Conference on Management of Data (ACM)
  4. International Conference on Big Data and Cloud Computing (IEEE)
  5. International Workshop on Large Scale Testing (ACM)

Big Data Analysis

  1. Fan, J., Han, F., & Liu, H., Challenges of big data analysis, National Science Review 1(2), pp. 293-314, 2014.
  2. Labrinidis, A., & Jagadish, H. V., Challenges and opportunities with big data, In Proceedings of the VLDB Endowment, 5(12), pp. 2032-2033, 2012.
  3. Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U., The rise of “big data” on cloud computing: review and open research issues, Information Systems 47, pp. 98-115, 2015.
  4. Rabl, T., Gómez-Villamor, S., Sadoghi, M., Muntés-Mulero, V., Jacobsen, H. A., & Mankovskii, S., Solving big data challenges for enterprise application performance management, In Proceedings of the VLDB Endowment, 5(12), pp. 1724-1735, 2012.
  5. Pavlo, A., Paulson, E., Rasin, A., Abadi, D. J., DeWitt, D. J., Madden, S., & Stonebraker, M., A comparison of approaches to large-scale data analysis, In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pp. 165-178, 2009.
  6. Lazer, D., Kennedy, R., King, G., & Vespignani, A., The parable of Google Flu: traps in big data analysis, Science 343(6176), pp. 1203-1205, 2014.
  7. Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J. M., & Welton, C., MAD skills: new analysis practices for big data, In Proceedings of the VLDB Endowment 2(2), pp. 1481-1492, 2009.
  8. Hu, H., Wen, Y., Chua, T. S., & Li, X., Toward scalable systems for big data analytics: A technology tutorial, IEEE Access 2, pp. 652-687, 2014.
  9. Rosa, A., Chen, L. Y., & Binder, W., Predicting and Mitigating Jobs Failures in Big Data Clusters, In 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'15), pp. 221-230, 2015.
  10. Rabl, T., Danisch, M., Frank, M., Schindler, S., & Jacobsen, H. A., Just can't get enough: Synthesizing Big Data, In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1457-1462, 2015.
  11. Otero, C. E., & Peter, A., Research Directions for Engineering Big Data Analytics Software, IEEE Intelligent Systems 30(1), pp. 13-19, 2015.
  12. Kambatla, K., Kollias, G., Kumar, V., & Grama, A., Trends in big data analytics, Journal of Parallel and Distributed Computing 74(7), pp. 2561-2573, 2014.
  13. Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A., & Buyya, R., Big Data computing and clouds: Trends and future directions, Journal of Parallel and Distributed Computing 79, pp. 3-15, 2015.
  14. Yang, Q., & Wu, X., 10 challenging problems in data mining research, In International Journal of Information Technology & Decision Making 5(04), pp. 597-604, 2006.
  15. Borkar, V., Carey, M. J., & Li, C., Inside Big Data management: ogres, onions, or parfaits?, In Proceedings of the 15th International Conference on Extending Database Technology, pp. 3-14, 2012.

Big Data Computing

  1. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., & Rasin, A., HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads, In Proceedings of the VLDB Endowment, 2(1), pp. 922-933, 2009.
  2. Alexandrov, A., Heimel, M., Markl, V., Battré, D., Hueske, F., Nijkamp, E., & Warneke, D., Massively parallel data analysis with PACTs on Nephele, In Proceedings of the VLDB Endowment 3(1-2), pp. 1625-1628, 2010.
  3. Alexandrov, A., Tzoumas, K., & Markl, V., Myriad: scalable and expressive data generation, In Proceedings of the VLDB Endowment 5(12), pp. 1890-1893, 2012.
  4. Torlak, E., Scalable test data generation from multidimensional models, In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, p. 36, 2012.
  5. Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Anthony, S., & Murthy, R., Hive: a warehousing solution over a map-reduce framework, In Proceedings of the VLDB Endowment 2(2), pp. 1626-1629, 2009.
  6. Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., & Jacobsen, H. A., BigBench: towards an industry standard benchmark for big data analytics, In Proceedings of the 2013 ACM SIGMOD international conference on Management of data, pp. 1197-1208, 2013.
  7. Rabl, T., Frank, M., Danisch, M., Jacobsen, H. A., & Gowda, B., The Vision of BigBench 2.0, In Proceedings of the Fourth Workshop on Data analytics in the Cloud, p. 3, 2015.
  8. Huang, S., Huang, J., Dai, J., Xie, T., & Huang, B., The HiBench benchmark suite: Characterization of the MapReduce-based data analysis, In 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW), pp. 41-51, 2010.
  9. Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., & Gruber, R. E., Bigtable: A distributed storage system for structured data, In ACM Transactions on Computer Systems (TOCS) 26(2), Article 4, 2008.
  10. Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F. B., & Babu, S., Starfish: A Self-tuning System for Big Data Analytics, In CIDR, Vol. 11, pp. 261-272, 2011.
  11. Megler, V. M., & Maier, D., When big data leads to lost data, In Proceedings of the 5th Ph.D. Workshop on Information and Knowledge, pp. 1-8, 2012.
  12. Ghemawat, S., Gobioff, H., & Leung, S. T., The Google file system, In ACM SIGOPS operating systems review, Vol. 37, No. 5, pp. 29-43, 2003.
  13. Alsubaiee, S., Altowim, Y., Altwaijry, H., Behm, A., Borkar, V., Bu, Y., & Wen, J., ASTERIX: an open source system for Big Data management and analysis, In Proceedings of the VLDB Endowment 5(12), pp. 1898-1901, 2012.
  14. Alexandrov, A., Bergmann, R., Ewen, S., Freytag, J. C., Hueske, F., Heise, A., & Warneke, D., The Stratosphere platform for big data analytics, In The VLDB Journal 23(6), pp. 939-964, 2014.
  15. Alexandrov, A., Schiefer, B., Poelman, J., Ewen, S., Bodner, T. O., & Markl, V., Myriad: parallel data generation on shared-nothing architectures, In Proceedings of the 1st Workshop on Architectures and Systems for Big Data, pp. 30-33, 2011.
  16. Arlitt, M., Marwah, M., Bellala, G., Shah, A., Healey, J., & Vandiver, B., IoTAbench: an Internet of Things Analytics benchmark, In Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, pp. 133-144, 2015.
  17. Nowling, R. J., & Vyas, J., A Domain-Driven, Generative Data Model for Big Pet Store, In IEEE Fourth International Conference on Big Data and Cloud Computing (BdCloud'14), pp. 49-55, 2014.
  18. Shvachko, K., Kuang, H., Radia, S., & Chansler, R., The hadoop distributed file system, In 26th Symposium on Mass Storage Systems and Technologies (MSST'10), pp. 1-10, 2010.
  19. Abouzied, A., Bajda-Pawlikowski, K., Huang, J., Abadi, D. J., & Silberschatz, A., HadoopDB in action: building real world applications, In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pp. 1111-1114, 2010.
  20. Borthakur, D., Gray, J., Sarma, J. S., Muthukkaruppan, K., Spiegelberg, N., Kuang, H., & Aiyer, A., Apache Hadoop goes realtime at Facebook, In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pp. 1071-1080, 2011.
  21. Borthakur, D., The hadoop distributed file system: Architecture and design, Hadoop Project Website 11(2007), p. 21, 2007.
  22. Huang, J., Zhang, X., & Schwan, K., Understanding issue correlations: a case study of the Hadoop system, In Proceedings of the 6th ACM Symposium on Cloud Computing, pp. 2-15, 2015.
  23. Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I., Spark: cluster computing with working sets, In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, Vol. 10, p. 10, 2010.
  24. Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., & Zaharia, M., Spark SQL: Relational data processing in Spark, In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1383-1394, 2015.
  25. Dean, J., & Ghemawat, S., MapReduce: simplified data processing on large clusters, In Communications of the ACM 51(1), pp. 107-113, 2008.
  26. Dittrich, J., & Quiané-Ruiz, J. A., Efficient big data processing in Hadoop MapReduce, In Proceedings of the VLDB Endowment 5(12), pp. 2014-2015, 2012.
  27. Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Zhang, N., & Murthy, R., Hive-a petabyte scale data warehouse using hadoop, In 2010 IEEE 26th International Conference on Data Engineering (ICDE), pp. 996-1005, 2010.
  28. Olston, C., Reed, B., Srivastava, U., Kumar, R., & Tomkins, A., Pig latin: a not-so-foreign language for data processing, In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 1099-1110, 2008.

Big Data Software Engineering

  1. Sneed, H. M., & Erdoes, K., Testing big data (Assuring the quality of large databases), IEEE 8th International Conference on Software Testing, Verification and Validation Workshops (ICSTW'15), pp. 1-6, 2015.
  2. Yesudas, M., & Nair, S. K., High-Volume Performance Test Framework using Big Data, In Proceedings of the 4th International Workshop on Large-Scale Testing, pp. 13-16, 2015.
  3. Li, N., Escalona, A., Guo, Y., & Offutt, J., A Scalable Big Data Test Framework, IEEE 8th International Conference on Software Testing, Verification and Validation (ICST'15), pp. 1-2, 2015.
  4. Alexandrov, A., Brücke, C., & Markl, V., Issues in big data testing and benchmarking, In Proceedings of the 6th International Workshop on Testing Database Systems, p. 1, 2013.
  5. Morán, J., Riva, C. D. L., & Tuya, J., Testing data transformations in MapReduce programs, In Proceedings of the 6th International Workshop on Automating Test Case Design, Selection and Evaluation, pp. 20-25, 2015.
  6. Liu, Z., Research of performance test technology for big data applications, IEEE International Conference on Information and Automation (ICIA'14), pp. 53-58, 2014.
  7. Anderson, K. M., Embrace the challenges: software engineering in a big data world, In Proceedings of the First International Workshop on BIG Data Software Engineering (BIGDSE'15), pp. 19-25, 2015.
  8. DeLine, R., Research opportunities for the big data era of software engineering, In IEEE/ACM 1st International Workshop on Big Data Software Engineering (BIGDSE'15), pp. 26-29, 2015.
  9. Madhavji, N. H., Miranskyy, A., & Kontogiannis, K., Big picture of big data software engineering: with example research challenges, In Proceedings of the First International Workshop on BIG Data Software Engineering, pp. 11-14, 2015.

Big Data Software Tools

  1. Data Mining Software: Weka, Mozenda, R Programming, Orange, NLTK
  2. Data Visualization Software: D3.js, Highcharts, jHepWork
  3. Data Analytics Software: RapidMiner, KNIME
  4. Big Data Compute Tools: Hadoop, MapReduce, Spark, HDFS, Hive, Pig

Data Mining - Association Rule Mining

  1. Association Analysis: Basic Concepts and Algorithms: Tutorial
  2. A Survey of Association Rules: Survey Paper
  3. Measures for Predictive Analysis Rules: Comparing Rule Measures for Predictive Association Rules
  4. Association Rule: Lecture Notes # 1
  5. Association Rule: Lecture Notes # 2
  6. Association Rule: Lecture Notes # 3
  7. Using Apriori Algorithm in Weka: Tutorial
  8. Measures for Association Rules: A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules
  9. CAR: Classification and Association Rule Mining: Integrating Classification and Association Rule Mining

Association Rule Mining: Online Video tutorials

  1. WekaMOOC : Data Mining with Weka
  2. Weka: Creating Training, Validation and Test Sets (Data Preprocessing)

Association Rule Mining: Sample Source Code

  1. Apriori algorithm: for mining frequent itemsets: Sample code # 1
  2. Apriori algorithm: frequent itemset generation in Java: Sample code # 2

Classification in Data Mining

  1. Classification: Basic Concepts, Decision Trees and Model Evaluation Chapter 4: Reference book

Big Data Benchmarks

  1. Wang, L., Zhan, J., Luo, C., Zhu, Y., Yang, Q., He, Y., & Zheng, C., BigDataBench: a Big Data Benchmark Suite from Internet Services, IEEE 20th International Symposium on High Performance Computer Architecture, pp. 488-499, 2014.
  2. Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., & Jacobsen, H. A., BigBench: towards an industry standard benchmark for big data analytics, In Proceedings of the 2013 ACM SIGMOD international conference on Management of data, pp. 1197-1208, 2013.
  3. Alexandrov, A., Brücke, C., & Markl, V., Issues in big data testing and benchmarking, In Proceedings of the Sixth International Workshop on Testing Database Systems, p. 1, 2013.
  4. Han, R., Lu, X., & Xu, J., On Big Data Benchmarking, In Big Data Benchmarks, Performance Optimization, and Emerging Hardware, Springer International Publishing, pp. 3-18, 2014.

Notes

    Libraries used:

    1. Google Scholar
    2. IEEE
    3. ACM
    4. Springer
    5. Science Journal
    6. www.libguides.uta.edu/CSE/

    Keywords used:

    Big Data, Big Data Analysis, Big Data Analytics, big data tools, Big data review, testing big data, big data architecture, large scale data, big data challenges, hadoop, mapreduce, big data benchmarking, big data software engineering, big data software, issues in big data, issues in data mining

    Criteria for Shortlisting Papers

    Search the libraries listed above using the keywords above, then shortlist papers in three passes:

    1. Check the title. Shortlist the paper if the title indicates a Big Data overview, big data analytics, or research issues and current problems in working with Big Data software. The focus should be on big data software as a whole, not on the compute architecture. If the title is unclear, read the abstract and skim the paper to find out what it covers.
    2. Check the venue. For each candidate paper, see whether it appeared in a good conference or journal; some relevant venues for Big Data research are listed above. If the paper is from one of these venues, shortlist it.
    3. Review the remaining papers that did not make the cut, in case any were overlooked.
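
The shortlisting criteria could be sketched in code roughly as follows. This is only an illustration, not a tool used to build this list: the `Paper` record, the function names, and the abbreviated `KEYWORDS` and `VENUES` tuples are all assumptions, and simple substring matching stands in for the human judgment the criteria actually call for.

```python
from dataclasses import dataclass

# Hypothetical, abbreviated versions of the keyword and venue lists on this page.
KEYWORDS = ("big data", "hadoop", "mapreduce", "large scale data", "data mining")
VENUES = ("vldb", "management of data", "big data software engineering",
          "big data and cloud computing", "large-scale testing")


@dataclass
class Paper:
    """Hypothetical record holding the fields the criteria look at."""
    title: str
    venue: str
    abstract: str = ""


def matches_topic(paper):
    """First selection: the title (or the abstract, as a fallback) mentions a keyword."""
    text = f"{paper.title} {paper.abstract}".lower()
    return any(keyword in text for keyword in KEYWORDS)


def from_known_venue(paper):
    """Second selection: the paper appeared in one of the identified venues."""
    venue = paper.venue.lower()
    return any(name in venue for name in VENUES)


def shortlist(papers):
    """Return (selected, review_again); review_again feeds the third, manual pass."""
    candidates = [p for p in papers if matches_topic(p)]
    selected = [p for p in candidates if from_known_venue(p)]
    review_again = [p for p in papers if p not in selected]
    return selected, review_again
```

Calling `shortlist(papers)` on a list of `Paper` records returns the shortlisted papers together with everything else, so the third pass can still be done by hand on the second list.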