This assessment is aimed at university-level faculty and students who want to gauge their knowledge of data science and big data. It is a beginner-level assessment: the questions cover basic, general principles and practices of data science and big data.
Our learning objectives were guided in part by the Pittsburgh Supercomputing Center's Big Data Workshop lesson material, the Coursera Big Data Specialization developed by the San Diego Supercomputer Center, Wikipedia, and other sources. The objectives listed below are representative of the skills and knowledge expected of a big data scientist:
- Briefly list some of the foundations, frameworks and applications of the emerging field of data science.
- Describe the Big Data landscape, including examples of real-world big data problems and approaches.
- Identify the high level components in the data science lifecycle and associated data flow.
- Explain the V’s of Big Data and why each impacts the collection, monitoring, storage, analysis and reporting.
- Summarize the features and significance of the Hadoop Distributed File System (HDFS) and the MapReduce programming model, and how they relate to working with Big Data.
- Identify and assess the needs of an organization for a data science task.
- Collect and manage data to devise solutions to data science tasks.
- Select, apply, and evaluate models to devise solutions to data science tasks.
- Interpret data science analysis outcomes.
- Communicate data science-related information effectively in various formats to appropriate audiences.
- Value and safeguard the ethical use of data in all aspects of their profession.
- Perform basic operations to acquire, explore, prepare, analyze, report, and act on data resources.
- Briefly describe big data and the challenges of capturing, storing and retrieving massive data.
- Briefly describe the application programming interface (API) ecosystem and data infrastructure that supports data acquisition, storage, retrieval and analysis.
- Describe the issues relevant to the application of a data-based analytical approach to identify and solve problems.
The Beginner Big Data Badge consists of a relatively simple 10-question quiz with basic questions about data science and big data. The quiz has no time limit and allows up to 5 submissions.
Part 1: Knowledge Assessment
This part consists of a 15-question quiz with more difficult questions. The quiz has a time limit and allows only 2 submissions.
Part 2: Practical Assessment
For this part of the badge, the user will need to perform a basic word count exercise on two different documents: 1) a document provided by us, and 2) a document of the user's own choosing.
To allow us to assess the user's performance, the user will need to submit the resulting word count document and document the steps with screen captures and descriptions as the user proceeds. The text file provided for the first word count is the complete works of William Shakespeare. The user will then perform another word count on a separate document and provide the results as part of the submission.
We have tested this word count exercise for document 1 on several different systems including XSEDE's Comet and Bridges HPC systems as well as Cloudera VM with CentOS 6.7.
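Before submitting the resulting word count document, it can help to sanity-check it. The following is a minimal Python sketch, assuming the result file has one tab-separated `word<TAB>count` pair per line, which is the default text output of a Hadoop job. The helper names `parse_counts` and `top_words` and the sample lines are hypothetical and are not part of the provided material.

```python
def parse_counts(lines):
    """Parse 'word<TAB>count' lines into a dict (assumed output layout)."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the count is always the final field.
        word, _, value = line.rpartition("\t")
        counts[word] = int(value)
    return counts


def top_words(counts, n=3):
    """Return the n most frequent words, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Hypothetical sample lines standing in for a real result file.
sample_output = ["be\t2\n", "or\t1\n", "to\t2\n"]
print(top_words(parse_counts(sample_output), n=2))  # [('be', 2), ('to', 2)]
```

With a real result file, the lines could be read with `open(path)` and passed to `parse_counts` in the same way.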
Step 1: Perform the word count using Hadoop and MapReduce with the provided Java source code and text file.
The user will perform the necessary steps for a word count using Hadoop MapReduce and HDFS, and provide screen captures of each of the main steps. The captures should include glimpses of the user's desktop to validate that the work was done on the user's own system.
The user should also create a document describing the main steps performed to accomplish this task, including the commands used. This document should be a maximum of 1000 words.
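For readers new to the model, the logic behind a MapReduce word count can be sketched in a few lines of plain Python. This is an illustration only, not the provided Java source: it runs in a single process without Hadoop or HDFS, and the function names and the tokenization rule (lowercased runs of letters and apostrophes) are assumptions chosen for the example.

```python
import re
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every token in one input line."""
    # Assumption: a token is a lowercased run of letters or apostrophes.
    return [(word, 1) for word in re.findall(r"[a-z']+", line.lower())]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum the values collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(word, values) for word, values in grouped.items())


sample = ["To be, or not to be", "that is the question"]
counts = word_count(sample)
print(counts["to"], counts["be"], counts["question"])  # 2 2 1
```

In a real Hadoop job the mappers and reducers run in parallel across the cluster and the framework performs the shuffle. A different tokenization rule, for example splitting on whitespace only, would produce different counts for words followed by punctuation.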
Step 2: Perform the word count on a different document of the user's own choosing.
The document of the user's choice must be an ASCII text document between 10 MB and 50 MB in size. The user should submit the same type of screen captures and descriptive document as in Step 1, and also include the text document used for the word count.
We will evaluate the user's performance on both Parts 1 and 2 for the Intermediate Badge and report the results to the user in a timely manner.
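A candidate document can be checked against these requirements before running the word count. The sketch below is one possible check, assuming the size limits are meant as decimal megabytes (1 MB = 1,000,000 bytes); the function name `check_document` and its parameters are hypothetical.

```python
import os

# Assumption: limits are decimal megabytes (1 MB = 1,000,000 bytes).
MIN_BYTES = 10 * 1000 * 1000
MAX_BYTES = 50 * 1000 * 1000


def check_document(path, min_bytes=MIN_BYTES, max_bytes=MAX_BYTES):
    """Return (size_ok, ascii_ok) for the file at path."""
    size = os.path.getsize(path)
    size_ok = min_bytes <= size <= max_bytes

    ascii_ok = True
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large files are not loaded all at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            if not chunk.isascii():
                ascii_ok = False
                break
    return size_ok, ascii_ok
```

Calling `check_document("my_text.txt")` on a suitable file (the file name here is only a placeholder) would return `(True, True)`.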
Feedback for the Big Data Badge
John Urbanic kindly provided the following feedback:
I went through both badges (but did not submit the written assignment). I think this is a solid format and a good start. My two questions are:
My response to John's feedback:
1) It would be a good problem to have, of course. I'm guessing one workshop would present a short-term challenge, but I'm up for it. I would hope that a submission for the Big Data Intermediate Badge would take no more than 10-15 minutes to grade. 20 submissions would then mean roughly 3-5 hours. Maybe a student intern can help with grading. It would be great if there were a way to automate the process. Where are the machine learning folks when we need them? :-)