
This beginner-level assessment is targeted toward university-level faculty and students who want to gauge their knowledge of data science and big data. Questions assess basic data science and big data principles and practices of a general nature.

Learning Objectives

Our set of learning objectives was guided in part by the Pittsburgh Supercomputing Center's Big Data Workshop lesson material, the Coursera Big Data Specialization developed by the San Diego Supercomputer Center, Wikipedia, and other sources. The objectives below are representative of the skills and knowledge expected of a big data scientist:

  • Briefly list some of the foundations, frameworks and applications of the emerging field of data science.
  • Describe the Big Data landscape, including examples of real-world big data problems and approaches.
  • Identify the high level components in the data science lifecycle and associated data flow.
  • Explain the V’s of Big Data and why each impacts the collection, monitoring, storage, analysis, and reporting of data.
  • Summarize the features and significance of the HDFS file system and the MapReduce programming model and how they relate to working with Big Data.
  • Identify and assess the needs of an organization for a data science task.
  • Collect and manage data to devise solutions to data science tasks.
  • Select, apply, and evaluate models to devise solutions to data science tasks.
  • Interpret data science analysis outcomes.
  • Effectively communicate data science-related information in various formats to appropriate audiences.
  • Value and safeguard the ethical use of data in all aspects of their profession.
  • Perform basic operations to acquire, explore, prepare, analyze, report, and act on data resources.
  • Briefly describe big data and the challenges of capturing, storing, and retrieving massive data.
  • Briefly describe the application programming interface (API) ecosystem and data infrastructure that supports data acquisition, storage, retrieval, and analysis.
  • Describe the issues relevant to the application of a data-based analytical approach to identify and solve problems.

Beginner Badge

The Beginner Big Data Badge consists of a relatively simple 10-question quiz covering basic data science and big data topics. The quiz has no time limit and allows up to 5 submissions.

Intermediate Badge

Part 1: Knowledge Assessment

This part consists of a 15-question quiz with more difficult questions; it is timed and allows only 2 submissions.

Part 2: Practical Assessment

For this part of the badge, the user will need to perform a basic word count exercise on two different documents: 1) a document provided by us, and 2) a document of the user's own choosing.

To allow us to assess performance, the user will need to submit the resulting word count output and document the steps with screen captures and descriptions as the user proceeds. The provided text file for the first word count is the complete works of William Shakespeare. The user will then perform another word count on a separate document and provide those results as part of the submission.

We have tested this word count exercise for document 1 on several different systems, including XSEDE's Comet and Bridges HPC systems as well as a Cloudera VM running CentOS 6.7.

Step 1: Perform the word count using Hadoop and MapReduce with the provided Java source code and text file. 

Download the Java source.

Download the complete works of Shakespeare text file.

The user will perform the necessary steps for a word count using Hadoop MapReduce and HDFS, providing screen captures of each of the main steps. Screen captures should include glimpses of the user's desktop to validate that the work was performed on the user's own system.

Create a document describing the main steps performed to accomplish this task, including the commands used. This document should be a maximum of 1000 words.
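For reference, the computation the provided Java MapReduce job performs can be sketched in plain Python. This is only an illustration of the map, shuffle, and reduce phases, not a substitute for running the job on Hadoop and HDFS as required, and the tokenization rule shown is an assumption (the provided Java source may split words differently):

```python
from collections import defaultdict
import re

def mapper(line):
    # Map phase: emit a (word, 1) pair for each word in the line.
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum the counts collected for one word.
    return word, sum(counts)

def word_count(lines):
    # Shuffle: group mapper output by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            groups[word].append(one)
    return dict(reducer(w, c) for w, c in sorted(groups.items()))

print(word_count(["To be or not to be", "that is the question"]))
# → {'be': 2, 'is': 1, 'not': 1, 'or': 1, 'question': 1, 'that': 1, 'the': 1, 'to': 2}
```

In Hadoop, the shuffle step is handled by the framework between the map and reduce tasks; the sketch above merely makes that grouping explicit.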


Step 2: Perform the word count on a different document of the user's own choosing. 

The document of the user's choice on which the word count is performed must be an ASCII text document between 10 MB and 50 MB. Submit the same type of screen captures and descriptive document as in Step 1, and also include the text document used for the word count.
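A user can verify that a candidate document meets the size and encoding requirements before starting. The helper below is a hypothetical convenience, not part of the official submission; the function name and thresholds are illustrative:

```python
import os

def check_document(path, min_mb=10, max_mb=50):
    """Check that a candidate file is ASCII text between min_mb and max_mb."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if not (min_mb <= size_mb <= max_mb):
        return False, f"size is {size_mb:.1f} MB, must be {min_mb}-{max_mb} MB"
    with open(path, "rb") as f:
        data = f.read()
    try:
        # ASCII decoding fails on any byte above 0x7F.
        data.decode("ascii")
    except UnicodeDecodeError:
        return False, "file contains non-ASCII bytes"
    return True, "ok"
```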

We will evaluate the user's performance based on both Parts 1 and 2 for the Intermediate Badge and inform the user of the results in a timely manner.


Feedback for the Big Data Badge

John Urbanic kindly provided the following feedback:

I went through both badges (but did not submit the written assignment). I think this is a solid format and a good start. My two questions are:

1) Who does the grading on the written assignment? I assume you. How much business can we scare up before you regret that? Your current problem actually looks pretty friendly to evaluate, so it should scale well.
2) Are we intending for the questions to align perfectly with any particular content? The current version has a lot of overlap with our Big Data course, but not completely. Maybe that is OK, maybe we want to converge. Just let me know so my criticism in that regard is on target.
Again, great work.

My response to John's feedback:

1) It would be a good problem to have, of course. I'm guessing one workshop would present a short-term challenge, but I'm up for it. I would hope that a submission for the Big Data Intermediate Badge would take no more than 10-15 minutes to grade. 20 submissions means 4-5 hours. Maybe a student intern can help with grading. It would be great if there were a way to automate the process. Where are the machine learning folks when we need them?  :-)

MOOC platforms like Coursera offer a way to define regular expressions that search the output of a student's response for particular patterns, but that only checks the results of a file output and won't tell us anything about the student's full understanding of the process. It's easy enough to run the code without using Hadoop to get the correct results and thus cheat to get the badge.
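To make the limitation concrete, a regex-based auto-check of submitted word-count output might look like the sketch below. The expected counts here are placeholders for illustration, not real Shakespeare counts, and the line format assumes Hadoop's default tab-separated "word<TAB>count" text output; as noted above, such a check inspects only the output file, not how it was produced:

```python
import re

# Placeholder expected counts, purely for illustration.
EXPECTED = {"the": 100, "and": 50}

def passes_check(output_text, expected=EXPECTED):
    """Return True if every expected 'word<TAB>count' line appears in the output."""
    for word, count in expected.items():
        # Look for a line like "the\t100" anywhere in the output.
        if not re.search(rf"^{re.escape(word)}\t{count}$", output_text, re.MULTILINE):
            return False
    return True
```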
2) That is a good question. My feeling is yes, we should align more with a specific workshop. I think that's what badges are supposed to be all about; you jump through a specific hoop before you get the badge. That's why the DataViz badge is a bit out of place. The problem is that we don't give many data viz workshops, so we thought the badge should be more generic, not specific to any particular workshop or application. There appear to be frequent enough workshops on data science and big data to make alignment easier.

The PSC workshop I attended in March was good even though it had a few bumps, but it would have been useful for me to review a recording to clarify some of the slide content and help refine the quiz questions a little bit more. I was unable to attend the more recent two-day workshop in May unfortunately, but that would probably align well with the intermediate and an advanced badge.
Based on the above feedback, I recommend proceeding to promote the badge at the upcoming XSEDE Big Data Workshop/Webinar.

Update on Big Data Badge Awards

In January 2018, five successful attempts were made for the Big Data Beginner Badge.