As testers, we often have a love-hate relationship with data. Processing data is our applications’ main reason for being, and without data we cannot test. Yet data is often the root cause of testing issues: we don’t always have the data we need, so test cases get blocked and defects are returned as “data issues”.
Data has grown exponentially over the last few years and continues to grow. We began testing with megabytes and gigabytes and now terabytes and petabytes have joined the data landscape. Data is now the elephant in the room, and where is it leading us? Testers, welcome to the brave new world of Big Data!
What is Big Data?
Big Data has lots of definitions; it is a term often used to describe both volume and process. Sometimes the term Big Data is used to refer to the approaches and tools used for processing large amounts of data. Wikipedia defines it as “an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.” Gartner defines big data as “high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.” Big data usually refers to at least five petabytes (5,000,000,000 megabytes).
However, Big Data is more than just size. Its most significant aspects are the four “V’s”: volume, the sheer amount of data; velocity, the speed at which new data is generated and transported; variety, the many types of data; and veracity, the data’s accuracy and quality.
Testers, can you see some, make that many, test scenarios here? Yes, big data means big testing. In addition to ensuring data quality, we need to make sure that our applications can effectively process this much data. However, before we can plan our big testing, we need to learn more about the brave new world of big data.
Big Data is usually unstructured, which means that it does not have a defined data model. It does not fit neatly into organized columns and rows. Although much unstructured big data comes from social media, such as Facebook posts and tweets, it can also take audio and visual forms. These include phone calls, instant messages, voice mails, pictures, videos, PDFs, geospatial data and slide shares. So it seems our big testing SUT (system under test) is actually a giant jellyfish!
Challenges of Big Data Testing
Testing Big Data is like testing a jellyfish: because of the sheer amount of data and its unstructured nature, the test process is difficult to define. Automation is required, and although there are many tools, they are complex and require technical skills for troubleshooting. Performance testing is also exceedingly complex, given the velocity at which the data is processed.
Testing the Jellyfish
At the highest level, the big data test approach involves both functional and non-functional components. Functional testing includes validating both the quality of the data itself and the processing of it. Test scenarios in data quality include completeness, correctness, lack of duplication, etc. Data processing can be done in three ways: interactive, real-time and batch. However, they all involve movement of data. Therefore, all big data testing strategies are based on the extract, transform and load (ETL) process. Testing begins by validating the quality of the data coming from the source databases, then validates the transformation or process through which the data is structured, and finally validates the load into the data warehouse.
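To make the data quality scenarios a little more concrete, here is a minimal sketch in Python of a completeness and duplication check. It is only an illustration: the function name check_quality, the record layout and the "id" key are assumptions made for this example, and a real big data test would run the same kind of check inside the cluster rather than on an in-memory list.

```python
from collections import Counter

def check_quality(records, required_fields, key_field="id"):
    """Count incomplete records and list duplicated keys.

    `records` is a list of dicts; `required_fields` and `key_field`
    are hypothetical names chosen for this illustration.
    """
    # Completeness: every required field must be present and non-empty.
    incomplete = [
        r for r in records
        if any(r.get(f) in (None, "") for f in required_fields)
    ]
    # Lack of duplication: each key value should appear only once.
    key_counts = Counter(r.get(key_field) for r in records)
    duplicate_keys = [k for k, n in key_counts.items() if n > 1]
    return {"incomplete": len(incomplete), "duplicate_keys": duplicate_keys}

sample = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": ""},       # incomplete: empty name
    {"id": 1, "name": "carol"},  # duplicate id
]
report = check_quality(sample, ["id", "name"])
```

Correctness checks depend on the business rules of the application, so they are left out of this sketch.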
ETL testing has three phases. The first phase is data staging. Data staging is validated by comparing the data coming from the source systems to the data in the staged location. The next phase is the MapReduce validation, or validation of the transformation of the data. MapReduce is a programming model for processing very large data sets in parallel across a cluster: a map step turns input records into key-value pairs, and a reduce step combines the values for each key. Its best-known implementation is in Hadoop. This testing ensures that the business rules used to aggregate and segregate the data are working properly. The final ETL phase is the output validation phase, where the output files from MapReduce are ready to be moved to the data warehouse. In this stage, we verify that data integrity has been preserved and that the transformation is complete and correct. ETL testing, especially at the speed required for big data, requires automation. Luckily, there are tools for each phase of the ETL process; among the best known are MongoDB, Cassandra, Hadoop and Hive.
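The first two phases can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how Hadoop itself is invoked: the function names, the "region" and "amount" fields and the sum-per-region business rule are all made up for the example. The staging check compares record fingerprints between source and staged data, and the map and reduce functions mimic the MapReduce model so that the aggregated output can be compared with an expected result.

```python
import hashlib
from collections import defaultdict

def row_fingerprint(row):
    """Stable hash of one record; keys are sorted so field order is ignored."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def staging_diff(source_rows, staged_rows):
    """Phase 1: count records missing from, or unexpected in, the staged copy.

    Uses sets, so repeated identical rows are counted once.
    """
    src = {row_fingerprint(r) for r in source_rows}
    stg = {row_fingerprint(r) for r in staged_rows}
    return {"missing": len(src - stg), "unexpected": len(stg - src)}

def map_phase(rows):
    """Map: emit (key, value) pairs, here (region, amount)."""
    for r in rows:
        yield r["region"], r["amount"]

def reduce_phase(pairs):
    """Reduce: combine the values for each key, here by summing them."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

source = [
    {"region": "east", "amount": 10},
    {"region": "east", "amount": 20},
    {"region": "west", "amount": 5},
]
staged = source[:2]  # simulate one record lost during staging

diff = staging_diff(source, staged)
totals = reduce_phase(map_phase(source))
```

In a real test the expected totals would come from an independent calculation of the business rule, and the comparison would be the assertion that passes or fails the transformation.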
Do You Want To Be A Big Data Tester?
Testers, if you have a technical background, especially in Java, big data testing may be for you. You already have strong analytical skills; you will also need to become proficient in Hadoop and other Big Data tools. Big Data is a fast-growing technology and testers with this skill set are in demand. Why not take the challenge, be brave and embrace the brave new world of big data testing?