Earlier this month, a study was published indicating that the widely used “Reddit dataset” (released in 2015 by Jason Baumgartner) had significant, previously unidentified gaps. This study was surprising and incredibly relevant to me personally since I had personally used the Reddit dataset for an analysis blog post without ever validating that the data was complete or even representative. Clearly, this oversight on my part was not uncommon, as this study cites dozens of published, academic papers which mistakenly relied on the dataset. The authors of these papers failed to perform their “data due diligence” and therefore their results have been called into question.
The reliance on this faulty dataset is a clear example of a widespread problem: data scientists and analysts failing to thoroughly validate the integrity of the data that they rely on. As the authors of the study write in the corresponding Medium post, “…researchers have a duty to check the datasets rather than assume their quality on faith.” This is true of both academic researchers and data scientists and analysts who have chosen to work in the industry.
Too often, a data scientist might train and even deploy a machine learning model without first checking the validity of the training data. Similarly, a data analyst might write a report without vetting the underlying data to make sure that it is accurate and comprehensive. In the best case, this means re-training the model or re-running the report costing the data scientist or analyst hours or days of work. In the worst case, releasing a report with false data or a badly trained model can cost companies millions of dollars.
One tool that Qubole provides to enable our customers to quickly check the validity of their dataset is the lightning-fast Presto querying engine. With Presto, data scientists and analysts can quickly identify any gaps or discrepancies in their datasets and work to fix the issue. The real-time querying power of the engine means that users can do sanity checks on their data (like make sure that there are as many rows as expected) without sacrificing their valuable time. At Qubole, we consider data integrity in every aspect of product design and are actively working on packaging these features into a product offering.