This is the fourth #BestPractices blog post of a series, by Kevin Parker.
Introduction
Testing is crucial to ensuring the quality and reliability of applications. A strategic question that QA leaders must answer is which data to test against. One approach is to use production data for testing. This seems convenient but comes with serious challenges and risks. The other approach is to generate the test data based on key design considerations, known as Synthetic Data Generation. This blog post explores the pros and cons of using production data versus generated data, and explains why synthetically generated data is the best practice.
Production Data for Testing
Using production data has several advantages.
- Expediency: One primary advantage of using production data is the speed at which it can be put to use. Testers can quickly access real-world data to assess the application’s performance and functionality and to identify potential issues.
- Realistic Scenarios: Production data reflects actual user behavior and real-world scenarios.
- Large Dataset: Production databases often contain a vast amount of data, making it possible to perform thorough testing on various aspects of the application.
Unfortunately, using production data also has serious drawbacks.
- Privacy Concerns: Production data may contain sensitive information such as Personally Identifiable Information (PII) and Protected Health Information (PHI). To mitigate privacy risks, this data must be thoroughly anonymized or pseudonymized before using it for testing.
- Incomplete Dataset: The production database may not encompass every possible combination and permutation of data, which can lead to incomplete test coverage and potential issues going unnoticed.
- Format Incompatibility: When new features are added to the application, the production data may be insufficient or not in the right format, leading to compatibility issues and inaccurate testing results.
- Unsupported Data: The production data might include obsolete or unsupported elements that are no longer valid for the current version of the application. This can lead to misleading test outcomes.
- Data Restoration Challenges: Production databases can be massive, making it difficult to restore and refresh the data frequently. Outdated data may lead to testing against inaccurate or irrelevant information.
- Data Volatility: The production database is dynamic and continuously changing. Data that exists today may be unavailable tomorrow, causing inconsistency and unpredictability in test results.
Best Practices for Using Production Data
Despite these downsides, and although generated data is the better alternative, teams that do test with production data should follow these best practices.
- Data Masking and Anonymization: Before using production data, ensure that all sensitive information is properly anonymized or masked to protect user privacy. This process should be done meticulously, removing all personally identifiable details while retaining data integrity.
- Subset Selection: Instead of using the entire production dataset, consider creating subsets that cover the relevant test scenarios. This reduces the data size and simplifies the testing process.
- Data Conversion: When introducing new features, convert the production data into the appropriate format to match the updated application. This ensures accurate testing and validation of new functionalities.
- Version Control: Implement version control mechanisms to manage the test data effectively. This helps in tracking changes and ensuring data consistency across testing cycles.
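The first two practices above, masking and subset selection, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the sample rows, and the `SECRET_KEY` value are hypothetical, and a keyed hash is a form of pseudonymization rather than full anonymization, so real data may need stronger measures.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored outside the test environment.
SECRET_KEY = b"example-key-not-for-real-use"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace a sensitive value with a stable, keyed token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]


def mask_record(record: dict, sensitive_fields: set) -> dict:
    """Return a copy of the record with the sensitive fields pseudonymized."""
    return {
        field: pseudonymize(str(value)) if field in sensitive_fields else value
        for field, value in record.items()
    }


def select_subset(records: list, predicate) -> list:
    """Keep only the records that are relevant to a test scenario."""
    return [record for record in records if predicate(record)]


# Illustrative rows invented for this sketch, not real production data.
production_rows = [
    {"id": 1, "name": "Alice Example", "email": "alice@example.com", "country": "US"},
    {"id": 2, "name": "Bob Example", "email": "bob@example.com", "country": "DE"},
    {"id": 3, "name": "Alice Example", "email": "alice@example.com", "country": "US"},
]

# Subset first, then mask, so less sensitive data is handled overall.
us_rows = select_subset(production_rows, lambda row: row["country"] == "US")
masked_rows = [mask_record(row, {"name", "email"}) for row in us_rows]
```

Because the token is derived from the value and the key, the same customer maps to the same token in every row, which keeps relationships between records intact while hiding the original details.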
Overall Best Practice: Synthetic Data Generation
The overall best practice for providing test data is to design and generate it, an approach known as Synthetic Data Generation. The generated data is designed to encompass the relevant data combinations and scenarios, which ensures comprehensive test coverage.
AIQ has robust Synthetic Data Generation capabilities, including a vast library of fictional names, streets, cities, email addresses, colors, sizes, part numbers, etc. These can be generated in any combination to create representative test data. Further, AIQ can apply regular expressions (regex) so that test data conforms to a particular pattern, e.g., product codes or customer codes, and can create dates in the future (for a delivery date) or dates in the past (for a birth date).
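To make the idea concrete, here is a minimal Python sketch of pattern-based synthetic data generation. It is not AIQ's API: the value pools, the product-code pattern, the field names, and the fixed reference date are all assumptions chosen for illustration.

```python
import random
import re
import string
from datetime import date, timedelta

# Hypothetical value pools; a real library would be far larger.
FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret"]
CITIES = ["Springfield", "Riverton", "Lakeside"]

# Assumed product-code pattern: three capital letters, a dash, four digits.
PRODUCT_CODE_PATTERN = re.compile(r"[A-Z]{3}-\d{4}")


def product_code(rng: random.Random) -> str:
    """Build a code that conforms to PRODUCT_CODE_PATTERN by construction."""
    letters = "".join(rng.choice(string.ascii_uppercase) for _ in range(3))
    digits = "".join(rng.choice(string.digits) for _ in range(4))
    return f"{letters}-{digits}"


def generate_order(rng: random.Random, today: date) -> dict:
    """Create one fictional order with a past birth date and a future delivery date."""
    name = rng.choice(FIRST_NAMES)
    return {
        "customer": name,
        "email": f"{name.lower()}@example.com",
        "city": rng.choice(CITIES),
        "product_code": product_code(rng),
        "birth_date": today - timedelta(days=rng.randint(18 * 365, 90 * 365)),
        "delivery_date": today + timedelta(days=rng.randint(1, 30)),
    }


def generate_orders(count: int, seed: int = 42, today: date = date(2024, 1, 1)) -> list:
    """Generate a repeatable list of orders from a fixed seed and reference date."""
    rng = random.Random(seed)
    return [generate_order(rng, today) for _ in range(count)]


orders = generate_orders(5)
```

Fixing both the seed and the reference date means every run produces the same records, so test results can be compared from one cycle to the next.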
It is important to have test data that exercises all the corner cases (the limits of each data element's domain) as well as the valid and invalid combinations (positive and negative testing). The dataset must also be stable, so that results are repeatable from one test run to the next. AIQ’s test data generation provides that stable dataset.
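One way to picture corner cases and positive and negative combinations is the short Python sketch below. The quantity range, the size lists, and the validity rule are hypothetical examples, not rules taken from any particular application.

```python
from itertools import product


def boundary_values(minimum: int, maximum: int) -> list:
    """Values just outside, on, and just inside each end of the range."""
    candidates = [minimum - 1, minimum, minimum + 1, maximum - 1, maximum, maximum + 1]
    return sorted(set(candidates))


# Hypothetical rules for an order line: quantity from 1 to 100, size S, M, or L.
QUANTITY_RANGE = (1, 100)
VALID_SIZES = ["S", "M", "L"]
INVALID_SIZES = ["", "XXXL"]


def build_cases() -> list:
    """Combine every boundary quantity with every size and label the expected outcome."""
    low, high = QUANTITY_RANGE
    cases = []
    for quantity, size in product(boundary_values(low, high), VALID_SIZES + INVALID_SIZES):
        expected_valid = low <= quantity <= high and size in VALID_SIZES
        cases.append({"quantity": quantity, "size": size, "expected_valid": expected_valid})
    return cases


test_cases = build_cases()
positive_cases = [case for case in test_cases if case["expected_valid"]]
negative_cases = [case for case in test_cases if not case["expected_valid"]]
```

The cases are derived from the rules rather than sampled at random, so the dataset is identical on every run and each case carries its expected outcome for both positive and negative testing.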
Conclusion
While using production data can be a tempting choice due to its expedience and realism, it comes with significant challenges. Anonymizing sensitive data and selecting relevant subsets are crucial steps to ensure data integrity and privacy, but even with these steps the use of production data remains prone to failure. Instead, a well-designed and properly generated test data set is essential for identifying and resolving issues in software applications without compromising user privacy or data accuracy.