Pros & Cons of Using Production and Generated Data for Software Testing

This is the fourth #BestPractices blog post of a series, by Kevin Parker.


Testing is crucial to ensuring the quality and reliability of applications. A strategic question QA leaders must answer is what data to test against. One approach is to use production data for testing. This seems convenient, but it comes with a serious set of challenges and risks. The other approach is to generate the test data from key design considerations, so-called Synthetic Data Generation. This blog post explores the pros and cons of using production data versus generated data, and explains why synthetically generated data is the best practice.

Production Data for Testing

There are several pros to using production data.

  1. Expediency: One primary advantage of using production data is the speed at which it can be implemented. Testers can quickly access real-world data to assess the application’s performance and functionality and to identify potential issues.
  2. Realistic Scenarios: Production data reflects actual user behavior and real-world scenarios.
  3. Large Dataset: Production databases often contain a vast amount of data, making it possible to perform thorough testing on various aspects of the application.

Unfortunately, there are also serious cons to using production data.

  1. Privacy Concerns: Production data may contain sensitive information such as Personally Identifiable Information (PII) and Protected Health Information (PHI). To mitigate privacy risks, this data must be thoroughly anonymized or pseudonymized before using it for testing.
  2. Incomplete Dataset: The production database may not encompass every possible combination and permutation of data, which can lead to incomplete test coverage and potential issues going unnoticed.
  3. Format Incompatibility: When new features are added to the application, the production data may be insufficient or not in the right format, leading to compatibility issues and inaccurate testing results.
  4. Unsupported Data: The production data might include obsolete or unsupported elements that are no longer valid for the current version of the application. This can lead to misleading test outcomes.
  5. Data Restoration Challenges: Production databases can be massive, making it difficult to restore and refresh the data frequently. Outdated data may lead to testing against inaccurate or irrelevant information.
  6. Data Volatility: The production database is dynamic and continuously changing. Data that exists today may be unavailable tomorrow, causing inconsistency and unpredictability in test results.

Best Practices for Using Production Data

Notwithstanding these downsides, and the better alternative of generated data, if you must test against production data, here are some best practices to follow.

  1. Data Masking and Anonymization: Before using production data, ensure that all sensitive information is properly anonymized or masked to protect user privacy. This process should be done meticulously, removing all personally identifiable details while retaining data integrity.
  2. Subset Selection: Instead of using the entire production dataset, consider creating subsets that cover the relevant test scenarios. This reduces the data size and simplifies the testing process.
  3. Data Conversion: When introducing new features, convert the production data into the appropriate format to match the updated application. This ensures accurate testing and validation of new functionalities.
  4. Version Control: Implement version control mechanisms to manage the test data effectively. This helps in tracking changes and ensuring data consistency across testing cycles.
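To make the first practice concrete, here is a minimal sketch of deterministic data masking. The field names and the `mask_record` helper are hypothetical, not part of any AIQ API; the key idea shown is that hashing with a salt replaces PII irreversibly while the same input always maps to the same token, so referential integrity across tables is preserved.

```python
import hashlib

def mask_value(value: str, salt: str = "test-env") -> str:
    """Replace a sensitive value with a deterministic, irreversible token."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()
    return digest[:12]

def mask_record(record: dict, pii_fields: set) -> dict:
    """Return a copy of the record with PII fields masked, others untouched."""
    return {
        key: mask_value(str(val)) if key in pii_fields else val
        for key, val in record.items()
    }

customer = {"id": 42, "name": "Jane Doe", "email": "jane@example.com", "plan": "gold"}
masked = mask_record(customer, pii_fields={"name", "email"})
assert masked["plan"] == "gold"       # non-sensitive fields retained
assert masked["name"] != "Jane Doe"   # PII replaced with a token
```

Because the masking is deterministic, a customer ID that appears in two tables masks to the same token in both, which keeps joins in the test database working.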

Overall Best Practice: Synthetic Data Generation

The overall best practice for the provision of test data is to design and generate it, so-called Synthetic Data Generation. This involves generating synthetic data that encompasses various data combinations and scenarios, an approach that ensures comprehensive test coverage.

AIQ has robust Synthetic Data Generation capabilities, including a vast library of fictional names, streets, cities, email addresses, colors, sizes, part numbers, and more. These can be combined in any way to create representative test data. Further, AIQ can apply regular expressions (regex) so that generated values conform to a particular pattern, e.g., product codes or customer codes, or generate dates in the future (for a delivery date) or in the past (for a birth date).
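The general idea can be illustrated with a short sketch. This is not AIQ's implementation; the value lists and helper names are stand-ins, and the pattern `[A-Z]{3}-[0-9]{4}` is an assumed product-code format used only for illustration.

```python
import random
import string
from datetime import date, timedelta

random.seed(7)  # a fixed seed keeps the generated dataset stable across runs

FIRST_NAMES = ["Ava", "Liam", "Noah", "Mia"]      # stand-in fictional library
CITIES = ["Springfield", "Riverton", "Lakeside"]

def product_code() -> str:
    """Generate a code matching the pattern [A-Z]{3}-[0-9]{4}, e.g. 'QXJ-0482'."""
    letters = "".join(random.choices(string.ascii_uppercase, k=3))
    digits = "".join(random.choices(string.digits, k=4))
    return f"{letters}-{digits}"

def future_date(max_days: int = 90) -> date:
    """A delivery date up to max_days in the future."""
    return date.today() + timedelta(days=random.randint(1, max_days))

def past_date(min_years: int = 18, max_years: int = 90) -> date:
    """A plausible birth date in the past."""
    return date.today() - timedelta(days=random.randint(min_years * 365, max_years * 365))

record = {
    "name": random.choice(FIRST_NAMES),
    "city": random.choice(CITIES),
    "sku": product_code(),
    "delivery": future_date(),
    "birth": past_date(),
}
```

Note the fixed seed: rerunning the generator reproduces the same dataset, which addresses the volatility problem that production data suffers from.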

It is important to have test data that tests all the corner cases (the domain of each data element) and the valid and invalid combinations (positive and negative testing). Plus, it must be a stable dataset. AIQ’s test data generation provides that stable dataset.
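A minimal sketch of what "corner cases plus valid and invalid combinations" can mean in practice, assuming a hypothetical quantity field with a domain of 1 to 100 and a size field where "XXXL" is taken to be unsupported:

```python
from itertools import product

def boundary_values(lo: int, hi: int) -> dict:
    """Corner cases of a numeric domain plus values just outside it."""
    return {
        "valid":   [lo, lo + 1, hi - 1, hi],   # should be accepted (positive tests)
        "invalid": [lo - 1, hi + 1],           # should be rejected (negative tests)
    }

qty = boundary_values(1, 100)

# Cross the quantity corners with the size domain so both positive and
# negative paths are exercised for every combination.
sizes = ["S", "M", "XL", "XXXL"]   # "XXXL" assumed unsupported here
cases = list(product(qty["valid"] + qty["invalid"], sizes))
```

Each pair in `cases` becomes one test input, with the expected outcome (accept or reject) known in advance because the data was designed rather than sampled.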


While using production data can be a tempting choice due to its expedience and realism, it comes with significant challenges. Anonymizing sensitive data and selecting relevant subsets are crucial steps to protect privacy and preserve data integrity, but even then the use of production data remains prone to failure. Instead, a well-designed, properly generated test dataset is essential for identifying and resolving issues in software applications without compromising user privacy or data accuracy.

