50 GPT-5.5 Prompts for Data Scientists: Machine Learning, Data Cleaning, and Statistical Analysis

50 GPT-5.5 Prompts for Data Scientists: Machine Learning, Data Cleaning, and Statistical Analysis
In the rapidly evolving landscape of data science, leveraging AI tools like GPT-5.5 to automate and enhance various stages of the data pipeline can dramatically improve productivity and insight quality. This comprehensive guide presents 50 production-ready GPT-5.5 prompts tailored specifically for data scientists. These prompts cover key domains including exploratory data analysis, data cleaning and preprocessing, feature engineering, model selection and training, as well as results interpretation and reporting. Each prompt is designed with clear context, customization placeholders, and expected output formats to ensure seamless integration into your data science workflows.
1. Exploratory Data Analysis (EDA)
Exploratory Data Analysis is the foundational step in any data science project, where you understand the dataset’s structure, detect patterns, and uncover relationships. Using GPT-5.5 for EDA can speed up profiling, distribution analysis, and correlation discovery by generating concise summaries and actionable insights.
Practical Tips for EDA Prompts: Always specify the dataset context and variables of interest. Request outputs in markdown tables or JSON to facilitate downstream automation. Use prompts that encourage GPT-5.5 to highlight anomalies or suggest visualization ideas.
Prompt 1: Dataset Profiling Summary
Context: Use this prompt to generate a high-level summary of dataset statistics including data types, missing values, and unique value counts.
Prompt:
"Generate a detailed profiling summary for the dataset [DATASET_NAME]. Include columns, their data types, percentage of missing values, unique value counts, and brief notes on any anomalies or irregularities. Format the output as a markdown table."
Expected Output Format: Markdown table with columns | Feature | Data Type | % Missing | Unique Values | Notes |
Prompt 2: Numerical Distribution Analysis
Context: To understand the distribution of a numerical feature, including skewness, kurtosis, and presence of outliers.
Prompt:
"Analyze the distribution of the numerical feature [FEATURE_NAME] from dataset [DATASET_NAME]. Provide summary statistics (mean, median, mode, std, variance), skewness, kurtosis, and identify outlier thresholds using IQR method. Suggest suitable visualizations."
Expected Output Format: JSON with statistics and a textual summary.
Prompt 3: Categorical Feature Frequency
Context: For frequency analysis and unique categories of a categorical variable.
Prompt:
"Provide a frequency distribution report for categorical feature [FEATURE_NAME] in dataset [DATASET_NAME]. Include category counts, percentages, and highlight any categories with very low frequency (below 1%)."
Expected Output Format: Markdown table with columns | Category | Count | Percentage |
Prompt 4: Pairwise Correlation Matrix
Context: To identify relationships among numerical features and detect multicollinearity.
Prompt:
"Calculate the pairwise Pearson correlation coefficients among numerical features [FEATURE_LIST] in dataset [DATASET_NAME]. Highlight strong correlations above 0.7 or below -0.7 and suggest potential feature redundancies."
Expected Output Format: Markdown table or CSV format showing correlation matrix.
Prompt 5: Correlation with Target Variable
Context: To find features most correlated with the target variable for feature selection.
Prompt:
"Analyze correlations between features [FEATURE_LIST] and the target variable [TARGET_VARIABLE] in dataset [DATASET_NAME]. Provide a sorted list of features by absolute correlation value and comment on potential predictive power."
Expected Output Format: Markdown list or table with | Feature | Correlation | Interpretation |
Prompt 6: Time Series Trend Summary
Context: For datasets with temporal dimension, summarize trends and seasonality in a time series feature.
Prompt:
"Perform a trend and seasonality analysis on time series feature [TIME_FEATURE] from dataset [DATASET_NAME]. Highlight any periodic patterns, anomalies, or shifts over time."
Expected Output Format: Text summary with recommended plots (e.g., line plot, seasonal decomposition).
Prompt 7: Multivariate Distribution Insights
Context: To understand interactions between two or more features.
Prompt:
"Analyze the joint distribution of features [FEATURE_1], [FEATURE_2], and [FEATURE_3] in dataset [DATASET_NAME]. Identify clusters, outliers, and potential dependencies."
Expected Output Format: Textual insights and suggestions for scatter plots or pair plots.
Prompt 8: Missing Data Pattern Detection
Context: Detect patterns or randomness in missing data across features.
Prompt:
"Identify and describe missing data patterns in dataset [DATASET_NAME]. Indicate if missingness is random or shows patterns related to specific features or values."
Expected Output Format: Text summary with a missing data matrix heatmap suggestion.
Prompt 9: Distribution Comparison Between Groups
Context: To compare feature distributions across categories or groups.
Prompt:
"Compare the distribution of feature [FEATURE_NAME] across groups defined by categorical feature [GROUP_FEATURE] in dataset [DATASET_NAME]. Provide statistical test results (e.g., ANOVA, Kruskal-Wallis) and interpretation."
Expected Output Format: Markdown report with test statistics and conclusions.
Prompt 10: Summary of Data Quality Issues
Context: To get a consolidated report of data quality issues in dataset.
Prompt:
"Generate a detailed report of data quality issues in dataset [DATASET_NAME], covering missing values, duplicates, inconsistent formats, and outliers. Provide remediation suggestions."
Expected Output Format: Bullet list with issue descriptions and recommended actions.
2. Data Cleaning & Preprocessing
Data cleaning and preprocessing are critical to ensure model performance and reliability. GPT-5.5 can assist in identifying missing values, detecting outliers, encoding categorical variables, and standardizing formats. Using precise prompts helps automate these repetitive tasks while maintaining transparency and control.
Practical Tips for Data Cleaning Prompts: Clearly specify the dataset and columns to clean. Ask GPT-5.5 to provide code snippets in Python or SQL to directly integrate cleaning logic. Request explanations for transformations so your team understands the rationale.
Prompt 11: Missing Value Imputation Strategy
Context: To recommend strategies for imputing missing data based on feature types and distributions.
Prompt:
"Suggest the most appropriate missing value imputation methods for features [FEATURE_LIST] in dataset [DATASET_NAME]. Consider numerical and categorical types separately, and justify each choice."
Expected Output Format: Markdown list with feature, data type, recommended imputation method, and rationale.
Prompt 12: Generate Python Code for Missing Value Imputation
Context: To obtain executable code for imputing missing values.
Prompt:
"Provide Python code using pandas and scikit-learn to impute missing values in dataset [DATASET_NAME]. Use median for numerical features [NUM_FEATURES] and mode for categorical features [CAT_FEATURES]."
Expected Output Format: Python code block ready for copy-paste.
Prompt 13: Outlier Detection and Treatment
Context: To detect and handle outliers in numerical data using statistical methods.
Prompt:
"Identify outliers in features [FEATURE_LIST] of dataset [DATASET_NAME] using the IQR and Z-score methods. Provide Python code to cap or remove outliers accordingly."
Expected Output Format: Python code with comments and textual explanation of thresholds used.
Prompt 14: Encoding Categorical Variables
Context: To choose and implement optimal encoding methods (one-hot, label encoding, target encoding).
Prompt:
"Recommend encoding techniques for categorical features [FEATURE_LIST] in dataset [DATASET_NAME] based on feature cardinality and model type. Generate Python code for the recommended encoding."
Expected Output Format: Markdown explanation and Python code snippet.
Prompt 15: Data Type Standardization
Context: To detect and correct inconsistent data types or formats within columns.
Prompt:
"Analyze dataset [DATASET_NAME] for inconsistent data types or formatting issues in features [FEATURE_LIST]. Provide Python code to convert columns to appropriate types and standardize formats (e.g., dates, strings)."
Expected Output Format: Python code with explanatory comments.
Prompt 16: Duplicate Record Identification and Removal
Context: To find and remove duplicate rows or near-duplicates effectively.
Prompt:
"Detect duplicate and near-duplicate records in dataset [DATASET_NAME]. Provide Python code to remove exact duplicates and flag near-duplicates based on features [FEATURE_LIST]."
Expected Output Format: Python code with method explanation.
Prompt 17: Normalize and Scale Numerical Features
Context: To prepare numerical data for models sensitive to scale.
Prompt:
"Recommend scaling or normalization methods for numerical features [FEATURE_LIST] in dataset [DATASET_NAME]. Provide Python code to apply Min-Max scaling and Standard scaling."
Expected Output Format: Markdown rationale and Python code snippet.
Prompt 18: Handling Imbalanced Data
Context: To suggest preprocessing techniques for datasets with imbalanced classes.
Prompt:
"Analyze the class distribution of target variable [TARGET_VARIABLE] in dataset [DATASET_NAME]. Recommend strategies such as SMOTE, undersampling, or class weighting, and provide example Python code."
Expected Output Format: Markdown summary and Python code snippet.
Prompt 19: Text Data Cleaning
Context: For preprocessing textual data including tokenization, stopword removal, and normalization.
Prompt:
"Generate Python code to clean and preprocess text feature [TEXT_FEATURE] in dataset [DATASET_NAME]. Include lowercasing, punctuation removal, stopword filtering, and lemmatization steps."
Expected Output Format: Python code with usage comments.
Prompt 20: Date-Time Feature Extraction
Context: To extract meaningful components from date-time features for modeling.
Prompt:
"Provide Python code to extract year, month, day, weekday, and hour from datetime feature [DATE_FEATURE] in dataset [DATASET_NAME]. Include handling for missing or malformed dates."
Expected Output Format: Python code snippet.
3. Feature Engineering
Feature engineering transforms raw data into informative inputs that improve model accuracy. GPT-5.5 can automate suggestions for feature creation, selection, and transformation based on dataset characteristics and modeling goals.
Practical Tips for Feature Engineering Prompts: Always specify the modeling task and target variable to get context-aware feature suggestions. Ask for code that can be directly integrated into pipelines. Request explanations about why certain features are valuable.
Prompt 21: New Feature Creation Ideas
Context: To generate domain-specific features based on existing columns.
Prompt:
"Suggest 5 new engineered features for dataset [DATASET_NAME] using existing features [FEATURE_LIST]. Explain how each new feature could improve model performance for predicting [TARGET_VARIABLE]."
Expected Output Format: Markdown list with feature name, formula or logic, and justification.
Prompt 22: Interaction Feature Generation
Context: To create interaction terms between numerical or categorical variables.
Prompt:
"Generate Python code to create interaction features between numerical features [NUM_FEATURE_LIST] and categorical features [CAT_FEATURE_LIST] in dataset [DATASET_NAME]. Explain when interactions are beneficial."
Expected Output Format: Python code and textual explanation.
Prompt 23: Polynomial Feature Expansion
Context: To enhance model capacity by adding polynomial terms.
Prompt:
"Provide Python code to generate polynomial features (degree 2) for numerical features [NUM_FEATURE_LIST] in dataset [DATASET_NAME]. Include interaction terms and standardized scaling."
Expected Output Format: Python code using scikit-learn’s PolynomialFeatures.
Prompt 24: Feature Selection Based on Importance
Context: To identify important features using model-based methods.
Prompt:
"Using dataset [DATASET_NAME], target [TARGET_VARIABLE], and features [FEATURE_LIST], provide Python code to compute feature importances via a Random Forest model. Suggest top 10 features to retain."
Expected Output Format: Python code and markdown table of features with importance scores.
Prompt 25: Dimensionality Reduction Techniques
Context: To reduce feature space using PCA or t-SNE for visualization or modeling.
Prompt:
"Generate Python code to apply PCA on numerical features [NUM_FEATURE_LIST] in dataset [DATASET_NAME]. Explain how many components to keep based on explained variance."
Expected Output Format: Python code and interpretive summary.
Prompt 26: Encoding Cyclical Features
Context: To correctly encode cyclical features like hours or months.
Prompt:
"Provide Python code to encode cyclical feature [CYCLICAL_FEATURE] in dataset [DATASET_NAME] using sine and cosine transformations. Explain why this encoding is preferred."
Expected Output Format: Python code snippet and explanation.
Prompt 27: Binning Continuous Variables
Context: To discretize continuous features into meaningful bins.
Prompt:
"Suggest binning strategies and provide Python code to discretize numerical feature [FEATURE_NAME] in dataset [DATASET_NAME] into 5 bins using quantiles or domain-specific thresholds."
Expected Output Format: Code and rationale for binning method.
Prompt 28: Text Feature Vectorization
Context: To convert text data into numeric vectors for modeling.
Prompt:
"Generate Python code to vectorize text feature [TEXT_FEATURE] from dataset [DATASET_NAME] using TF-IDF with a maximum of 1000 features. Include stopword removal and n-gram range (1,2)."
Expected Output Format: Python code ready for pipeline inclusion.
Prompt 29: Handling Multicollinearity
Context: To identify and address highly correlated features.
Prompt:
"Analyze multicollinearity among features [FEATURE_LIST] in dataset [DATASET_NAME]. Recommend features to drop or combine, and provide Python code to implement this."
Expected Output Format: Markdown explanation and Python code snippet.
Prompt 30: Encoding Rare Categories
Context: To handle rare categories in categorical features effectively.
Prompt:
"Provide strategies and Python code to group rare categories (less than 1% frequency) in categorical feature [FEATURE_NAME] of dataset [DATASET_NAME] into a single 'Other' category."
Expected Output Format: Code and explanation of impact on modeling.
4. Model Selection & Training
Choosing and tuning the right model is essential for predictive performance. GPT-5.5 can assist by comparing algorithms, generating hyperparameter tuning code, and suggesting validation strategies tailored to your dataset and problem.
Practical Tips for Model Selection Prompts: Specify problem type (classification/regression), dataset size, and features. Request code outputs in popular frameworks like scikit-learn for immediate experimentation. Ask for recommendations on validation techniques to avoid overfitting.
Prompt 31: Algorithm Comparison for Classification
Context: To compare baseline performance of multiple classification algorithms.
Prompt:
"Provide Python code to train and evaluate Logistic Regression, Random Forest, Gradient Boosting, and SVM classifiers on dataset [DATASET_NAME] using features [FEATURE_LIST] and target [TARGET_VARIABLE]. Include accuracy, precision, recall, and F1-score using 5-fold cross-validation."
Expected Output Format: Python code plus markdown table with performance metrics.
Prompt 32: Algorithm Comparison for Regression
Context: To benchmark multiple regression algorithms on a dataset.
Prompt:
"Generate Python code to compare Linear Regression, Random Forest Regressor, XGBoost, and SVR on dataset [DATASET_NAME], features [FEATURE_LIST], and target [TARGET_VARIABLE]. Evaluate using RMSE and R² with 5-fold cross-validation."
Expected Output Format: Python code and summary table.
Prompt 33: Hyperparameter Tuning with Grid Search
Context: To perform hyperparameter optimization for selected models.
Prompt:
"Provide Python code using GridSearchCV to tune hyperparameters for Random Forest classifier on dataset [DATASET_NAME] with features [FEATURE_LIST] and target [TARGET_VARIABLE]. Include parameters: n_estimators, max_depth, min_samples_split."
Expected Output Format: Python code snippet and explanation of best parameters.
Prompt 34: Hyperparameter Tuning with Randomized Search
Context: To efficiently search large hyperparameter spaces.
Prompt:
"Generate Python code for RandomizedSearchCV to tune XGBoost regressor hyperparameters on dataset [DATASET_NAME]. Include parameters: learning_rate, max_depth, subsample, colsample_bytree."
Expected Output Format: Python code and summary of tuning results.
Prompt 35: Cross-Validation Strategy Recommendation
Context: To choose the most appropriate cross-validation method for the dataset.
Prompt:
"Recommend and explain the best cross-validation strategy (e.g., KFold, StratifiedKFold, TimeSeriesSplit) for dataset [DATASET_NAME] with target variable [TARGET_VARIABLE]. Justify based on data characteristics."
Expected Output Format: Text explanation with example Python code.
Prompt 36: Early Stopping Implementation
Context: To prevent overfitting during model training with iterative algorithms.
Prompt:
"Provide Python code to implement early stopping in training a LightGBM model on dataset [DATASET_NAME] with features [FEATURE_LIST] and target [TARGET_VARIABLE]. Include validation split and stopping criteria."
Expected Output Format: Python code snippet.
Prompt 37: Ensemble Model Training
Context: To combine multiple models for improved accuracy.
Prompt:
"Generate Python code to build a stacking ensemble model combining Logistic Regression, Random Forest, and XGBoost classifiers on dataset [DATASET_NAME]. Include training, validation, and performance reporting."
Expected Output Format: Python code and performance summary.
Prompt 38: Model Training Automation Script
Context: To create a reusable script for model training and evaluation.
Prompt:
"Provide a Python script template to train any scikit-learn model on dataset [DATASET_NAME] with customizable features and target variables. Include data splitting, training, evaluation metrics, and model saving."
Expected Output Format: Complete Python script.
Prompt 39: Model Performance Benchmarking
Context: To benchmark model performance across multiple datasets or feature sets.
Prompt:
"Generate Python code to benchmark a given model (e.g., Random Forest) across datasets [DATASET_LIST] with features [FEATURE_LIST]. Report performance metrics and runtime for each."
Expected Output Format: Python code and markdown table with results.
Prompt 40: Feature Importance Visualization
Context: To visualize model feature importance for interpretability.
Prompt:
"Provide Python code to plot feature importance for a trained XGBoost model on dataset [DATASET_NAME]. Use bar plots sorted by importance scores."
Expected Output Format: Python code snippet producing matplotlib or seaborn plot.
5. Results Interpretation & Reporting
Interpreting model results and effectively communicating them to stakeholders are vital for data-driven decision making. GPT-5.5 can help generate explanations, visualizations, and polished reports that translate technical findings into actionable business insights.
Practical Tips for Interpretation Prompts: Request outputs in markdown or HTML for direct report integration. Ask for explanations tailored to non-technical stakeholders when needed. Combine textual explanations with visual aids.
Prompt 41: Model Explanation Summary
Context: To create a clear summary of model performance and behavior.
Prompt:
"Generate a comprehensive summary explaining the performance of the model trained on dataset [DATASET_NAME] with target [TARGET_VARIABLE]. Include key metrics, strengths, weaknesses, and potential biases."
Expected Output Format: Markdown report suitable for technical and non-technical readers.
Prompt 42: SHAP Value Interpretation
Context: To explain individual predictions using SHAP values.
Prompt:
"Provide Python code to compute and visualize SHAP values for model predictions on dataset [DATASET_NAME]. Include explanation of how to interpret the plots."
Expected Output Format: Python code with comments and textual guidance.
Prompt 43: Confusion Matrix Visualization
Context: To visually summarize classification model errors.
Prompt:
"Generate Python code to plot confusion matrix for classifier trained on dataset [DATASET_NAME] with features [FEATURE_LIST] and target [TARGET_VARIABLE]. Include normalized values and labels."
Expected Output Format: Python code snippet for matplotlib or seaborn plot.
Prompt 44: Regression Residual Plot
Context: To diagnose regression model fit and heteroscedasticity.
Prompt:
"Provide Python code to create residual plots for regression model predictions on dataset [DATASET_NAME]. Include interpretation of patterns that indicate issues with the model."
Expected Output Format: Python code and textual explanation.
Prompt 45: Stakeholder Report Template
Context: To generate a professional report summarizing data science project results.
Prompt:
"Create a stakeholder-friendly report template summarizing dataset [DATASET_NAME], modeling approach, results, and recommendations. Use clear language and include sections for visuals, key findings, and next steps."
Expected Output Format: Markdown or HTML formatted report template.
Prompt 46: Visualization Recommendations
Context: To suggest the most effective visualizations for communicating findings.
Prompt:
"Recommend 5 types of visualizations best suited to present model results and key data insights from dataset [DATASET_NAME]. Include reasons for each choice and suggested tools or libraries."
Expected Output Format: Markdown list with description.
Prompt 47: Explain Feature Impact on Predictions
Context: To describe how specific features influence model output.
Prompt:
"Explain in simple terms how feature [FEATURE_NAME] affects model predictions for target [TARGET_VARIABLE] based on dataset [DATASET_NAME]. Provide examples or scenarios."
Expected Output Format: Text explanation suitable for business stakeholders.
Prompt 48: Generate Executive Summary
Context: To provide a concise overview of the project outcomes for executives.
Prompt:
"Write a concise executive summary of the data science project using dataset [DATASET_NAME]. Highlight objectives, methodology, main results, and strategic recommendations."
Expected Output Format: 300-400 word polished text summary.
Prompt 49: Model Limitations and Risks
Context: To identify and communicate potential limitations and risks associated with the model.
Prompt:
"List and explain the main limitations, assumptions, and risks of the model trained on dataset [DATASET_NAME]. Suggest mitigation strategies where possible."
Expected Output Format: Markdown list with detailed descriptions.
Prompt 50: Automated Report Generation Script
Context: To automate comprehensive report creation combining text and visuals.
Prompt:
"Provide a Python script template that generates a complete data science report for dataset [DATASET_NAME], including EDA summaries, model performance, visualizations, and conclusions. Format output as an HTML or PDF file."
Expected Output Format: Python script with comments and library dependencies.
| Section | Number of Prompts | Key Focus Areas | Output Formats |
|---|---|---|---|
| Exploratory Data Analysis | 10 | Profiling, distribution, correlation, missing data patterns | Markdown tables, JSON, textual summaries |
| Data Cleaning & Preprocessing | 10 | Imputation, outlier detection, encoding, scaling | Python code snippets, markdown explanations |
| Feature Engineering | 10 | Feature creation, transformation, selection, dimensionality reduction | Python code, markdown lists, explanations |
| Model Selection & Training | 10 | Algorithm benchmarking, hyperparameter tuning, validation strategies | Python code, performance tables |
| Results Interpretation & Reporting | 10 | Model explanation, visualization, stakeholder communication | Markdown reports, Python visualization code, text summaries |
Integrating these 50 GPT-5.5 prompts into your data science workflows can significantly accelerate analysis, improve data quality, enhance feature sets, optimize model performance, and streamline reporting. Customize each prompt using the provided [VARIABLE] placeholders to fit your specific datasets and projects. For further learning on advanced prompt engineering and AI-enhanced data science, explore these resources:
For professionals looking to maximize their AI productivity, our comprehensive guide on Codex Sites Prompts Masterclass: 40 Advanced Prompts for Building SaaS Dashboards, Internal Tools, and Client Portals provides dozens of tested, production-ready prompts that deliver consistent results across complex workflows and real-world business scenarios.
Access 40,000+ AI Prompts for ChatGPT, Claude & Codex — Free!
Subscribe to get instant access to our complete Notion Prompt Library — the largest curated collection of prompts for ChatGPT, Claude, OpenAI Codex, and other leading AI models. Optimized for real-world workflows across coding, research, content creation, and business.
,
For professionals looking to maximize their AI productivity, our comprehensive guide on 50 GPT-5.5 Prompts for Startup Founders: Pitch Decks, Business Plans, Fundraising, and Growth Strategy provides dozens of tested, production-ready prompts that deliver consistent results across complex workflows and real-world business scenarios.
, and
Developers seeking hands-on implementation guidance will find our detailed walkthrough on The Death of $29/Month AI Coding: How Token-Based Billing Is Reshaping Developer Tools in 2026 invaluable for understanding the complete setup process, configuration options, and best practices for production deployment.
.


