Data Extraction From Tabular Data With ChatGPT

Pragnakalp Techlabs
5 min readFeb 22, 2024

Introduction

In the world of data analysis, extracting useful information from tabular data can be a difficult task. Conventional approaches typically require manual exploration and analysis of data, which can be requires a significant amount of effort, time, or workforce to complete.

The emergence of advanced language models such as ChatGPT has introduced a promising and innovative approach to extracting useful information from tabular data.

OpenAI announced an official API for ChatGPT which is powered by gpt-3.5-turbo, OpenAI’s most advanced language model.

This blog describes you the process of extracting useful information from tabular data using ChatGPT API.

Dataset

For all illustrations in this post, We will be utilizing the following data. These data are taken from the following: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams

Note that we have only taken into account 30 records from the dataset.

Please see the data provided below, which will be used for the purpose of this blog.

Data Extraction using ChatGPT API

ChatGPT relies solely on natural language processing (NLP) techniques to understand and extract information from tabular data. It can analyze the text-based input provided by the user, interpret the query, and generate a response based on the content of the tabular data.

Here’s an example of how you can utilize the ChatGPT API to extract information from tabular data:

Step 1: Prepare Input

We have stored our tabular data in a CSV file, you can read the CSV file using “Pandas” python library and pass the data to the ChatGPT API for information extraction.

import pandas as pd

read_csv=pd.read_csv("Student.csv")

Step 2: Use the ChatGPT API

Before we begin utilizing the ChatGPT API, please make sure that you have installed OpenAI Python library in your system.

pip install openai

You can extract information by providing the tabular data and input text to the ChatGPT API. This can be done by reading the tabular data from a CSV file, preparing the input for the API, and passing it along with the input text. The API will then extract the relevant information from the data and provide it in the response.

import openai
import pandas as pd

openai.api_key = '<YOUR OPENAI API KEY>'

read_csv = pd.read_csv("Student.csv")


input_text='''What is the average math score for male students?'''

prompt = """Please regard the following data:\n {}. Answer the following question and please return only value: {}""".format(read_csv, input_text)

request = openai.ChatCompletion.create(
model="gpt-3.5-turbo-0301",
messages=[
{"role": "user", "content": prompt},
]
)
result = request['choices'][0]['message']['content']

print("ChatGPT Response=>",result)

The following is the response that was received from the ChatGPT API:

Now let’s review the responses to a few questions based on the extracted information.

We observed that ChatGPT is incapable of performing aggregations, such as summing or averaging the table entries, and occasionally struggles to respond to basic questions.

You can add your table and test it right away using ChatGPT playground without using the API. Please visit the below link
https://chat.openai.com/

SQL-based Data Extraction from Database using ChatGPT API

We can now utilize ChatGPT again, but this time not for direct table analysis. Instead, we can use ChatGPT to generate SQL statements for a database that contains the data. SQL is capable of handling filtering, sorting, aggregation, and summation logic, which can help in performing the required calculations accurately.

We utilized SQLite as the database engine and employed the sqlite3 Python library to interact with it.

Step 1: Create SQLite database and table

Here’s a Python code that creates a database and a table in SQLite:

import sqlite3

# Connect to SQLite database (this will create a new database file if it doesn't exist)
conn = sqlite3.connect("chatgpt.db")
cursor = conn.cursor()

# Create a table
cursor.execute("""
CREATE TABLE IF NOT EXISTS student (
gender TEXT,
race TEXT,
parentallevelofeducation TEXT,
lunch TEXT,
testpreparationcourse TEXT,
mathscore INTEGER,
readingscore INTEGER,
writingscore INTEGER
)
""")

# Commit the transaction and close the connection
conn.commit()
conn.close()

Step 2: Adding Data to a Database

Here’s a Python code that read data from CSV file using pandas and inserts it into a SQLite database:

import sqlite3
import pandas as pd

df=pd.read_csv("Student.csv")
# Connect to SQLite database
conn = sqlite3.connect('chatgpt.db')

# Insert DataFrame into SQLite database
df.to_sql('student', conn, if_exists='replace', index=False)

# Close database connection
conn.close()

Step 3: Use ChatGPT API

You can extract information by providing the database table name, it’s corresponding columns and input text to the ChatGPT API. ChatGPT API will generate SQL query from given input text and then use that query to retrieve data from the database.

Here’s an example of how you can extract information using the ChatGPT API by providing the database table name, its corresponding columns, and input text:

import sqlite3
import openai

# Connect to SQLite database
conn = sqlite3.connect('chatgpt.db')
cursor = conn.cursor()


openai.api_key = '<YOUR OPENAI API KEY>'

# Function to get table columns from SQLite database
def get_table_columns(table_name):
cursor.execute("PRAGMA table_info({})".format(table_name))
columns = cursor.fetchall()
print(columns)
return [column[1] for column in columns]


# Function to generate SQL query from input text using ChatGPT
def generate_sql_query(table_name,text,columns):
prompt = """You are a ChatGPT language model that can generate SQL queries. Please provide a natural language input text, and I will generate the corresponding SQL query for you.The table name is {} and corresponding columns are {}.\nInput: {}\nSQL Query:""".format(table_name,columns,text)
print(prompt)
request = openai.ChatCompletion.create(
model="gpt-3.5-turbo-0301",
messages=[
{"role": "user", "content": prompt},
]
)
sql_query = request['choices'][0]['message']['content']
return sql_query

# Function to execute SQL query on SQLite database
def execute_sql_query(query):
cursor.execute(query)
result = cursor.fetchall()
return result


text="What is the average math score for male students?"

table_name = 'student'
columns = get_table_columns(table_name)
sql_query=generate_sql_query(table_name,text,columns)
print("Generated SQL query: ",sql_query)
if sql_query:
result=execute_sql_query(sql_query)
print("ChatGPT Response=>",result)

# Close database connection
cursor.close()
conn.close()

After sending the input text to the ChatGPT API, you will receive a response containing the generated SQL query. You can then use this query to retrieve the desired data from the database.

The response to this approach looks as below:

ChatGPT without SQL relies solely on NLP techniques to understand and generate responses based on tabular data, which may result in incorrect responses. However, when ChatGPT is combined with SQL capabilities, it can perform more advanced and flexible interactions with tabular data by directly executing SQL queries. This allows for more accurate and precise retrieval of information from databases, enhancing the overall capabilities of ChatGPT in dealing with tabular data.

--

--

Pragnakalp Techlabs

Chatbots Development, Python Programming, Natural Language Processing (NLP), Machine Learning Solutions. https://www.pragnakalp.com