Sunday, May 26, 2019

python 2.7 - Pandas for Large Data Sets: Millions of records

I have a Stata dataset with about 5.8 million rows (records).


I've been learning pandas the past few months and really enjoy its capabilities. Would pandas still work in this scenario?


I am having trouble reading the dataset into a DataFrame. I'm currently looking at chunking:

chunks = pd.read_stata('data.dta', chunksize=100000, columns=['year', 'race', 'app'])
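
As far as I understand, passing chunksize makes read_stata return a reader that can be looped over, with each chunk being an ordinary DataFrame. A rough sketch of what I mean (the column names are just the ones from my dataset):

import pandas as pd

# iterate over the file 100,000 rows at a time; each chunk is a regular DataFrame
for chunk in pd.read_stata('data.dta', chunksize=100000, columns=['year', 'race', 'app']):
    print(chunk.shape)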


Is there a better way to go about this? I am hoping to do something like:


df = pd.read_stata('data.dta')
data = df.groupby(['year', 'race']).agg(sum)
data.to_csv('data.csv')

but that does not work because (I think) the dataset is too large. The error is: OverflowError: Python int too large to convert to C long
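
If chunking is the way to go, I imagine the same aggregation could be done per chunk and then re-summed at the end. A rough sketch of what I have in mind, assuming that re-summing the per-chunk group sums gives the same totals as one big groupby (which should hold since the aggregation is just a sum):

import pandas as pd

partials = []
reader = pd.read_stata('data.dta', chunksize=100000, columns=['year', 'race', 'app'])
for chunk in reader:
    # sum 'app' within each (year, race) group for this chunk only
    partials.append(chunk.groupby(['year', 'race'])['app'].sum())

# combine the per-chunk results and sum again to get the overall totals
result = pd.concat(partials).groupby(level=['year', 'race']).sum()
result.to_csv('data.csv')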


Thanks. Cheers
