I have a dataset in Stata that is about 5.8 million rows (records).
I've been learning pandas for the past few months and really enjoy its capabilities. Would pandas still work in this scenario?
I am having trouble reading the dataset into a dataframe. I'm currently looking at chunking:
chunks = pd.read_stata('data.dta', chunksize=100000, columns=['year', 'race', 'app'])
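For reference, here is a rough sketch of how the chunked route might be carried through to the grouped totals. It assumes 'app' is a numeric column that should be summed, and the helper name aggregate_chunks is made up for illustration. It has not been confirmed that reading in chunks avoids the OverflowError.

```python
import pandas as pd


def aggregate_chunks(chunks, keys=("year", "race"), value="app"):
    """Sum `value` per group across an iterable of DataFrame chunks.

    `aggregate_chunks` is a hypothetical helper, not part of pandas.
    """
    partials = []
    for chunk in chunks:
        # Reduce each chunk to one row per group before keeping it,
        # so only small partial results stay in memory.
        partials.append(chunk.groupby(list(keys))[value].sum())
    # The same group can appear in several chunks, so sum the partial sums.
    combined = pd.concat(partials)
    return combined.groupby(level=list(keys)).sum().reset_index()


# Usage (assumes data.dta exists with the columns named in the question):
# reader = pd.read_stata("data.dta", chunksize=100000,
#                        columns=["year", "race", "app"])
# aggregate_chunks(reader).to_csv("data.csv", index=False)
```

Summing per chunk and then summing again works because addition can be regrouped freely; an aggregate such as a mean would need sums and counts tracked separately.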
Is there a better way to go about this? I am hoping to do something like:
df = pd.read_stata('data.dta')
data = df.groupby(['year', 'race']).agg(sum)
data.to_csv('data.csv')
but that does not work because (I think) the dataset is too large. It fails with: OverflowError: Python int too large to convert to C long
Thanks. Cheers