Sunday, May 26, 2019

python 2.7 - Pandas for Large Data Sets: Millions of records

I have a Stata dataset with about 5.8 million rows (records).


I've been learning pandas the past few months and really enjoy its capabilities. Would pandas still work in this scenario?


I am having trouble reading the dataset into a DataFrame. I'm currently looking at chunking:

chunks = pd.read_stata('data.dta', chunksize=100000, columns=['year', 'race', 'app'])
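
As far as I understand, passing chunksize makes read_stata return a reader that can be looped over, with each chunk being an ordinary DataFrame. A rough sketch of what I mean (the column names are just the ones from my dataset):

import pandas as pd

# iterate over the file 100,000 rows at a time; each chunk is a regular DataFrame
for chunk in pd.read_stata('data.dta', chunksize=100000, columns=['year', 'race', 'app']):
    print(chunk.shape)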


Is there a better way to go about this? I am hoping to do something like:


df = pd.read_stata('data.dta')
data = df.groupby(['year', 'race']).agg(sum)
data.to_csv('data.csv')

but that does not work because (I think) the dataset is too large. The error is: OverflowError: Python int too large to convert to C long
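
If chunking is the way to go, I imagine the same aggregation could be done per chunk and then re-summed at the end. A rough sketch of what I have in mind, assuming that re-summing the per-chunk group sums gives the same totals as one big groupby (which should hold since the aggregation is just a sum):

import pandas as pd

partials = []
reader = pd.read_stata('data.dta', chunksize=100000, columns=['year', 'race', 'app'])
for chunk in reader:
    # sum 'app' within each (year, race) group for this chunk only
    partials.append(chunk.groupby(['year', 'race'])['app'].sum())

# combine the per-chunk results and sum again to get the overall totals
result = pd.concat(partials).groupby(level=['year', 'race']).sum()
result.to_csv('data.csv')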


Thanks. Cheers
