Friday, March 2, 2018

pandas - Python still having issues with try-except clause

I am using the tld Python library to extract the first-level domain from proxy request logs with an apply function. When tld hits a malformed request that it cannot handle, such as 'http:1 CON' or 'http:/login.cgi%00', I get an error message like the following:

TldBadUrl: Is not a valid URL http:1 con!
TldBadUrl Traceback (most recent call last)
in engine
----> 1 new_fld_column = request_2['request'].apply(get_fld)

/usr/local/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)

2357 if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)()

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in get_fld(url,
fail_silently, fix_protocol, search_public, search_private, **kwargs)
385 fix_protocol=fix_protocol,
386 search_public=search_public,
--> 387 search_private=search_private

388 )

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in process_url(url, fail_silently, fix_protocol, search_public, search_private)
289 return None, None, parsed_url
290 else:
--> 291 raise TldBadUrl(url=url)
293 domain_parts = domain_name.split('.')

To work around this, it was suggested that I wrap the function in a try-except clause that returns NaN, so that the rows that error out can be found afterwards by querying for NaN:

import numpy as np
import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        # Mark rows that tld cannot parse so they can be queried later
        return np.nan

This seems to work for some of the requests, like "http:1 con" and "http:/login.cgi%00", but it then fails for "http://urnt12.knhc..txt/", where I get a different error message in a traceback like the one above:

TldDomainNotFound: Domain urnt12.knhc..txt didn't match any existing TLD name!

This is what the data looks like. There are 240,000 requests in total, in a dataframe called "request":

request count
0 24521
1 11521
2 6252
3 65225
4 7852222
5 12
6 http:1 CON 6
7 http:/login.cgi%00 45822

8 http://urnt12.knhc..txt/ 1

My code:

from tld import get_tld
from tld import get_fld
import pandas as pd
import numpy as np
#Read the data back into a dataframe

request = pd.read_csv('Proxy/Proxy_Analytics/Request_Grouped_By_Request_Count_12032018.csv')
#Remove rows where there were null values in the request column
request = request[pd.notnull(request['request'])]
#Find the urls that contain IP addresses and exclude them from the new dataframe
request = request[~request.request.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)]
#Reset index
request = request.reset_index(drop=True)

import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan

request['flds'] = request['request'].apply(try_get_fld)

#faulty_url_df = request[request['flds'].isna()]



It fails because a different exception is raised. Your except clause only handles tld.exceptions.TldBadUrl, but this URL raises TldDomainNotFound, which that clause does not catch.
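To see why the clause is skipped, here is a minimal sketch that runs without tld installed. The two exception classes are stand-ins defined in the snippet itself (they are not the real tld classes, and the real class hierarchy may differ), and raise_domain_not_found and caught_by are hypothetical helper names. It shows that an except clause only matches the named class or one of its parent classes, never a sibling class.

```python
# Stand-ins for tld.exceptions.TldBadUrl and tld.exceptions.TldDomainNotFound.
# They are defined here only for illustration; the real classes live in tld.
class TldBadUrl(Exception):
    pass


class TldDomainNotFound(Exception):
    pass


def raise_domain_not_found():
    """Hypothetical helper that fails the way the third URL does."""
    raise TldDomainNotFound("urnt12.knhc..txt")


def caught_by(exception_type):
    """Return True if `except exception_type` catches the error above."""
    try:
        try:
            raise_domain_not_found()
        except exception_type:
            return True
    except TldDomainNotFound:
        # The inner clause did not match, so the error propagated to here.
        return False


sibling_catches = caught_by(TldBadUrl)         # sibling class: no match
exact_catches = caught_by(TldDomainNotFound)   # exact class: match
parent_catches = caught_by(Exception)          # parent class: match
```

In the question's code there is no outer handler, so the unmatched exception propagates out of apply and stops the whole run.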

You can either make the except clause less specific, so that a single clause catches more exception types, or add another except clause to catch the other type of exception:

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan
    except tld.exceptions.TldDomainNotFound:
        print("Domain not found!")
        return np.nan
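If both failures should be treated the same way, the two clauses can also be written as one except clause with a tuple of exception types. The sketch below is only an illustration under stated assumptions: the exception classes and fake_get_fld are stand-ins defined in the snippet so that it runs without tld installed, and float("nan") is used in place of np.nan to avoid the numpy import. With the real library you would keep get_fld and list tld.exceptions.TldBadUrl and tld.exceptions.TldDomainNotFound in the tuple.

```python
import math


# Stand-ins for the two tld exceptions, defined here only for illustration.
class TldBadUrl(Exception):
    pass


class TldDomainNotFound(Exception):
    pass


def fake_get_fld(url):
    """Hypothetical stand-in for tld.get_fld that mimics the two failures."""
    prefix = "http://"
    if not url.startswith(prefix):
        raise TldBadUrl(url)
    host = url[len(prefix):].split("/")[0]
    if ".." in host or "." not in host:
        raise TldDomainNotFound(host)
    # Naive: keep the last two labels. The real library uses the TLD list.
    return ".".join(host.split(".")[-2:])


def try_get_fld(url, extractor=fake_get_fld):
    """Return the first-level domain, or NaN if either known error occurs."""
    try:
        return extractor(url)
    except (TldBadUrl, TldDomainNotFound):
        return float("nan")


urls = ["http://www.example.com/a", "http:1 CON", "http://urnt12.knhc..txt/"]
results = [try_get_fld(u) for u in urls]
```

Listing the exception types explicitly keeps unexpected errors visible, whereas a bare `except Exception` would also hide genuine bugs in the wrapper.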
