I have to get list of unique ASN Organizations for a country sorted by no. of IPs
Example - US has ~31K AS and Amazon has many AS in US. I want to get a list of AS organizations in US sorted by no. of IPs they own.
Here is what I have come up with, not tested, not a coder -
import pandas as pd
# Read the CSV file into a DataFrame
csv_file = 'your_csv_file.csv' # Replace with the actual file path
data = pd.read_csv(csv_file, delimiter='\t')
# Filter data for a specific country code
country_code = 'US'
filtered_data = data[data['country'] == country_code]
# Group by ASN organization and calculate the total number of IPs
grouped_data = filtered_data.groupby('asn_organization')['end_ip'].count().reset_index()
grouped_data.rename(columns={'end_ip': 'num_ips'}, inplace=True)
# Sort the data by the number of IPs in descending order
sorted_data = grouped_data.sort_values(by='num_ips', ascending=False)
# Specify the output file path
output_file = 'output.txt' # Replace with the desired file path
# Save the sorted ASN organizations and their respective total IP counts to a text file
with open(output_file, 'w') as file:
for index, row in sorted_data.iterrows():
asn_organization = row['asn_organization']
total_ips = filtered_data[filtered_data['asn_organization'] == asn_organization]['end_ip'].count()
file.write(f"ASN Organization: {asn_organization}\t Total Number of IPs: {total_ips}\n")
print(f"Results saved to {output_file}")
I have also attached the groupby AS_name output on the gist.
Please note something important. IPv6 ranges tend to be huge. So, the data is skewed and will not provide you with the insight you might be looking for.
So, for that reason, I highly recommend you to filter the data initially by IPv4 addresses with a statement like this:
# df containing only IPv4 ranges
df[df['end_ip'].str.contains(".", regex=False)]
In case the gist doesnât load:
import pandas as pd # For reading the CSV and working with the DB
import ipaddress # For working with IPs
df = pd.read_csv("./country_asn.csv", na_filter=False)
# df containing only IPv4 ranges
# df = df[df['end_ip'].str.contains(".", regex=False)]
# Filter the IP db by country
country = 'US'
country_df = df[df['country'] == country]
# Function to count IP addresses
def count_ips_fn(row):
start_ip = row['start_ip']
end_ip = row['end_ip']
convert_to_int = lambda ip: int(ipaddress.ip_address(str(ip)))
return convert_to_int(end_ip) - convert_to_int(start_ip) + 1
# Applying count function to creating the column `num_ips`
country_df['num_ips'] = country_df.apply(count_ips_fn, axis=1)
# Group By Sum
grouped_df = country_df.groupby('as_name')['num_ips'].sum()
The grouped_df contains your desired output. You can save the file with the following command:
> python ./ASN_Organization_List.py
...\ASN_Organization_List.py:24: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
country_df['num_ips'] = country_df.apply(count_ips_fn, axis=1)
Tried change line to this but still same results -
2.2 No. of IPs Column shows numbers like 9.89517002353123E+31
3 Question
Please note something important. IPv6 ranges tend to be huge. So, the data is skewed and will not provide you with the insight you might be looking for.
So, for that reason, I highly recommend you to filter the data initially by IPv4 addresses with a statement like this:
# df containing only IPv4 ranges
df[df['end_ip'].str.contains(".", regex=False)]
So donât use no of IPv6 IPs to sort but donât ignore ASN with IPv6
4 Also Updated code:
I added somethings like it asks for country code and changes the output file accordingly
# For reading the CSV and working with the DB
import pandas as pd
# For working with IPs
import ipaddress
df = pd.read_csv("./country_asn.csv", na_filter=False)
# df containing only IPv4 ranges
# df = df[df['end_ip'].str.contains(".", regex=False)]
country = input("Enter Country: ")
# Filter the IP db by country
country_df = df[df['country'] == country]
# Function to count IP addresses
def count_ips_fn(row):
start_ip = row['start_ip']
end_ip = row['end_ip']
convert_to_int = lambda ip: int(ipaddress.ip_address(str(ip)))
return convert_to_int(end_ip) - convert_to_int(start_ip) + 1
# Applying count function to creating the column `num_ips`
#country_df['num_ips'] = country_df.apply(count_ips_fn, axis=1)
country_df.loc[:, 'num_ips'] = country_df.apply(count_ips_fn, axis=1)
# Group By Sum
grouped_df = country_df.groupby('as_name')['num_ips'].sum()
#grouped_df contains desired output. save file with following command:
filename = f"{country}_AS_Organizations.csv"
grouped_df.reset_index().to_csv(filename, index=False)
No. of IPs Column shows numbers like 9.89517002353123E+31
Yep. That is because of IPv6 ranges. They are massive. That is why I recommended counting only IPv4 ranges.
Here is your code with slight update:
# For reading the CSV and working with the DB
import pandas as pd
# For working with IPs
import ipaddress
country_asn_df = pd.read_csv("./country_asn.csv", na_filter=False)
# df containing only IPv4 ranges
df = country_asn_df[country_asn_df['end_ip'].str.contains(".", regex=False)]
country = input("Enter Country: ")
# Filter the IP db by country
country_df = df[df['country'] == country].copy()
# Function to count IP addresses
def count_ips_fn(row):
start_ip = row['start_ip']
end_ip = row['end_ip']
convert_to_int = lambda ip: int(ipaddress.ip_address(str(ip)))
return convert_to_int(end_ip) - convert_to_int(start_ip) + 1
# Applying count function to creating the column `num_ips`
#country_df['num_ips'] = country_df.apply(count_ips_fn, axis=1)
country_df.loc[:, 'num_ips'] = country_df.apply(count_ips_fn, axis=1)
# Group By Sum
grouped_df = country_df.groupby('as_name')['num_ips'].sum()
#grouped_df contains desired output. save file with following command:
filename = f"{country}_AS_Organizations.csv"
grouped_df.reset_index().sort_values(by="num_ips").to_csv(filename, index=False)
Please note something important. IPv6 ranges tend to be huge. So, the data is skewed and will not provide you with the insight you might be looking for.
So, for that reason, I highly recommend you to filter the data initially by IPv4 addresses with a statement like this:
# df containing only IPv4 ranges
df[df['end_ip'].str.contains(".", regex=False)]
So using this will donât use no of IPv6 IPs to sort or also ignore ASN with IPv6 only ?
Another thing you can do is count IPv6 ranges separately. If you separate IPv4 and IPv6 you will get a better understanding of the ASN IP range ownership. To do this, replace this code:
What ? Let me rephrase - So using this will donât use no of IPv6 IPs to sort or will it also ignore ASNâs IPv6 which will cause problem if ASN is IPv6 only ?
You didnât understand me df[df['end_ip'].str.contains(".", regex=False)] ignore all IPv6 Range which may also exclude ASNs with only IPv6 because Ipv6 ranges are large and sorting them is a problem we should ignore itâs in counting IPs not ignore its creating list of ASN with the country
So using this will donât use no of IPv6 IPs to sort or will it also ignore ASNâs IPv6 which will cause problem if ASN is IPv6 only ?
Yes, this code only keeps the IPv4 ranges
df[df['end_ip'].str.contains(".", regex=False)] # "." is included only in IPv4
So, consequentially if any ASN is only using IPv6 their ranges will not be counted. This command only will include IPv4 ranges.
But, considering how messy IPv6 addresses are I recommended that you create an IPv6 dataset separately, where you donât include IPv4 ranges. For that I suggest replacing the previous command with the following:
df[df[âend_ipâ].str.contains(â:â, regex=False)] # ":" is included only in IPv6
By using the above command you can create a IPv6 only dataset.
This ignores all IPv6 which will ignore ASNs if they are only IPv6.
Reason to use df = country_asn_df[country_asn_df['end_ip'].str.contains(".", regex=False)] is that IPv6 is too big and it creates a problem (which I canât understand). Is there a way to consider IPv6 for Listing ASN but ignore them for counting no. of IPs
For IPv6-only ASNs, would it be acceptable for the total IP_COUNT value to be 0? In this case, the ASN would be listed in the output, but the IP_COUNT would be 0 to indicate that they only have IPv6 addresses.