Script to get list of ASN Organizations

I have to get list of unique ASN Organizations for a country sorted by no. of IPs

Example - US has ~31K AS and Amazon has many AS in US. I want to get a list of AS organizations in US sorted by no. of IPs they own.

Here is what I have come up with, not tested, not a coder -

import pandas as pd

# Read the CSV file into a DataFrame
csv_file = 'your_csv_file.csv'  # Replace with the actual file path
data = pd.read_csv(csv_file, delimiter='\t')

# Filter data for a specific country code
country_code = 'US'
filtered_data = data[data['country'] == country_code]

# Group by ASN organization and calculate the total number of IPs
grouped_data = filtered_data.groupby('asn_organization')['end_ip'].count().reset_index()
grouped_data.rename(columns={'end_ip': 'num_ips'}, inplace=True)

# Sort the data by the number of IPs in descending order
sorted_data = grouped_data.sort_values(by='num_ips', ascending=False)

# Specify the output file path
output_file = 'output.txt'  # Replace with the desired file path

# Save the sorted ASN organizations and their respective total IP counts to a text file
with open(output_file, 'w') as file:
    for index, row in sorted_data.iterrows():
        asn_organization = row['asn_organization']
        total_ips = filtered_data[filtered_data['asn_organization'] == asn_organization]['end_ip'].count()
        file.write(f"ASN Organization: {asn_organization}\t Total Number of IPs: {total_ips}\n")

print(f"Results saved to {output_file}")

1 Like

Hi!

Thank you for posting on the IPinfo community.

Here is the code:

:link: Script to get list of ASN Organizations · GitHub

I have also attached the groupby AS_name output on the gist.

Please note something important. IPv6 ranges tend to be huge. So, the data is skewed and will not provide you with the insight you might be looking for.

So, for that reason, I highly recommend you to filter the data initially by IPv4 addresses with a statement like this:

# df containing only IPv4 ranges
df[df['end_ip'].str.contains(".", regex=False)]

In case the gist doesn’t load:

import pandas as pd # For reading the CSV and working with the DB
import ipaddress # For working with IPs

df = pd.read_csv("./country_asn.csv", na_filter=False)

# df containing only IPv4 ranges
# df = df[df['end_ip'].str.contains(".", regex=False)]

# Filter the IP db by country
country = 'US'
country_df = df[df['country']  == country]

# Function to count IP addresses
def count_ips_fn(row):
    start_ip = row['start_ip']
    end_ip = row['end_ip']
    convert_to_int = lambda ip: int(ipaddress.ip_address(str(ip)))
    return convert_to_int(end_ip) - convert_to_int(start_ip) + 1
                                               

# Applying count function to creating the column `num_ips`
country_df['num_ips'] = country_df.apply(count_ips_fn, axis=1)


# Group By Sum
grouped_df = country_df.groupby('as_name')['num_ips'].sum()

The grouped_df contains your desired output. You can save the file with the following command:

grouped_df.reset_index().to_csv("count_ips_us_as.csv", index=False)

Let me know if you have any questions. Thanks!

1 Getting this error -

> python ./ASN_Organization_List.py
...\ASN_Organization_List.py:24: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  country_df['num_ips'] = country_df.apply(count_ips_fn, axis=1)

Tried change line to this but still same results -

country_df.loc[:, 'num_ips'] = country_df.apply(count_ips_fn, axis=1)

2 List is not sorted

2.2 No. of IPs Column shows numbers like 9.89517002353123E+31

3 Question

Please note something important. IPv6 ranges tend to be huge. So, the data is skewed and will not provide you with the insight you might be looking for.
So, for that reason, I highly recommend you to filter the data initially by IPv4 addresses with a statement like this:

# df containing only IPv4 ranges
df[df['end_ip'].str.contains(".", regex=False)]

So don’t use no of IPv6 IPs to sort but don’t ignore ASN with IPv6

4 Also Updated code:

I added somethings like it asks for country code and changes the output file accordingly

# For reading the CSV and working with the DB
import pandas as pd
# For working with IPs
import ipaddress

df = pd.read_csv("./country_asn.csv", na_filter=False)

# df containing only IPv4 ranges
# df = df[df['end_ip'].str.contains(".", regex=False)]

country = input("Enter Country: ")

# Filter the IP db by country
country_df = df[df['country']  == country]

# Function to count IP addresses
def count_ips_fn(row):
    start_ip = row['start_ip']
    end_ip = row['end_ip']
    convert_to_int = lambda ip: int(ipaddress.ip_address(str(ip)))
    return convert_to_int(end_ip) - convert_to_int(start_ip) + 1                                               

# Applying count function to creating the column `num_ips`
#country_df['num_ips'] = country_df.apply(count_ips_fn, axis=1)
country_df.loc[:, 'num_ips'] = country_df.apply(count_ips_fn, axis=1)

# Group By Sum
grouped_df = country_df.groupby('as_name')['num_ips'].sum()

#grouped_df contains  desired output. save file with following command:
filename = f"{country}_AS_Organizations.csv"
grouped_df.reset_index().to_csv(filename, index=False)

A value is trying to be set on a copy of a slice from a DataFrame.

Oh yeah. My bad. I was using Jupyter notebook which usually doesn’t show copy errors like this.

:link: python - SettingWithCopyWarning even when using .loc[row_indexer,col_indexer] = value - Stack Overflow

No. of IPs Column shows numbers like 9.89517002353123E+31

Yep. That is because of IPv6 ranges. They are massive. That is why I recommended counting only IPv4 ranges.

Here is your code with slight update:

# For reading the CSV and working with the DB
import pandas as pd
# For working with IPs
import ipaddress

country_asn_df = pd.read_csv("./country_asn.csv", na_filter=False)

# df containing only IPv4 ranges
df = country_asn_df[country_asn_df['end_ip'].str.contains(".", regex=False)]


country = input("Enter Country: ")

# Filter the IP db by country
country_df = df[df['country']  == country].copy()

# Function to count IP addresses
def count_ips_fn(row):
    start_ip = row['start_ip']
    end_ip = row['end_ip']
    convert_to_int = lambda ip: int(ipaddress.ip_address(str(ip)))
    return convert_to_int(end_ip) - convert_to_int(start_ip) + 1                                               

# Applying count function to creating the column `num_ips`
#country_df['num_ips'] = country_df.apply(count_ips_fn, axis=1)
country_df.loc[:, 'num_ips'] = country_df.apply(count_ips_fn, axis=1)

# Group By Sum
grouped_df = country_df.groupby('as_name')['num_ips'].sum()

#grouped_df contains  desired output. save file with following command:
filename = f"{country}_AS_Organizations.csv"
grouped_df.reset_index().sort_values(by="num_ips").to_csv(filename, index=False)

Please note something important. IPv6 ranges tend to be huge. So, the data is skewed and will not provide you with the insight you might be looking for.
So, for that reason, I highly recommend you to filter the data initially by IPv4 addresses with a statement like this:

# df containing only IPv4 ranges
df[df['end_ip'].str.contains(".", regex=False)]

So using this will don’t use no of IPv6 IPs to sort or also ignore ASN with IPv6 only ?

@Abdullah

Yes.

Another thing you can do is count IPv6 ranges separately. If you separate IPv4 and IPv6 you will get a better understanding of the ASN IP range ownership. To do this, replace this code:

df[df[‘end_ip’].str.contains(“.”, regex=False)]

with this:

df[df[‘end_ip’].str.contains(“:”, regex=False)]

Because IPv4 contains . and IPv6 contains :. With this replacement code, you will get only IPv6 ranges.

Yes

What ? Let me rephrase - So using this will don’t use no of IPv6 IPs to sort or will it also ignore ASN’s IPv6 which will cause problem if ASN is IPv6 only ?

You didn’t understand me df[df['end_ip'].str.contains(".", regex=False)] ignore all IPv6 Range which may also exclude ASNs with only IPv6 because Ipv6 ranges are large and sorting them is a problem we should ignore it’s in counting IPs not ignore its creating list of ASN with the country

So using this will don’t use no of IPv6 IPs to sort or will it also ignore ASN’s IPv6 which will cause problem if ASN is IPv6 only ?

Yes, this code only keeps the IPv4 ranges

df[df['end_ip'].str.contains(".", regex=False)] # "." is included only in IPv4

So, consequentially if any ASN is only using IPv6 their ranges will not be counted. This command only will include IPv4 ranges.

But, considering how messy IPv6 addresses are I recommended that you create an IPv6 dataset separately, where you don’t include IPv4 ranges. For that I suggest replacing the previous command with the following:

df[df[‘end_ip’].str.contains(“:”, regex=False)] # ":" is included only in IPv6

By using the above command you can create a IPv6 only dataset.

How to - just ignore IPv6 for sorting but use it for creating ASN list ?

Hmmm. Not sure. When you are counting the IP ranges, the function count_ips_fn can count IPs for both IPv4 and IPv6 addresses.

Maybe you didn’t understand. I don’t understand what you said -

When you are counting the IP ranges, the function count_ips_fn can count IPs for both IPv4 and IPv6 addresses.

df = country_asn_df[country_asn_df['end_ip'].str.contains(".", regex=False)]

This ignores all IPv6 which will ignore ASNs if they are only IPv6.
Reason to use df = country_asn_df[country_asn_df['end_ip'].str.contains(".", regex=False)] is that IPv6 is too big and it creates a problem (which I can’t understand). Is there a way to consider IPv6 for Listing ASN but ignore them for counting no. of IPs

For IPv6-only ASNs, would it be acceptable for the total IP_COUNT value to be 0? In this case, the ASN would be listed in the output, but the IP_COUNT would be 0 to indicate that they only have IPv6 addresses.

That is would be be OK but what if the script first finds IPv6 only ASNs list them separately then continue with IPv4 only script like before

That is would be be OK but what if the script first finds IPv6 only ASNs list them separately then continue with IPv4 only script like before

The count operation function could be modified while you keep the IPs as is:

def count_ips_fn(row):
    if "." in row['start_ip']:
        start_ip = row['start_ip']
        end_ip = row['end_ip']
        convert_to_int = lambda ip: int(ipaddress.ip_address(str(ip)))
        return convert_to_int(end_ip) - convert_to_int(start_ip) + 1      
    else:
        return 0

The code is untested though.

import pandas as pd
# For working with IPs
import ipaddress

country_asn_df = pd.read_csv("./country_asn.csv", na_filter=False)

# df containing only IPv4 ranges
# df = country_asn_df[country_asn_df['end_ip'].str.contains(".", regex=False)]
# df containing only IPv6 ranges
# df = country_asn_df[country_asn_df['end_ip'].str.contains(":", regex=False)]


country = input("Enter Country (eg. US): ")

# Filter the IP db by country
country_df = df[df['country']  == country].copy()

# Function to count IP addresses
# Old FUntion
#def count_ips_fn(row):
#    start_ip = row['start_ip']
#    end_ip = row['end_ip']
#    convert_to_int = lambda ip: int(ipaddress.ip_address(str(ip)))
#    return convert_to_int(end_ip) - convert_to_int(start_ip) + 1   
# New FUntion
def count_ips_fn(row):
    if "." in row['start_ip']:
        start_ip = row['start_ip']
        end_ip = row['end_ip']
        convert_to_int = lambda ip: int(ipaddress.ip_address(str(ip)))
        return convert_to_int(end_ip) - convert_to_int(start_ip) + 1      
    else:
        return 0                                            

# Applying count function to creating the column `num_ips`
#country_df['num_ips'] = country_df.apply(count_ips_fn, axis=1)
country_df.loc[:, 'num_ips'] = country_df.apply(count_ips_fn, axis=1)

# Group By Sum
grouped_df = country_df.groupby('as_name')['num_ips'].sum()

#grouped_df contains  desired output. save file with following command:
filename = f"{country}_AS_Organizations.csv"
grouped_df.reset_index().sort_values(by="num_ips").to_csv(filename, index=False)```
    country_df = df[df['country']  == country].copy()
                 ^^
NameError: name 'df' is not defined

This is a basic Python issue. Please kindly read the documentation:

Built-in Exceptions — Python 3.11.5 documentation