Distribution Table

In the trader product, distribution table for price is shown. This page explains how this table is created.

Frontend

The table can be formed for a specific crop variety for all mandi or for a specific mandi-crop-variety level.

Concept

Show price distribution based on percentile.

If there are extreme higher or lower prices then isolate them in different bins.

Code

To see the code, please expand this.
def calculate_crop_price_data(prices):

    # Calculate the minimum and maximum prices from the dataset

    min_price = min(prices)
    max_price = max(prices)
    
    # Calculate the first, second (median), and third quartiles of the prices

    q1, q2, q3 = np.percentile(prices, [25, 50, 75])
    
    # Calculate the Interquartile Range (IQR) as the difference between Q3 and Q1

    iqr = q3 - q1
    
    # Determine the lower and upper whiskers for outlier identification

    lower_whisker = q1 - 1.5 * iqr
    upper_whisker = q3 + 1.5 * iqr
    
    # Initialize the list to store bin ranges
    
    bins = []
    
    # If the max price is an outlier, create a bin for outliers above the upper whisker

    if max_price > upper_whisker:
        bins.append((max_price, upper_whisker))
        bins.append((upper_whisker, q3))
    else: # Otherwise, the highest bin goes from the max price to Q3
        bins.append((max_price, q3))
        
    # Add the middle bins based on quartiles

    bins.append((q3, q2))
    bins.append((q2, q1))
    
    # If the min price is an outlier, create a bin for outliers below the lower whisker

    if min_price < lower_whisker:
        bins.append((q1, lower_whisker))
        bins.append((lower_whisker, min_price))
    else: # Otherwise, the lowest bin goes from Q1 to the min price
        bins.append((q1, min_price))
        
    # Initialize list to store the count of prices in each bin
    
    bin_data = []
    
    # Loop through each bin to count how many prices fall within its range
    
    for bin_bounds in bins:
    
     # For the first bin (which might contain the max price), include the upper limit in the count
    
    if bin_bounds == bins[0]: # This condition checks if we're dealing with the first bin
        count = sum(1 for price in prices if bin_bounds[1] <= price <= bin_bounds[0])
    else: # For all other bins, the upper limit is exclusive
        count = sum(1 for price in prices if bin_bounds[1] <= price < bin_bounds[0])
    
    # Append the bin information along with the count to the bin_data list
    
    bin_data.append(
        {
            "lower_limit": math.floor(bin_bounds[1]), # Use floor to ensure the lower limit is an integer
            "upper_limit": math.ceil(bin_bounds[0]) - 1, # Use ceil then subtract 1 to make upper limit exclusive
            "count": count,
        }
    )
    
    # Compile the calculated data into a dictionary for easy access
    
    crop_price_data = {
        "min_price": min_price, # Minimum price in the dataset
        "max_price": max_price, # Maximum price in the dataset
        "lower_whisker_equals_min": str(min_price == lower_whisker), # Check if min price is the lower whisker
        "upper_whisker_equals_max": str(max_price == upper_whisker), # Check if max price is the upper whisker
        "bins": bin_data, # List of bins with their counts
    }
    
    # Return the compiled data
    
    return crop_price_data
    

Definations

  • Quartiles (Q1, Q2, Q3): These divide the data into four equal parts. Q1 is the median of the lower half of the data, Q2 is the median of all data, and Q3 is the median of the upper half.

  • Interquartile Range (IQR): This is the difference between Q3 and Q1 and represents the middle 50% of the data.

  • Whiskers: These extend from Q1 and Q3 to the lowest and highest data points within 1.5 times the IQR. They help in identifying outliers.

  • Bins are defined intervals or ranges of crop prices.

  • Counting Logic: The counting logic involves determining how many data points (crop prices) fall within each bin's range. The key here is to ensure that each price is counted exactly once, respecting the inclusive or exclusive nature of bin boundaries.

  • No Overlap: By ensuring that the upper limit of one bin is the starting point of another and treating the upper limit as exclusive (except for the last bin), each price is categorized uniquely. This method avoids the possibility of a price being counted in two bins.

  • Complete Coverage: Inclusive lower bounds and an inclusive upper bound for the last bin ensure that all data points are counted, including the extremes.

  • Loops and Conditions: The for loop iterates over each bin, and the if condition inside the loop applies the correct counting logic based on the bin's position (general or last).

  • Summation and Comparison: The sum(1 for price in prices if condition) pattern is a Pythonic way to count items satisfying a condition, translating directly from the mathematical notion of conditional counting.

Bin Boundaries

Bins Based on Quartiles and Whiskers:

  • Bins are initially set up to capture the range of prices from extreme values (min and max prices) through the distribution (quartiles and whiskers).

  • The bins are created to ensure no overlap, where the end of one bin is the start of another. This is critical for avoiding double-counting.

Inclusive and Exclusive Boundaries:

  • The lower boundary of bins is inclusive, meaning if a price is equal to the lower boundary, it's counted in the bin.

  • The upper boundary of bins (except the last one) is exclusive, meaning prices equal to the upper boundary are not included in the bin but rather in the next bin up.

Counting Prices Within Bins

Iterating Over Bins:

  • The process iterates over each bin, applying a condition to count prices within the bin's boundaries.

Condition for Counting:

  • For most bins, prices are counted if they are greater than or equal to the lower boundary and less than the upper boundary (bin_bounds[1] <= price < bin_bounds[0]).

  • For the last bin, which includes the max price, the upper boundary is inclusive (bin_bounds[1] <= price <= bin_bounds[0]). This ensures the max price is always included.

Creating Bin Data for Output

Recording Bin Information:

  • For each bin, the lower limit, upper limit, and count of prices within those limits are recorded.

  • Limits are adjusted (floor for lower, ceil for upper) to ensure they are integers, reflecting the discrete nature of price counts.

Last updated