I need to detect scraping of info on my website. I tried detection based on behavior patterns, and it seems promising, although relatively computation-heavy.

The basic idea is to collect the request timestamps of a given client and compare its behavior pattern with a common or precomputed pattern.

To be more precise, I collect the time intervals between requests into an array, indexed by a function of time: i = (integer) ln(interval + 1) / ln(N + 1) * N + 1, where N is the time (count) limit; intervals greater than N are dropped.

Then, after I have collected enough of them in X and Y, it is time to make a decision. The criterion is the parameter C: C = sqrt(sum((X/norm(X) - Y/norm(Y))^2) / k), where X is the data for a given client, Y is the common data, norm() is a calibration function, and k is a normalization coefficient depending on the type of norm(). For example, norm(X) = max(X) and k is the square root of the number of non-empty elements of X.

C is in the range (0, 1): 0 means there is no behavior deviation and 1 is the maximum deviation. Calibration of type 1 is best for repeating requests, type 2 for repeating requests with a few intervals, type 3 for non-constant request intervals.

What do you think? I'd appreciate it if you'd try this on your services.

Answer:

If you are asking specifically about the validity of your algorithm: it isn't bad, but it seems like you are overcomplicating it. You should use the basic methodologies already employed by WAFs to rate-limit connections. One such algorithm that already exists is the leaky bucket algorithm.

As far as rate limiting to stop web scraping goes, there are two flaws in trying to rate-limit connections. First is people's ability to use proxy networks or Tor to anonymize each request; even off-the-shelf scraping software uses a huge block of IPs and rotates through them to solve this problem. The other issue is that you could potentially block people using a shared IP: companies and universities often use NATs, and your algorithm could mistake them for one person.

For full disclosure, I am a cofounder of Distil Networks, and we often poke holes in WAF features like rate limiting. We pitch that a more comprehensive solution is required, hence the need for our service.
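The interval-bucketing and deviation metric described in the question can be sketched roughly as follows. This is not code from the thread; the function names (`bucket_index`, `histogram`, `deviation`) are illustrative, the indexing is 0-based rather than the question's 1-based form, and only the type-1 calibration (norm = max, k = sqrt of non-empty buckets of X) is shown:

```python
import math

def bucket_index(interval: float, n: int) -> int:
    """Map an inter-request interval to a logarithmic bucket in [0, n-1].
    Intervals greater than n are dropped (returns -1), per the question."""
    if interval > n:
        return -1
    return min(n - 1, int(math.log(interval + 1) / math.log(n + 1) * n))

def histogram(intervals, n: int):
    """Build a client's histogram (X or Y) of bucketed inter-request intervals."""
    h = [0] * n
    for t in intervals:
        i = bucket_index(t, n)
        if i >= 0:
            h[i] += 1
    return h

def deviation(x, y) -> float:
    """C = sqrt(sum((X/norm(X) - Y/norm(Y))^2) / k), with norm = max
    and k = sqrt(number of non-empty buckets of X) -- type-1 calibration."""
    nx, ny = max(x) or 1, max(y) or 1
    k = math.sqrt(sum(1 for v in x if v) or 1)
    return math.sqrt(sum((a / nx - b / ny) ** 2 for a, b in zip(x, y)) / k)
```

A client that always requests at the same interval concentrates its mass in one bucket, so comparing it against itself yields C = 0, while comparing it against a client with scattered intervals yields a larger C.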
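For comparison, the leaky bucket algorithm the answer recommends can be sketched as below. This is a minimal illustrative version, not Distil's or any particular WAF's implementation; the class name and parameters are assumptions:

```python
import time

class LeakyBucket:
    """Minimal leaky bucket rate limiter: the bucket drains at `rate`
    requests per second and holds at most `capacity` queued requests."""

    def __init__(self, rate: float, capacity: float, now=None):
        self.rate = rate
        self.capacity = capacity
        self.level = 0.0
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None) -> bool:
        """Return True if this request fits in the bucket, False to rate-limit it."""
        now = time.monotonic() if now is None else now
        # Drain the bucket for the time elapsed since the last request.
        self.level = max(0.0, self.level - max(0.0, now - self.last) * self.rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False  # bucket full: reject this request
```

One bucket per client IP gives simple per-client rate limiting, which is exactly where the shared-IP (NAT) objection in the answer bites: everyone behind the NAT shares one bucket.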