multithreading - Python & web scraping performance
I'm attempting Python-based web scraping where execution time is pretty critical.
I've tried PhantomJS, Selenium, and PyQt4 so far, and all three libraries have given me similar response times. I'd post example code, but the problem affects all three, so I believe it either lies in a shared dependency or outside of my code. At around 50 concurrent requests, I see a huge degradation in response time: it takes roughly 40 seconds for 50 pages, and the time grows exponentially with larger page demands. Ideally I'm looking for ~200+ requests in 10 seconds. I used multiprocessing to spawn each instance of PhantomJS/PyQt4/Selenium, so each URL request gets its own instance and I'm not blocked by single threading.
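A minimal sketch of that per-URL multiprocessing pattern (assuming Selenium's PhantomJS driver is installed and the PhantomJS binary is on the PATH; the URLs and pool size here are placeholders, not from the original post):

```python
# One PhantomJS instance per URL, fanned out across worker processes
# so no single process serializes the fetches.
from multiprocessing import Pool
from selenium import webdriver

def fetch(url):
    driver = webdriver.PhantomJS()  # spawns a full headless browser per call
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(50)]  # placeholder
    pool = Pool(processes=50)
    try:
        pages = pool.map(fetch, urls)
    finally:
        pool.close()
        pool.join()
```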
I don't believe it's a hardware bottleneck: it's running on 32 dedicated CPU cores totaling 64 threads, and CPU usage doesn't typically spike above 10-12%. Bandwidth sits reasonably comfortably at around 40-50% of total throughput.
I've read about the GIL, which I believe I've addressed by using multiprocessing. Is web scraping just an inherently slow thing? Should I stop expecting to pull ~200 webpages in ~10 seconds?
My overall question is: what is the best approach to high-performance web scraping, when evaluating JS on the webpage is a requirement?
"evaluating js on webpage requirement" <- think problem right here. downloading 50 web pages trivially parallelized , should take long slowest server takes respond. now, spawning 50 javascript engines in parallel (which guess doing) run scripts on every page different matter. imagine firing 50 chrome browsers @ same time.
Anyway: profile and measure the parts of your application to find where the bottleneck lies. Then you can see whether you're dealing with an I/O bottleneck (sounds unlikely), a CPU bottleneck (more likely), or a global lock somewhere that serializes things (also possible, but impossible to tell without the code posted).
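A crude but effective first measurement is to time the two phases separately for a single page, raw download versus a full PhantomJS render (this sketch assumes `requests` and Selenium's PhantomJS driver; the URL is a placeholder):

```python
# Compare network-only time against network-plus-JS-rendering time.
# If the second number dominates, the JS engines are the bottleneck.
import time
import requests
from selenium import webdriver

url = "http://example.com/"  # placeholder

start = time.time()
requests.get(url, timeout=10)  # network-only phase
print("download only: %.2fs" % (time.time() - start))

start = time.time()
driver = webdriver.PhantomJS()  # network + JS-evaluation phase
driver.get(url)
driver.quit()
print("download + JS render: %.2fs" % (time.time() - start))
```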