multithreading - Python & web scraping performance


I'm attempting Python-based web scraping where execution time is pretty critical.

I've tried PhantomJS, Selenium, and PyQt4 so far, and all three libraries have given me similar response times. I'd post example code, but since the problem affects all three, I believe it either lies in a shared dependency or outside my code entirely. At around 50 concurrent requests, I see a huge degradation in response time: it takes 40 seconds for 50 pages, and the time gets exponentially worse with greater page demands. Ideally I'm looking for ~200+ requests in 10 seconds. I used multiprocessing to spawn each instance of PhantomJS/PyQt4/Selenium, so each URL request gets its own instance and I'm not blocked by single threading.
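For reference, the per-URL-instance pattern described above looks roughly like this. This is a minimal sketch, not the asker's actual code, assuming an older Selenium release (3.x or earlier) that still ships the PhantomJS driver; the URL list is a placeholder:

```python
from multiprocessing import Pool
from selenium import webdriver

def scrape(url):
    # Each worker process spawns its own browser instance,
    # so no single driver blocks the others.
    driver = webdriver.PhantomJS()
    try:
        driver.get(url)
        return driver.page_source  # HTML after JS has executed
    finally:
        driver.quit()

if __name__ == "__main__":
    urls = ["http://example.com/page%d" % i for i in range(50)]
    with Pool(50) as pool:          # one process (and one browser) per URL
        pages = pool.map(scrape, urls)
```

Note that this launches 50 full browser engines at once, which matters for the answer below.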

I don't believe it's a hardware bottleneck: it's running on 32 dedicated CPU cores (64 threads in total), and CPU usage doesn't typically spike above 10-12%. Bandwidth sits reasonably comfortably at around 40-50% of total throughput.

I've read about the GIL, and I believe I've addressed it by using multiprocessing. Is web scraping just an inherently slow thing? Should I stop expecting to pull ~200 webpages in ~10 seconds?

My overall question is: what is the best approach to high-performance web scraping when evaluating the JS on each webpage is a requirement?

"evaluating js on webpage requirement" <- think problem right here. downloading 50 web pages trivially parallelized , should take long slowest server takes respond. now, spawning 50 javascript engines in parallel (which guess doing) run scripts on every page different matter. imagine firing 50 chrome browsers @ same time.

Anyway: profile and measure the parts of your application to find where the bottleneck lies. Then you can see whether you're dealing with an I/O bottleneck (sounds unlikely), a CPU bottleneck (more likely), or a global lock somewhere that serializes everything (also possible, but impossible to tell without seeing your code).
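A simple way to start measuring is to time the fetch and the JS-rendering phases separately inside each worker. A sketch, where `render_js` is a hypothetical placeholder for whatever engine call you actually make:

```python
import time
import requests

def timed_scrape(url):
    t0 = time.perf_counter()
    html = requests.get(url).text   # raw download (I/O-bound)
    t1 = time.perf_counter()
    page = render_js(html)          # placeholder for your JS-evaluation step (CPU-bound)
    t2 = time.perf_counter()
    print(f"{url}: fetch {t1 - t0:.2f}s, render {t2 - t1:.2f}s")
    return page
```

If the render column dwarfs the fetch column across workers, the JS engines are your bottleneck rather than the network.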

