Ideas

This is a list of ideas for student applications.

If you are a student, learn how to participate!

Scrapy

Static Analysis Tooling

Easy
Description

Some common mistakes in code that uses Scrapy are hard to detect, such as a typo in the name of a setting.

Expected Result

Build a list of common issues in code using Scrapy that could be detected using static code analysis, and build a tool or extend an existing tool to detect those.
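
As an illustration of the kind of check such a tool could perform, here is a minimal sketch that flags module-level names in a settings file that closely resemble a known setting without matching it. The `KNOWN_SETTINGS` subset, the `find_setting_typos` helper, and the 0.8 similarity cutoff are assumptions made for this example, not part of Scrapy or of any existing tool.

```python
import ast
import difflib

# Assumption: a small hand-picked subset of Scrapy setting names.
# A real tool would need the complete, version-specific list.
KNOWN_SETTINGS = {
    "CONCURRENT_REQUESTS",
    "DOWNLOAD_DELAY",
    "ROBOTSTXT_OBEY",
    "USER_AGENT",
    "FEEDS",
}


def find_setting_typos(source, cutoff=0.8):
    """Return (line, name, suggestion) tuples for upper-case module-level
    assignments that look like a misspelled known setting."""
    findings = []
    candidates = sorted(KNOWN_SETTINGS)
    for node in ast.parse(source).body:
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            # Only plain upper-case names are treated as settings.
            if not isinstance(target, ast.Name) or not target.id.isupper():
                continue
            if target.id in KNOWN_SETTINGS:
                continue
            close = difflib.get_close_matches(
                target.id, candidates, n=1, cutoff=cutoff
            )
            if close:
                findings.append((node.lineno, target.id, close[0]))
    return findings


example = "DOWNLAOD_DELAY = 2\nUSER_AGENT = 'demo'\nBOT_NAME = 'demo'\n"
for line, name, suggestion in find_setting_typos(example):
    print(f"line {line}: {name!r} looks like a typo of {suggestion!r}")
```

Names that are merely absent from the list, such as `BOT_NAME` here, are not reported; only near-misses are, which keeps false positives low for custom settings.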

Required Skills Regular Expressions
Mentors Adrian
GitHub Issue #4421

MIME Sniffing Library

Intermediate
Description

HTTP responses should include a Content-Type header that indicates the MIME type of the response body. However, responses do not always include such a header, and sometimes they include it but the specified MIME type does not really match the response body.

Expected Result

Create a Python library that implements the complete MIME Sniffing Standard.
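
To make the idea concrete, here is a minimal sketch of signature-based sniffing for a handful of types. It is a heavy simplification: the MIME Sniffing Standard defines many more patterns and has rules about when a declared type may be overridden. The `sniff_mime` function and the `SIGNATURES` table are hypothetical names used only for illustration.

```python
# Leading byte signatures for a few well-known formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]


def sniff_mime(body, declared=None):
    """Guess a MIME type from the first bytes of a response body.

    Falls back to the declared Content-Type, or to
    application/octet-stream when nothing is known.
    Assumption: a matching signature always wins over the declared type,
    which is simpler than what the standard specifies.
    """
    for signature, mime in SIGNATURES:
        if body.startswith(signature):
            return mime
    return declared or "application/octet-stream"


# A PNG body served with a misleading Content-Type header.
print(sniff_mime(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8, declared="text/html"))
```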

Stretch Goals

Integrate the resulting library into Scrapy.

Required Skills HTTP, Interface Design
Mentors Adrian
GitHub Issue #4240

Scrapy FEEDS enhancements

Intermediate
Description

This is a collection of small improvements that have been requested over time for Scrapy’s FEEDS delivery.

Expected Result

Improve Scrapy’s FEEDS delivery capabilities.

Required Skills Compression, Interface Design, Some knowledge of Scrapy internals
Mentors Julio
GitHub Issue #4963

Handle 429s properly

Easy
Description

Scrapy currently does not handle HTTP 429 (Too Many Requests) responses properly. Whenever a 429 response is received, Scrapy should update its throttling configuration and concurrency to adapt to the rate the server allows.

Expected Result

A new middleware/extension that will handle 429 response codes and adjust request rates properly.
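
One possible building block for such a component is sketched below: it reads the standard `Retry-After` header, which may carry either a number of seconds or an HTTP date, and falls back to exponential backoff when the header is missing. The helper names `retry_after_seconds` and `next_delay`, the backoff factor, and the delay cap are assumptions for this example; the sketch does not use or modify any Scrapy API.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value into a delay in seconds.

    Returns None when the header is missing or cannot be parsed.
    """
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat dates without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def next_delay(current_delay, retry_after=None, factor=2.0, max_delay=60.0):
    """Pick the download delay to use after a 429 response.

    Honors Retry-After when present; otherwise multiplies the current
    delay by `factor`. The result never drops below the current delay
    and never exceeds `max_delay`.
    """
    wait = retry_after_seconds(retry_after)
    if wait is None:
        wait = max(current_delay, 0.1) * factor
    return min(max(wait, current_delay), max_delay)


print(next_delay(1.0, retry_after="5"))  # server asked for 5 seconds
print(next_delay(1.0))                   # no header: back off exponentially
```

In a real middleware or extension, the computed delay would still have to be applied per domain and coordinated with the existing throttling settings, which this sketch leaves out.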

Required Skills HTTP
Mentors Julio
GitHub Issue #4424

Dateparser

Performance Optimizations

Intermediate
Description

We believe there is much room for improvement in the performance of Dateparser. Moreover, the current implementation is not thread-safe.

Expected Result

Profile and optimize the performance of the library.

Stretch Goals

Make the library thread-safe.

Required Skills Profiling, Algorithms, Data Structures, Multithreading
Mentors Adrian
GitHub Issue #624

Better Language Detection

Intermediate
Description

Currently language detection is rudimentary and often causes incorrect interpretation of dates.

Expected Result

Improve how language detection works. Plugging in an optional language detection library is one possible approach.

Required Skills Natural Language Processing
Mentors Adrian
GitHub Issue #612

Parsel

HTML5 Support

Easy
Description

When you inspect a website element in a web browser, you get a DOM-based HTML tree that is different from the actual, underlying HTML tree. This makes it difficult to translate what you find in a web browser into an XPath or CSS expression that works in Parsel, especially when the underlying HTML is broken.

Expected Result

Extend Parsel to support different HTML parsers, and add support for additional HTML parsers.

Required Skills HTML, Interface Design
Mentors Andrey, Adrian
GitHub Issue #83

cssselect

Extend CSS Selectors Level 4 Support

Advanced
Description

There is a W3C working draft for CSS Selectors Level 4 that adds many new selector features.

Expected Result

Extend cssselect to support the additional CSS Selectors Level 4 features that can be translated into XPath 1.0.

Required Skills CSS, XPath 1.0, Syntax Parsing
Mentors Andrey, Adrian
GitHub Issue #108