Ideas
This is a list of ideas for student applications.
If you are a student, learn how to participate!
Scrapy
Static Analysis Tooling
Description | Some common mistakes in code that uses Scrapy are hard to detect, for example a typo in the name of a setting. |
Expected Result | Build a list of common issues in code using Scrapy that could be detected using static code analysis, and build a tool or extend an existing tool to detect those. |
Required Skills | Regular Expressions |
Mentors | Adrian |
GitHub Issue | #4421 |
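As a rough illustration of the setting-name example, the sketch below walks the syntax tree of a settings module and flags upper-case names that closely resemble a known setting but do not match one. It is only one possible approach, not the project's design: `KNOWN_SETTINGS` is a hand-picked subset of Scrapy setting names, and `find_setting_typos` and the 0.85 similarity cutoff are hypothetical choices made for this example.

```python
import ast
import difflib

# Assumption: a tiny, hand-picked subset of Scrapy setting names.
# A real tool would need the full list.
KNOWN_SETTINGS = {
    "CONCURRENT_REQUESTS",
    "DOWNLOAD_DELAY",
    "ROBOTSTXT_OBEY",
    "USER_AGENT",
}


def find_setting_typos(source):
    """Return (line, name, suggestion) for likely misspelled settings.

    Only module-level assignments to upper-case names are inspected.
    """
    findings = []
    for node in ast.parse(source).body:
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            if not (isinstance(target, ast.Name) and target.id.isupper()):
                continue
            if target.id in KNOWN_SETTINGS:
                continue
            # Hypothetical cutoff: high enough to skip unrelated custom names.
            close = difflib.get_close_matches(
                target.id, sorted(KNOWN_SETTINGS), n=1, cutoff=0.85
            )
            if close:
                findings.append((node.lineno, target.id, close[0]))
    return findings


example = "DOWNLAOD_DELAY = 2\nUSER_AGENT = 'bot'\nMY_CUSTOM = 1\n"
for line, name, suggestion in find_setting_typos(example):
    print(f"line {line}: {name} looks like a typo of {suggestion}")
```

Custom settings such as `MY_CUSTOM` are left alone because they are not similar enough to any known name; a real tool would also have to cover settings passed through `custom_settings` or the command line.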
MIME Sniffing Library
Description | HTTP responses should include a |
Expected Result | Create a Python library that implements the complete MIME Sniffing Standard. |
Stretch Goals | Integrate the resulting library into Scrapy. |
Required Skills | HTTP, Interface Design |
Mentors | Adrian |
GitHub Issue | #4240 |
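To give a feel for the kind of logic such a library contains, here is a minimal sketch that guesses a MIME type from the first bytes of a response body. It covers only a handful of well-known byte signatures and is a heavy simplification of the MIME Sniffing Standard, which also defines masked patterns, text and markup detection, and rules for when a declared type may be overridden. The names `SIGNATURES` and `sniff_mime_type` are hypothetical.

```python
# A few well-known leading byte signatures and the types they indicate.
SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
)


def sniff_mime_type(body, declared=None):
    """Return the declared type if there is one, else guess from the body.

    Simplification for this sketch: a declared type is always trusted.
    """
    if declared:
        return declared
    for prefix, mime_type in SIGNATURES:
        if body.startswith(prefix):
            return mime_type
    # Fallback for bodies that match no known signature.
    return "application/octet-stream"


print(sniff_mime_type(b"GIF89a\x01\x00\x01\x00"))
print(sniff_mime_type(b"%PDF-1.7", declared="text/html"))
```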
Scrapy FEEDS enhancements
Description | This is a collection of small improvements to Scrapy’s FEEDS delivery that have been requested over time. |
Expected Result | Improve Scrapy’s FEEDS delivery capabilities. |
Required Skills | Compression, Interface Design, some knowledge of Scrapy internals |
Mentors | |
GitHub Issue | #4963 |
Handle 429s properly
Description | Scrapy does not currently handle 429 (Too Many Requests) responses properly. Whenever it receives a 429 response, it should update its throttling configuration and concurrency to adapt to the new rate. |
Expected Result | A new middleware/extension that will handle 429 response codes and adjust request rates properly. |
Required Skills | HTTP |
Mentors | |
GitHub Issue | #4424 |
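The rate-adaptation idea can be sketched independently of Scrapy. The class below is a hypothetical helper, not an existing Scrapy component: it raises the delay between requests when a 429 arrives, honouring a `Retry-After` value given in seconds, and slowly lowers it again after other responses. The class name, parameters, and default factors are all assumptions made for illustration; a real middleware or extension would have to apply the resulting delay to Scrapy's per-domain download slots.

```python
class AdaptiveThrottle:
    """Track a request delay that backs off on 429 and recovers otherwise."""

    def __init__(self, delay=1.0, max_delay=60.0, backoff=2.0, recovery=0.5):
        self.min_delay = delay      # never go below the starting delay
        self.max_delay = max_delay  # never wait longer than this
        self.backoff = backoff      # multiplier applied on a bare 429
        self.recovery = recovery    # multiplier applied on other responses
        self.delay = delay

    def on_response(self, status, retry_after=None):
        """Update and return the delay after seeing a response status."""
        if status == 429:
            if retry_after is not None:
                # Only the "seconds" form of Retry-After is handled here;
                # the header may also carry an HTTP date.
                wanted = max(self.delay, float(retry_after))
            else:
                wanted = self.delay * self.backoff
            self.delay = min(wanted, self.max_delay)
        else:
            self.delay = max(self.min_delay, self.delay * self.recovery)
        return self.delay


throttle = AdaptiveThrottle(delay=1.0)
print(throttle.on_response(429))                    # back off
print(throttle.on_response(429, retry_after="10"))  # honour Retry-After
print(throttle.on_response(200))                    # recover gradually
```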
Dateparser
Performance Optimizations
Description | We believe there is much room for improvement in the performance of Dateparser. Moreover, the current implementation is not thread-safe. |
Expected Result | Profile and optimize the performance of the library. |
Stretch Goals | Make the library thread-safe. |
Required Skills | Profiling, Algorithms, Data Structures, Multithreading |
Mentors | Marc, Adrian |
GitHub Issue | #624 |
Better Language Detection
Description | Currently language detection is rudimentary and often causes incorrect interpretation of dates. |
Expected Result | Improve how language detection works. Plugging in an optional language detection library is an option. |
Required Skills | Natural Language Processing |
Mentors | Marc, Adrian |
GitHub Issue | #612 |
Date Search Improvements
Description | There is a long list of issues that affect the |
Expected Result | Make a plan to refactor and improve the function, fixing some of those issues. |
Stretch Goals | Fix even more of those issues. |
Required Skills | API Design, Regular Expressions |
Mentors | Marc, Adrian |
GitHub Issue | #897 |
Parsel
HTML5 Support
Description | When you inspect a website element in a web browser, you get a DOM-based HTML tree that is different from the actual, underlying HTML tree. This makes it difficult to translate what you find in a web browser into an XPath or CSS expression that can work in Parsel. More so when the underlying HTML is actually broken. |
Expected Result | Extend Parsel to support different HTML parsers, and add support for additional HTML parsers. |
Required Skills | HTML, Interface Design |
Mentors | Andrey, Adrian |
GitHub Issue | #83 |
cssselect
Extend CSS Selectors Level 4 Support
Description | There is a W3C working draft for additional CSS selectors that adds many features |
Expected Result | Extend cssselect to support additional CSS Selectors Level 4 that can be translated into XPath 1.0. |
Required Skills | CSS, XPath 1.0, Syntax Parsing |
Mentors | Andrey, Adrian |
GitHub Issue | #108 |
number-parser
Integrate number-parser into Dateparser and price-parser
Description | number-parser should make it possible to improve the capabilities of Dateparser and price-parser. |
Expected Result | Integrate number-parser into Dateparser and price-parser as a dependency that brings support for natural language numbers to those libraries. |
Stretch Goals | Extend the features of number-parser. |
Required Skills | API Design, Regular Expressions, Software Architecture |
Mentors | Marc, Adrian |
GitHub Issue | #61 |
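To make "natural language numbers" concrete, the sketch below converts simple English number phrases into integers, the kind of preprocessing that would let a date or price string such as "twenty-one" be read as 21. It does not use or reproduce number-parser's API; `words_to_number` and its word tables are hypothetical and cover English only.

```python
# Word tables for this sketch only (English, integers up to the millions).
UNITS = {
    word: value
    for value, word in enumerate(
        "zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split()
    )
}
TENS = {
    word: (index + 2) * 10
    for index, word in enumerate(
        "twenty thirty forty fifty sixty seventy eighty ninety".split()
    )
}
SCALES = {"thousand": 1_000, "million": 1_000_000}


def words_to_number(text):
    """Convert a phrase such as 'three hundred forty-two' to an integer."""
    total = 0    # value of the completed thousand/million groups
    current = 0  # value of the group being read
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"unknown number word: {word!r}")
    return total + current


print(words_to_number("two thousand and twenty-one"))
```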