Achieve 200 contributions in Apache Airflow

Category: Tech

This is more of a personal reflection, and I really doubt it will benefit anyone. But it's my blog anyway; I can write whatever I want, lol.

There are actually 202 now, but I took the screenshot when the count hit 200, so I'll just keep 200 in the title.

(Screenshot taken 2024-07-30 at 5:15 PM, showing 200 PRs)

The merged pull request count is now at 166 (+2). I'm unsure where the remaining 34 contributions come from; perhaps from the suggestions I provided on the PRs I reviewed. The 2 newly merged PRs are:

  1. Add dataset alias unique constraint and remove wrong dataset alias removing logic
  2. set "has_outlet_datasets" to true if "dataset alias" exists

200 PRs seems like a good opportunity for me to reflect on what I have done on the Airflow project since I joined Astronomer. There might be typos in the PR titles, but I'll keep them as they are. I've tried to group related work into subgroups; things that can't easily be categorized go into the "misc" sections.

The PR count might appear higher than the value I actually added (and it is). This is due to my development habits: whenever possible, I prefer to keep commits small and clean, since that makes it easier to revert if I did something dumb and wrong. However, I must admit I probably created too many PRs for the Azure managed identity feature. Ideally, those feature PRs should have included documentation updates as well, but yep, I was eager to land the feature first. It was also suggested to have separate PRs for each Airflow provider, even if it's basically the same feature.

After joining Astronomer

Add "DatasetAlias" for creating datasets or dataset events at runtime

  1. Check dataset_alias in inlets when use it to retrieve inlet_evnets
  2. Add string representation to dataset alias
  3. add example dag for dataset_alias
  4. add test case test_dag_deps_datasets_with_duplicate_dataset
  5. Extend dataset dependencies
  6. Extend get datasets endpoint to include dataset aliases
  7. Retrieve inlet dataset events through dataset aliases
  8. Link dataset event to dataset alias
  9. Scheduling based on dataset aliases
  10. Add DatasetAlias to support dynamic Dataset Event Emission and Dataset Creation
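The core idea behind this group of PRs is that a task can declare an alias as an outlet up front, and only attach the concrete dataset it produced at runtime. Here's a minimal stdlib sketch of that resolution step (hypothetical classes for illustration, not Airflow's actual implementation):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Dataset:
    uri: str

@dataclass
class DatasetAlias:
    name: str
    # concrete datasets attached to this alias at runtime
    datasets: set = field(default_factory=set)

    def add(self, dataset: Dataset) -> None:
        self.datasets.add(dataset)

# Declared at DAG-definition time, before the output URI is known...
alias = DatasetAlias("my-alias")
# ...and resolved at runtime, when the actual output path exists.
alias.add(Dataset(uri="s3://bucket/2024-07-30/output.parquet"))

# The scheduler can then resolve alias -> datasets to trigger
# downstream DAGs that declared a dependency on the alias.
resolved = {d.uri for d in alias.datasets}
print(resolved)  # {'s3://bucket/2024-07-30/output.parquet'}
```

This is why the work above touches scheduling, dependency extraction, and the datasets endpoint: every place that previously assumed a dataset URI was known at parse time now also has to handle aliases that resolve later.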

Start task execution directly from the trigger

  1. fix: add argument include_xcom in method rsolve an optional value
  2. Add start execution from trigger support for existing core sensors
  3. Enhance start_trigger_args serialization
  4. State the limitation of the newly added start execution from trigger feature
  5. add next_kwargs to StartTriggerArgs
  6. Add start execution from triggerer support to dynamic task mapping
  7. Prevent start trigger initialization in scheduler
  8. Starts execution directly from triggerer without going to worker (PR of the month)

Add REST API endpoint to manipulate queued dataset events

  1. add section "Manipulating queued dataset events through REST API"
  2. add "queuedEvent" endpoint to get/delete DatasetDagRunQueue

Upgrade apache-airflow-providers-weaviate to 2.0.0 for weaviate-client >= 4.4.0 support

  1. extract collection_name from system tests and make them unique
  2. fix weaviate system tests
  3. Upgrade to weaviate-client to v4

Improve trigger stability by adding "return" after "yield"

  1. add "return" statement to "yield" within a while loop in amazon triggers
  2. add "return" statement to "yield" within a while loop in dbt triggers
  3. add "return" statement to "yield" within a while loop in google triggers
  4. add "return" statement to "yield" within a while loop in azure triggers
  5. add "return" statement to "yield" within a while loop in http triggers
  6. add "return" statement to "yield" within a while loop in sftp triggers
  7. add "return" statement to "yield" within a while loop in airbyte triggers
  8. add "return" statement to "yield" within a while loop in core triggers
  9. retrieve dataset event created through RESTful API when creating dag run
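The fix behind this group is mechanical but easy to miss: a trigger's `run` method is an async generator, and if it yields its final event from inside a `while` loop without returning, control comes back into the loop and the trigger can keep polling or emit spurious events. A minimal stdlib sketch of the pattern (simulated polling, not Airflow code):

```python
import asyncio

async def wait_for_job(poll_interval: float = 0.01):
    # Simulates a trigger's `run` method: poll until the external job
    # finishes, then yield exactly one success event. Without the
    # `return` after the `yield`, the `while` loop would resume and
    # the trigger could keep running after its one expected event.
    attempts = 0
    while True:
        attempts += 1
        if attempts >= 3:  # pretend the external job finished
            yield {"status": "success", "attempts": attempts}
            return  # the fix: end the generator after the event
        await asyncio.sleep(poll_interval)

async def main():
    return [event async for event in wait_for_job()]

events = asyncio.run(main())
print(events)  # [{'status': 'success', 'attempts': 3}]
```

With the `return` in place, the generator is exhausted after the first event, which is exactly the "one event, then stop" contract these provider triggers were supposed to honor.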

Contribute astronomer-providers functionality to apache-airflow providers

  1. add repair_run support to DatabricksRunNowOperator in deferrable mode
  2. remove redundant else block in DatabricksExecutionTrigger
  3. add reuse_existing_run for allowing DbtCloudRunJobOperator to reuse existing run
  4. fix how GKEPodAsyncHook.service_file_as_context is used
  5. add service_file support to GKEPodAsyncHook
  6. reword GoogleBaseHookAsync as GoogleBaseAsyncHook in docstring
  7. add WasbPrefixSensorTrigger params breaking change to azure provider changelog
  8. style(providers/google): improve BigQueryInsertJobOperator type hinting
  9. Check cluster state before defer Dataproc operators to trigger
  10. Fix WasbPrefixSensor arg inconsistency between sync and async mode
  11. avoid retrying after KubernetesPodOperator has been marked as failed
  12. check sagemaker training job status before deferring SageMakerTrainingOperator
  13. check transform job status before deferring SageMakerTransformOperator
  14. check ProcessingJobStatus status before deferring SageMakerProcessingOperator
  15. add deferrable mode to RedshiftDataOperator
  16. add use_regex argument for allowing S3KeySensor to check s3 keys with regular expression
  17. add deferrable mode to RedshiftClusterSensor
  18. check job_status before BatchOperator execute in deferrable mode
  19. remove event['message'] call in EmrContainerOperator.execute_complete|as the key message no longer exists
  20. Check redshift cluster state before deferring to triggerer
  21. handle tzinfo in S3Hook.is_keys_unchanged_async
  22. add type annotations to Amazon provider "execute_coplete" methods
  23. iterate through blobs before checking prefixes

Add Azure managed identities support to apache-airflow-providers-microsoft-azure

  1. setting use_async=True for get_async_default_azure_credential
  2. add managed identity support to AsyncDefaultAzureCredential
  3. Refactor azure managed identity
  4. add managed identity support to fileshare hook
  5. add managed identity support to synapse hook
  6. add managed identity support to azure datalake hook
  7. add managed identity support to azure batch hook
  8. add managed identity support to wasb hook
  9. add managed identity support to adx hook
  10. add managed identity support to asb hook
  11. add managed identity support to azure cosmos hook
  12. add managed identity support to azure data factory hook
  13. add managed identity support to azure container volume hook
  14. add managed identity support to azure container registry hook
  15. add managed identity support to azure container instance hook
  16. Reuse get_default_azure_credential method from Azure utils for Azure key valut
  17. make DefaultAzureCredential configurable in AzureKeyVaultBackend
  18. Make DefaultAzureCredential in AzureBaseHook configuration
  19. docs(providers/microsoft): improve documentation for AzureContainerVolumeHook DefaultAzureCredential support
  20. docs(providers/microsoft): improve documentation for WasbHook DefaultAzureCredential support
  21. docs(providers/microsoft): improve documentation for AzureCosmosDBHook DefaultAzureCredential support
  22. docs(providers/microsoft): improve documentation for AzureFileShareHook DefaultAzureCredential support
  23. docs(providers/microsoft): improve documentation for AzureBatchHook DefaultAzureCredential support
  24. docs(providers/microsoft): improve documentation for AzureBaseHook DefaultAzureCredential support
  25. docs(providers/microsoft): improve documentation for Azure Service Bus hooks DefaultAzureCredential support
  26. docs(providers/microsoft): improve documentation for AzureDataExplorerHook DefaultAzureCredential support
  27. docs(providers/microsoft): improve documentation for AzureDataLakeStorageV2Hook DefaultAzureCredential support
  28. docs(providers/microsoft): improve documentation for AzureDataLakeHook DefaultAzureCredential support
  29. docs(providers/microsoft): improve documentation for AzureContainerRegistryHook DefaultAzureCredential support
  30. feat(providers/microsoft): add AzureContainerInstancesOperator.volume as a template field
  31. test(providers/microsfot): add system test for AzureContainerVolumeHook and AzureContainerRegistryHook
  32. docs(providers): replace markdown style link with rst style link for amazon and apache-beam
  33. test(providers/microsoft): add test cases to AzureContainerInstanceHook
  34. Add DefaultAzureCredential support to AzureContainerRegistryHook
  35. feat(providers/microsoft): add DefaultAzureCredential support to AzureContainerVolumeHook
  36. Add AzureBatchOperator example
  37. test(providers/microsoft): add test case for AzureIdentityCredentialAdapter.signed_session
  38. fix(providers/azure): remove json.dumps when querying AzureCosmosDBHook
  39. feat(providers/azure): allow passing fully_qualified_namespace and credential to initialize Azure Service Bus Client
  40. feat(providers/microsoft): add DefaultAzureCredential support to AzureBatchHook
  41. feat(providers/microsoft): add DefaultAzureCredential support to AzureContainerInstanceHook
  42. feat(providers/microsoft): add DefaultAzureCredential support to cosmos
  43. feat(providers/microsoft): add DefaultAzureCredential to data_lake

Make all existing sensors respect the "soft_fail" argument in BaseSensorOperator

  1. respect soft_fail argument when exception is raised for google sensors
  2. respect soft_fail argument when exception is raised for microsoft-azure sensors
  3. respect soft_fail argument when exception is raised for flink sensors
  4. respect soft_fail argument when exception is raised for jenkins sensors
  5. respect soft_fail argument when exception is raised for celery sensors
  6. Fix inaccurate test case names in providers
  7. respect soft_fail argument when exception is raised for datadog sensors
  8. respect soft_fail argument when exception is raised for http sensors
  9. respect soft_fail argument when exception is raised for sql sensors
  10. respect soft_fail argument when exception is raised for sftp sensors
  11. respect soft_fail argument when exception is raised for spark-kubernetes sensors
  12. respect soft_fail argument when exception is raised for google-marketing-platform sensors
  13. respect soft_fail argument when exception is raised for dbt sensors
  14. respect soft_fail argument when exception is raised for tableau sensors
  15. respect soft_fail argument when exception is raised for ftp sensors
  16. respect soft_fail argument when exception is raised for alibaba sensors
  17. respect soft_fail argument when exception is raised for airbyte sensors
  18. respect soft_fail argument when exception is raised for amazon sensors
  19. respect "soft_fail" argument when running BatchSensor in deferrable mode
  20. Respect "soft_fail" for core async sensors
  21. Respect "soft_fail" argument when "poke" is called
  22. respect soft_fail argument when ExternalTaskSensor runs in deferrable mode
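The change these PRs applied is the same few lines everywhere: on an error path, raise a skip exception instead of a failure exception when `soft_fail` is set. An illustrative sketch with stub exception classes (standing in for `airflow.exceptions`; not the providers' actual code):

```python
class AirflowException(Exception):
    """Stub for airflow.exceptions.AirflowException."""

class AirflowSkipException(AirflowException):
    """Stub for airflow.exceptions.AirflowSkipException."""

def fail_or_skip(message: str, soft_fail: bool) -> None:
    # Before these PRs, many sensors raised AirflowException
    # unconditionally here, so soft_fail had no effect on error paths.
    if soft_fail:
        raise AirflowSkipException(message)
    raise AirflowException(message)

try:
    fail_or_skip("endpoint returned 500", soft_fail=True)
except AirflowSkipException:   # must come before the parent class
    outcome = "skipped"
except AirflowException:
    outcome = "failed"
print(outcome)  # skipped
```

The tedious part was not the pattern itself but auditing every `raise` in every provider's sensors, which is why this group spans so many providers.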

Add default_deferrable configuration for easily turning on the deferrable mode of operators

  1. build(pre-commit): add list of supported deferrable operators to doc
  2. build(pre-commit): check deferrable default value
  3. Add default_deferrable config (PR of the month)
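With this config in place, operators that support deferral can default to deferrable mode fleet-wide instead of flipping `deferrable=True` in every DAG file. The option lives in the `[operators]` section of `airflow.cfg`:

```ini
[operators]
# When true, operators and sensors that support deferral default to
# deferrable mode without any per-task change.
default_deferrable = true
```

The accompanying pre-commit hooks above exist to keep this honest: they check that deferrable operators actually read this config for their default, and that the documented list of supported operators stays up to date.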

Security improvements

  1. Disable rendering for doc_md
  2. check whether AUTH_ROLE_PUBLIC is set in check_authentication
  3. fix(api_connexion): handle the cases that webserver.expose_config is set to "non-sensitive-only" instead of boolean value

Misc (core)

  1. catch sentry flush if exception happens in _execute_in_fork finally block
  2. add PID and return code to _execute_in_fork logging
  3. add missing conn_id to string representation of ObjectStoragePath
  4. Enable "airflow tasks test" to run deferrable operator
  5. remove "to backfill" from --task-regex argument help message
  6. fix(sensors): move trigger initialization from __init___ to execute
  7. Ship zombie info
  8. Catch the exception that triggerer initialization failed
  9. feat(jobs/triggerer_job_runner): add triggerer canceled log
  10. fixing circular import error in providers caused by airflow version check

Misc (provider)

  1. add default gcp_conn_id to GoogleBaseAsyncHook
  2. remove unexpected argument pod in read_namespaced_pod_log call
  3. fix wrong payload set when reuse_existing_run set to True in DbtCloudRunJobOperator
  4. migrate to dbt v3 api for project endpoints
  5. Replace pod_manager.read_pod_logs with client.read_namespaced_pod_log in KubernetesPodOperator._write_logs
  6. allow providing credentials through keyword argument in AzureKeyVaultBackend
  7. Fix outdated test name and description in BatchSensor
  8. add deprecation warning to DATAPROC_JOB_LOG_LINK
  9. Alias DATAPROC_JOB_LOG_LINK to DATAPROC_JOB_LINK
  10. Remove execute function of DatabricksRunNowDeferrableOperator
  11. Add missing execute_complete method for DatabricksRunNowOperator
  12. refresh connection if an exception is caught in "AzureDataFactory"
  13. feat(providers/azure): cancel pipeline if unexpected exception caught
  14. fix(providers/amazon): handle missing LogUri in emr describe_cluster API response
  15. merge AzureDataFactoryPipelineRunStatusAsyncSensor to AzureDataFactoryPipelineRunStatusSensor
  16. merge BigQueryTableExistenceAsyncSensor into BigQueryTableExistenceSensor
  17. Merge BigQueryTableExistencePartitionAsyncSensor into BigQueryTableExistencePartitionSensor
  18. Merge DbtCloudJobRunAsyncSensor logic to DbtCloudJobRunSensor
  19. Merge GCSObjectExistenceAsyncSensor logic to GCSObjectExistenceSensor

Misc (doc only)

  1. Add in Trove classifiers Python 3.12 support
  2. add Wei Lee to committer list (This is my 133rd PR)
  3. Erd generating doc improvement
  4. fix rst code block format
  5. docs(core-airflow): replace markdown style link with rst style link
  6. docs(CONTRIBUTING): replace markdown style link with rst style link
  7. docs: fix partial doc reference error due to missing space
  8. docs(deferring): add type annotation to code examples
  9. add a note that we'll need to restart triggerer to reflect any trigger change

Before joining Astronomer

  1. update contributing documentations